DocNow / twarc-csv

A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.
MIT License

twarc2 csv _csv.Error: need to escape, but no escapechar set #37

Closed rogerschoen closed 1 year ago

rogerschoen commented 2 years ago

Hi, I'm trying to run twarc-csv on a jsonl file obtained through the Academic API. I'm using a new MacBook Pro with an M1 chip, and I run this command: twarc2 csv result.jsonl result.csv

It gets stuck at 37% every time with the output below. Is this a known error? Am I doing something wrong? Thank you in advance.

37%|█████▉ | Processed 286M/766M of input file [00:33<00:37, 13.5MB/s]
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/bin/twarc2", line 8, in <module>
    sys.exit(twarc2())
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/twarc_csv.py", line 148, in csv
    writer.process()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/csv_writer.py", line 81, in process
    self._write_output(self.converter.process(batch), first_batch)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/csv_writer.py", line 65, in _write_output
    _df.to_csv(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/generic.py", line 3466, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1105, in to_csv
    csv_formatter.save()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 257, in save
    self._save()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 262, in _save
    self._save_body()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 300, in _save_body
    self._save_chunk(start_i, end_i)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 311, in _save_chunk
    libwriters.write_csv_rows(
  File "pandas/_libs/writers.pyx", line 72, in pandas._libs.writers.write_csv_rows
_csv.Error: need to escape, but no escapechar set
37%|█████▉ | Processed 286M/766M of input file [00:33<00:56, 8.83MB/s]

igorbrigadir commented 2 years ago

I'll have a look! Any way you can send on a small sample of the file, where the error occurs? Does twarc.log have the error line? Also for reference, what version of pandas do you have? pip list should show all.
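One possible way to cut a small sample from a large jsonl file is to read a few complete lines near the byte offset where the progress bar stopped. The sketch below is only an illustration: `sample_lines_near` is a hypothetical helper, not part of twarc or twarc-csv, and because twarc-csv processes the input in batches, the offset shown by the progress bar is only an approximate guide to where the failing tweet sits.

```python
def sample_lines_near(path, byte_offset, count=5, lookback=0):
    """Return up to `count` complete lines starting near `byte_offset`.

    Hypothetical helper for illustration; `lookback` moves the starting
    point earlier by that many bytes to allow for batching.
    """
    start = max(0, byte_offset - lookback)
    lines = []
    with open(path, "rb") as handle:
        handle.seek(start)
        if start > 0:
            # We most likely landed mid-line, so skip to the next line start.
            handle.readline()
        for _ in range(count):
            line = handle.readline()
            if not line:
                break  # reached the end of the file
            lines.append(line.decode("utf-8"))
    return lines
```

For the run above, something like `sample_lines_near("result.jsonl", 286_000_000, count=3, lookback=5_000_000)` would be a starting point; the exact numbers are guesses and would need adjusting.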

rogerschoen commented 2 years ago

Hi Igor,

Thanks so much for your reply and in advance for your help.

I have Pandas 1.3.4

Twarc log has only

I’ve cut and pasted a small piece of the jsonl below:

{"data": [{"source": "Twitter for Android", "conversation_id": "1458495011184680976", "entities": {"urls": [{"start": 41, "end": 64, "url": "https://t.co/mZbudWn3eK", "expanded_url": "https://twitter.com/OlivierCadic/status/1458485325328748547", "display_url": "twitter.com/OlivierCadic/s\u2026"}], "hashtags": [{"start": 8, "end": 25, "tag": "TaiwanIsNotChina"}, {"start": 26, "end": 40, "tag": "Taiwancanhelp"}]}, "text": "Merci ! #TaiwanIsNotChina #Taiwancanhelp https://t.co/mZbudWn3eK", "lang": "fr", "author_id": "759942703", "context_annotations": [{"domain": {"id": "45", "name": "Brand Vertical", "description": "Top level entities that describe a Brands industry"}, "entity": {"id": "781974597310615553", "name": "Entertainment"}}, {"domain": {"id": "46", "name": "Brand Category", "description": "Categories within Brand Verticals that narrow down the scope of Brands"}, "entity": {"id": "781974597105094656", "name": "TV/Movies Related"}}, {"domain": {"id": "47", "name": "Brand", "description": "Brands and Companies"}, "entity": {"id": "1148917000657260548", "name": "FRANCE 24 (\u0641\u0631\u0627\u0646\u0633 24)"}}], "public_metrics": {"retweet_count": 0, "reply_count": 0, "like_count": 0, "quote_count": 0}, "id": "1458495011184680976", "referenced_tweets": [{"type": "quoted", "id": "1458485325328748547"}], "possibly_sensitive": false, "reply_settings": "everyone", "created_at": "2021-11-10T18:01:22.000Z"}, {"source": "Twitter for Android", "conversation_id": "1458489858964475913", "entities": {"urls": [{"start": 65, "end": 88, "url": "https://t.co/axB6MOO7UK", "expanded_url": "https://twitter.com/Taiwan_in_UK/status/1458471909709000714", "display_url": "twitter.com/Taiwan_in_UK/s\u2026"}], "hashtags": [{"start": 0, "end": 21, "tag": "TogetherForOurPlanet"}, {"start": 23, "end": 38, "tag": "OneStepGreener"}, {"start": 40, "end": 47, "tag": "UNFCCC"}, {"start": 48, "end": 62, "tag": "TaiwanCanHelp"}]}, "text": "#TogetherForOurPlanet \n#OneStepGreener \n#UNFCCC 
#TaiwanCanHelp\ud83c\uddf9\ud83c\uddfc https://t.co/axB6MOO7UK", "lang": "und", "author_id": "1072845631830585344", "public_metrics": {"retweet_count": 0, "reply_count": 0, "like_count": 3, "quote_count": 0}, "id": "1458489858964475913", "referenced_tweets": [{"type": "quoted", "id": "1458471909709000714"}], "possibly_sensitive": false, "geo": {"place_id": "7d588036fe12e124"}, "reply_settings": "everyone", "created_at": "2021-11-10T17:40:54.000Z"}, {"source": "Twitter for Android", "conversation_id": "1458477537630240773", "text": "RT @ljmabon: Look what I found! Taiwan has a history of sponsoring the public transport infrastructure in the COP host cities, and Glasgow\u2026", "lang": "en", "entities": {"mentions": [{"start": 3, "end": 11, "username": "ljmabon", "id": "547968016"}], "annotations": [{"start": 32, "end": 37, "probability": 0.9737, "type": "Place", "normalized_text": "Taiwan"}, {"start": 131, "end": 137, "probability": 0.9551, "type": "Place", "normalized_text": "Glasgow"}]}, "author_id": "1391310739210653696", "public_metrics": {"retweet_count": 4, "reply_count": 0, "like_count": 0, "quote_count": 0}, "id": "1458477537630240773", "referenced_tweets": [{"type": "retweeted", "id": "1458411117403942916"}], "possibly_sensitive": false, "reply_settings": "everyone", "created_at": "2021-11-10T16:51:56.000Z"}, {"source": "Twitter Web App", "conversation_id": "1458464952075833350", "text": "RT @ljmabon: Look what I found! 
Taiwan has a history of sponsoring the public transport infrastructure in the COP host cities, and Glasgow\u2026", "lang": "en", "entities": {"mentions": [{"start": 3, "end": 11, "username": "ljmabon", "id": "547968016"}], "annotations": [{"start": 32, "end": 37, "probability": 0.9737, "type": "Place", "normalized_text": "Taiwan"}, {"start": 131, "end": 137, "probability": 0.9551, "type": "Place", "normalized_text": "Glasgow"}]}, "author_id": "819520019335880704", "public_metrics": {"retweet_count": 4, "reply_count": 0, "like_count": 0, "quote_count": 0}, "id": "1458464952075833350", "referenced_tweets": [{"type": "retweeted", "id": "1458411117403942916"}], "possibly_sensitive": false, "reply_settings": "everyone", "created_at": "2021-11-10T16:01:56.000Z"}, {"source": "Twitter for Android", "conversation_id": "1458441661395132429", "text": @.*** @mikepompeo @USMC Don't be serious \ud83d\ude05 \nMAGA! \nTrump2024\n#TaiwanCanHelp", "lang": "en", "entities": {"mentions": [{"start": 0, "end": 15, "username": "DeeDee17235993", "id": "1300943415002177537"}, {"start": 16, "end": 27, "username": "mikepompeo", "id": "1163992520252153857"}, {"start": 28, "end": 33, "username": "USMC", "id": "10126672"}], "hashtags": [{"start": 71, "end": 85, "tag": "TaiwanCanHelp"}]}, "author_id": "754630258658160640", "context_annotations": [{"domain": {"id": "45", "name": "Brand Vertical", "description": "Top level entities that describe a Brands industry"}, "entity": {"id": "781974596157251587", "name": "Government/Education"}}, {"domain": {"id": "46", "name": "Brand Category", "description": "Categories within Brand Verticals that narrow down the scope of Brands"}, "entity": {"id": "781974597226729473", "name": "Non-profit"}}, {"domain": {"id": "47", "name": "Brand", "description": "Brands and Companies"}, "entity": {"id": "10024011845", "name": "U.S. 
Marine Corps"}}], "public_metrics": {"retweet_count": 0, "reply_count": 0, "like_count": 1, "quote_count": 0}, "id": "1458443194761289728", "referenced_tweets": [{"type": "replied_to", "id": "1458442838946066443"}], "possibly_sensitive": false, "reply_settings": "everyone", "in_reply_to_user_id": "1300943415002177537", "created_at": "2021-11-10T14:35:28.000Z"}, {"source": "Twitter for iPhone", "conversation_id": "1458431390857396226", "entities": {"urls": [{"start": 248, "end": 271, "url": "https://t.co/U30O5STIjx", "expanded_url": "https://time.com/collection/best-inventions-2021/6112621/paper-shoot-camera/", "display_url": "time.com/collection/bes\u2026", "images": [{"url": "https://pbs.twimg.com/news_img/1458414004112560135/khRXfZlb?format=jpg&name=orig", "width": 1024, "height": 512}, {"url": "https://pbs.twimg.com/news_img/1458414004112560135/khRXfZlb?format=jpg&name=150x150", "width": 150, "height": 150}], "status": 200, "title": "Paper Shoot Camera: The 100 Best Inventions of 2021", "description": "Find out why Paper Shoot Camera made this year's list", "unwound_url": "https://time.com/collection/best-inventions-2021/6112621/paper-shoot-camera/"}], "annotations": [{"start": 168, "end": 177, "probability": 0.8248, "type": "Person", "normalized_text": "George Lin"}, {"start": 182, "end": 187, "probability": 0.8295, "type": "Place", "normalized_text": "Taiw\u00e1n"}], "hashtags": [{"start": 11, "end": 16, "tag": "time"}, {"start": 47, "end": 58, "tag": "papershoot"}, {"start": 190, "end": 203, "tag": "madeintaiwan"}, {"start": 204, "end": 220, "tag": "papershootspain"}, {"start": 221, "end": 235, "tag": "taiwancanhelp"}, {"start": 236, "end": 247, "tag": "fotografia"}]}, "text": "La revista #time ha considerado nuestra c\u00e1mara #papershoot una de las mejores invenciones del a\u00f1o 2021 en todo el mundo. \u201cUna c\u00e1mara para cambiar el mundo\u201d nuestro CEO George Lin en Taiw\u00e1n. 
#madeintaiwan #papershootspain #taiwancanhelp #fotografia https://t.co/U30O5STIjx", "lang": "es", "author_id": "1379397731555287041", "context_annotations": [{"domain": {"id": "30", "name": "Entities [Entity Service]", "description": "Entity Service top level domain, every item that is in Entity Service should be in this domain"}, "entity": {"id": "847868745150119936", "name": "Home & family", "description": "Hobbies and interests"}}, {"domain": {"id": "67", "name": "Interests and Hobbies", "description": "Interests, opinions, and behaviors of individuals, groups, or cultures; like Speciality Cooking or Theme Parks"}, "entity": {"id": "847869714860605440", "name": "Photography", "description": "Photography"}}], "public_metrics": {"retweet_count": 0, "reply_count": 0, "like_count": 1, "quote_count": 1}, "id": "1458431390857396226", "possibly_sensitive": false, "reply_settings": "everyone", "created_at": "2021-11-10T13:48:34.000Z"}, {"source": "Twitter for iPhone", "conversation_id": "1458425461743054848", "text": "RT @ljmabon: Look what I found! Taiwan has a history of sponsoring the public transport infrastructure in the COP host cities, and Glasgow\u2026", "lang": "en", "entities": {"mentions": [{"start": 3, "end": 11, "username": "ljmabon", "id": "547968016"}], "annotations": [{"start": 32, "end": 37, "probability": 0.9737, "type": "Place", "normalized_text": "Taiwan"}, {"start": 131, "end": 137, "probability": 0.9551, "type": "Place", "normalized_text": "Glasgow"}]}, "author_id": "587761233", "public_metrics": {"retweet_count": 4, "reply_count": 0, "like_count": 0, "quote_count": 0}, "id": "1458425461743054848", "referenced_tweets": [{"type": "retweeted", "id": "1458411117403942916"}], "possibly_sensitive": false, "reply_settings": "everyone", "created_at": "2021-11-10T13:25:00.000Z"}, {"source": "Twitter Web App", "conversation_id": "1458411745819648001", "text": "RT @ljmabon: Look what I found! 
Taiwan has a history of sponsoring the public transport infrastructure in the COP host cities, and Glasgow\u2026", "lang": "en", "entities": {"mentions": [{"start": 3, "end": 11, "username": "ljmabon", "id": "547968016"}], "annotations": [{"start": 32, "end": 37, "probability": 0.9737, "type": "Place", "normalized_text": "Taiwan"}, {"start": 131, "end": 137, "probability": 0.9551, "type": "Place", "normalized_text": "Glasgow"}]}, "author_id": "5045421", "public_metrics": {"retweet_count": 4, "reply_count": 0, "like_count": 0, "quote_count": 0}, "id": "1458411745819648001", "referenced_tweets": [{"type": "retweeted", "id": "1458411117403942916"}], "possibly_sensitive": false, "reply_settings": "everyone", "created_at": "2021-11-10T12:30:30.000Z"}, {"attachments": {"media_keys": ["3_1458411111787675653"]}, "source": "Twitter for iPhone", "conversation_id": "1458411117403942916", "entities": {"urls": [{"start": 178, "end": 201, "url": "https://t.co/Q1E099cMG8", "expanded_url": "https://twitter.com/ljmabon/status/1458411117403942916/photo/1", "display_url": "pic.twitter.com/Q1E099cMG8"}], "annotations": [{"start": 19, "end": 24, "probability": 0.9689, "type": "Place", "normalized_text": "Taiwan"}, {"start": 118, "end": 124, "probability": 0.9517, "type": "Place", "normalized_text": "Glasgow"}], "hashtags": [{"start": 126, "end": 132, "tag": "COP26"}, {"start": 163, "end": 177, "tag": "TaiwanCanHelp"}]}, "text": "Look what I found! Taiwan has a history of sponsoring the public transport infrastructure in the COP host cities, and Glasgow #COP26 is no exception. Great work! 
#TaiwanCanHelp https://t.co/Q1E099cMG8", "lang": "en", "author_id": "547968016", "public_metrics": {"retweet_count": 4, "reply_count": 0, "like_count": 20, "quote_count": 0}, "id": "1458411117403942916", "possibly_sensitive": false, "reply_settings": "everyone", "created_at": "2021-11-10T12:28:01.000Z"}, {"source": "Twitter Web App", "conversation_id": "1458283964683096066", "entities": {"urls": [{"start": 203, "end": 226, "url": "https://t.co/m0x05YL1Pf", "expanded_url": "https://journalnews.com.ph/combating-cybercrime-in-the-post-pandemic-era-taiwan-can-help/", "display_url": "journalnews.com.ph/combating-cybe\u2026"}], "mentions": [{"start": 184, "end": 200, "username": "JournalOnlinePH", "id": "3768609792"}], "hashtags": [{"start": 11, "end": 22, "tag": "cybercrime"}, {"start": 30, "end": 43, "tag": "postpandemic"}, {"start": 49, "end": 63, "tag": "Taiwancanhelp"}, {"start": 110, "end": 119, "tag": "INTERPOL"}, {"start": 120, "end": 127, "tag": "Taiwan"}, {"start": 130, "end": 137, "tag": "Police"}, {"start": 138, "end": 147, "tag": "Pandemic"}, {"start": 148, "end": 156, "tag": "COVID19"}]}, "text": "\ud83d\udce3Combating #cybercrime in the #postpandemic era: #Taiwancanhelp\n\n(\u83f2\u5f8b\u8cd3\u5a92\u9ad4\u5168\u6587\u520a\u767b\u5167\u653f\u90e8\u8b66\u653f\u7f72\u5211\u4e8b\u8b66\u5bdf\u5c40\u9ec3\u5c40\u9577\u5609\u797f\u95dc\u65bc\u6211\u570b\u53c3\u8207\u300c\u570b\u969b\u5211\u8b66\u7d44\u7e54\u300d\u5c08\u6587)\n\n#INTERPOL #Taiwan\ud83c\uddf9\ud83c\uddfc #Police #Pandemic #COVID19\n\n(11/9\u4eba\u6c11\u665a\u5831People\u2019s Tonight @JournalOnlinePH)\n\nhttps://t.co/m0x05YL1Pf", "lang": "zh", "author_id": "1231776088834985989", "context_annotations": [{"domain": {"id": "65", "name": "Interests and Hobbies Vertical", "description": "Top level interests and hobbies groupings, like Food or Travel"}, "entity": {"id": "848920371311001600", "name": "Technology", "description": "Technology and computing"}}, {"domain": {"id": "30", "name": "Entities 
[Entity Service]", "description": "Entity Service top level domain, every item that is in Entity Service should be in this domain"}, "entity": {"id": "898650876658634752", "name": "Cybersecurity", "description": "Cybersecurity"}}, {"domain": {"id": "123", "name": "Ongoing News Story", "description": "Ongoing News Stories like 'Brexit'"}, "entity": {"id": "1220701888179359745", "name": "COVID-19"}}], "public_metrics": {"retweet_count": 0, "reply_count": 0, "like_count": 0, "quote_count": 0}, "id": "1458283964683096066", "possibly_sensitive": false, "reply_settings": "everyone", "created_at": "2021-11-10T04:02:45.000Z"}, {"attachments": {"media_keys": ["3_1458109564155756544"]},

rogerschoen commented 2 years ago

Hi Igor, I realize I forgot to paste the info from Twarc.log. All I have there is:

2021-11-21 20:54:26,172 INFO using config /Users/rogerschoen/Library/Application Support/twarc/config
2021-11-21 20:54:26,173 INFO creating HTTP session headers for app auth.

Same problem when I try other files. Any help would be greatly appreciated. I'm a newbie with R and completely stuck at getting the jsonl created by Twarc into it.

igorbrigadir commented 2 years ago

Thanks! I still can't reproduce it, but maybe with a larger sample I could get it? I have a feeling it's a relatively straightforward fix once the awkward element that throws the error is identified.
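For context, the message in the traceback comes from Python's csv writer, which pandas calls when writing rows. The sketch below reproduces the same message with the standard library alone, under the assumption that quoting is disabled and no escape character is set. It is not a diagnosis of what twarc-csv does; `try_write` and the sample row are made up for illustration.

```python
import csv
import io


def try_write(row, **writer_options):
    """Write one row with csv.writer; return the CSV text or the error message."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, **writer_options)
    try:
        writer.writerow(row)
    except csv.Error as error:
        return f"error: {error}"
    return buffer.getvalue()


# Illustrative row: the second field contains the delimiter (a comma).
row = ["1458495011184680976", "Merci, Taiwan"]

# Quoting disabled and no escape character: the comma cannot be written safely.
unescaped = try_write(row, quoting=csv.QUOTE_NONE)

# Same settings plus an escape character: the comma is escaped instead.
escaped = try_write(row, quoting=csv.QUOTE_NONE, escapechar="\\")

print(unescaped)
print(escaped)
```

The first call reports the "need to escape, but no escapechar set" error, while the second writes the row with the comma escaped by a backslash.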

igorbrigadir commented 2 years ago

I think this happens with Python 3.10; I'm following up here too: https://twittercommunity.com/t/trouble-working-with-twarc2-jsonl/162527/4?u=igorbrigadir. The solution might be to use an older Python version (3.9 or 3.8), but I need to test this and try to make it compatible with 3.10.

rogerschoen commented 2 years ago

Thanks! I’m trying your suggestion with Python 3.9.1. I'm still working on getting that set up with pyenv. I’ll let you know if it works. Thank you again for your help.

DerekHilCameron commented 1 year ago

Any updates on this? Is the best option still to run it in Python 3.9?

igorbrigadir commented 1 year ago

I haven't tested it with 3.10 or 3.11 yet, but will do. For now, the best way is to make a 3.9 environment, I think! If you manage to get it to work in 3.11 or 3.10, PRs welcome!

igorbrigadir commented 1 year ago

Now that some time has passed, is this still happening? @DerekHilCameron @rogerschoen

I got some time to dig in but couldn't reproduce with Python 3.11.0, 3.10.8, or 3.9.15. I don't have a Mac though, so if that's the issue it would be great to have someone help debug that.

The pip list I end up with is:

Package            Version
------------------ ---------
certifi            2022.9.24
charset-normalizer 2.1.1
click              8.1.3
click-config-file  0.6.0
click-plugins      1.1.1
configobj          5.0.6
humanize           4.4.0
idna               3.4
more-itertools     9.0.0
numpy              1.23.4
oauthlib           3.2.2
pandas             1.5.1
pip                22.3.1
python-dateutil    2.8.2
pytz               2022.6
requests           2.28.1
requests-oauthlib  1.3.1
setuptools         58.1.0
six                1.16.0
tqdm               4.64.1
twarc              2.12.0
twarc-csv          0.6.0
urllib3            1.26.12

Since a new version of pandas is also out, it's worth trying again to see if this error still exists.
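To compare a local environment against the list above, a short script can look up installed versions with the standard library's importlib.metadata (Python 3.8+). This is a minimal sketch: `EXPECTED` copies only a few entries from the list, and `installed_version` and `find_mismatches` are hypothetical helper names.

```python
from importlib import metadata

# A subset of the versions from the pip list above; extend as needed.
EXPECTED = {
    "pandas": "1.5.1",
    "numpy": "1.23.4",
    "twarc": "2.12.0",
    "twarc-csv": "0.6.0",
    "charset-normalizer": "2.1.1",
    "idna": "3.4",
}


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each differing package name to a (wanted, found) version pair."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

A version difference does not necessarily mean a package is at fault; the output is only a list of places to look first.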

DerekHilCameron commented 1 year ago

Hi Igor, I am running a job here with Sam's suggestion in the #661 thread first, but can fiddle around with it.

DerekHilCameron commented 1 year ago

Okay, I set the environment with these packages and versions. I am going to retry an earlier job that failed. There were a couple packages that did need updating. I think I had incompatible charset-normalizer and idna versions. Fingers crossed.

DerekHilCameron commented 1 year ago

@igorbrigadir This fixed this issue. Must have just had a slight mismatch in packages contributing to it.

igorbrigadir commented 1 year ago

Great! Feel free to reopen if it crops up again!