JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

UnicodeEncodeError on Windows command prompt when UTF-8 output is produced #122

Closed AnaMenezes01 closed 4 years ago

AnaMenezes01 commented 4 years ago

Hi,

I am running the following code: snscrape --format {date},{url},{user.location} twitter-search "covid since:2020-06-01 until:2020-06-02" > June_01.csv

But I am getting the following error: UnicodeEncodeError: 'charmap' codec can't encode characters in position 106-107: character maps to <undefined>

I had no issues running --jsonl, but processing the full data took a long time. So it would be great if I could extract just what I need with --format.

Thank you in advance!

JustAnotherArchivist commented 4 years ago

Unless you can tell me which tweet caused this (should be in the dump file), I won't be able to look into this until Twitter fixes their search (#123).

JustAnotherArchivist commented 4 years ago

I just ran that exact command with the current dev version. No error, 430994 results.

JustAnotherArchivist commented 4 years ago

Are you using Windows's command prompt? According to a web search, the error mostly comes up in relation to that because cmd does not use UTF-8 by default.
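To illustrate what that error means, here is a minimal sketch that should reproduce the same class of failure on any platform. It assumes the redirected output was being encoded with a legacy Windows code page such as cp1252, which I can't confirm for this particular machine. The sample text and the `try_encode` helper are made up for illustration and are not the tweet from the dump file.

```python
# Hypothetical sample text with characters that cp1252 cannot represent.
sample = "covid update \U0001F637 \u4f60\u597d"


def try_encode(text: str, encoding: str) -> str:
    """Return 'ok' if the text can be encoded, else the codec's error message."""
    try:
        text.encode(encoding)
        return "ok"
    except UnicodeEncodeError as exc:
        return str(exc)


cp1252_result = try_encode(sample, "cp1252")  # legacy code page (assumed)
utf8_result = try_encode(sample, "utf-8")     # can encode any Unicode text

print("cp1252:", cp1252_result)
print("utf-8: ", utf8_result)
```

The cp1252 attempt should fail with a 'charmap' codec message like the one reported above, while the UTF-8 attempt succeeds.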

AnaMenezes01 commented 4 years ago

Hi @JustAnotherArchivist,

Yes, I am on Windows. How should I go about fixing this issue? I have snscrape version 0.3.5.dev66+gc4a5715

And to answer your previous question, the code stops running at case 17.

Thank you very much!

JustAnotherArchivist commented 4 years ago

The internet suggests that running chcp 65001 before snscrape might fix it, but I can't vouch for that since I (fortunately) haven't used Windows in many years.
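Another option that might be worth trying is CPython's documented PYTHONIOENCODING environment variable, which overrides the encoding Python uses for stdout. I have not tested it with snscrape on Windows, so treat the following as a sketch only. The child program below is a stand-in that just prints non-cp1252 text; it is not snscrape.

```python
import os
import subprocess
import sys

# Stand-in for the real command: a child Python process that prints
# characters outside cp1252. The escapes are interpreted by the child.
child_code = r"print('caf\u00e9 \U0001F637')"

# Copy the current environment and force UTF-8 for the child's stdio.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    capture_output=True,  # collect stdout as raw bytes
    check=True,           # raise if the child exits with an error
)

# The captured bytes should be valid UTF-8 whatever the console code page is.
decoded = result.stdout.decode("utf-8").strip()
print(ascii(decoded))  # ascii() keeps this print safe on any console
```

In cmd, the equivalent would presumably be running set PYTHONIOENCODING=utf-8 before the snscrape command, again with the caveat that this is untested here.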

I'm closing this issue since it's an issue with the environment, not snscrape (or even Python), but feel free to ask further questions if needed; maybe someone else can help you.

AnaMenezes01 commented 4 years ago

Hi @JustAnotherArchivist,

Thank you very much for the help!

For anyone that runs into the same issue as I did, follow the instructions here to fix it: https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window