DocNow / hydrator

Turn Tweet IDs into Twitter JSON & CSV from your desktop!

Encoding problem #69

Status: Open. LeonardoSanna opened this issue 3 years ago

LeonardoSanna commented 3 years ago

Hello, I have a pretty large dataset (> 2 TB) split into six files.

I assumed that UTF-8 was the text encoding of JSONL files. However, there are some characters that are apparently not UTF-8, and this causes R to fail when I specify the encoding.

Not specifying the encoding results in a messy full_text output.
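
The failing import looks roughly like this (a minimal sketch; stream_in is from the jsonlite package and "tweets.jsonl" is a placeholder for one of the six files):

```r
library(jsonlite)

# Forcing UTF-8 on the connection trips over the invalid bytes
con <- file("tweets.jsonl", open = "r", encoding = "UTF-8")
tweets <- stream_in(con)
# Error in FUN(X[[i]], ...) : invalid multibyte string, element 1
close(con)
```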

edsu commented 3 years ago

Do you have an example?

LeonardoSanna commented 3 years ago

Update: I found a workaround:

1) Import the file in R without specifying an encoding. 2) Clean the data. 3) Export to a UTF-8 CSV.

The problem was the function stream_in producing the error "Error in FUN(X[[i]], ...) : invalid multibyte string, element 1" while streaming the JSON file into a data frame.

Not specifying the encoding while importing solves the issue, though fileEncoding = "UTF-8" must be specified when writing the output file.
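
A minimal sketch of the workaround (assuming the jsonlite package; the file names and the selected columns are placeholders):

```r
library(jsonlite)

# 1) Import without specifying an encoding on the connection
con <- file("tweets.jsonl", open = "r")
tweets <- stream_in(con)
close(con)

# 2) Clean the data (project-specific, omitted here)

# 3) Export as CSV, specifying UTF-8 only on output
write.csv(tweets[, c("id_str", "full_text")], "tweets_clean.csv",
          row.names = FALSE, fileEncoding = "UTF-8")
```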

However, there are still some weird characters in "full_text", I think because of emojis.

These are Unicode emojis and I'm OK with that:

RT @ScottAnthonyUSA: <U+26A0><U+FE0F> IT SHOULD BE NOTED that the CDC initially had an embargo placed on CDC testimony. The TRUMP ADMINISTRATION LIFTED…

But what about this? iOS emoji?

RT @StocksUnhinged: $SPY $AAL $APT $EXPE $GOOG $DAL $UAL $BA $LAKE $YUM $CMG $HUM $CI #CDC expected to announce first US case of #Wuhan…
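
To check what those characters actually are, the codepoints can be inspected directly (a sketch; the tweets data frame and column name come from the import above):

```r
# Pick one suspicious full_text value and look at its raw codepoints
txt <- tweets$full_text[1]
utf8ToInt(txt)    # integer Unicode codepoints in the string
Encoding(txt)     # declared encoding: "UTF-8", "latin1", or "unknown"
validUTF8(txt)    # TRUE if the underlying bytes are valid UTF-8

# <U+26A0><U+FE0F> is just how R prints U+26A0 WARNING SIGN plus
# U+FE0F VARIATION SELECTOR-16 when the console cannot render them
intToUtf8(c(0x26A0, 0xFE0F))
```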

edsu commented 3 years ago

If you can give me a tweet ID, that will help me test.

LeonardoSanna commented 3 years ago

> If you can give me a tweet ID, that will help me test.

1219771346768596992 is the one with the dollar signs.