LeonardoSanna opened this issue 3 years ago
Do you have an example?
Update: I found a workaround:
1) Import the file in R without specifying an encoding. 2) Clean the data. 3) Export to a UTF-8 CSV.
The problem was with the function stream_in
producing the error Error in FUN(X[[i]], ...) : invalid multibyte string, element 1
while streaming the JSON file into a data frame.
Not specifying an encoding while importing solves the issue, though fileEncoding = "UTF-8"
must be specified when writing the outfile.
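A minimal sketch of that workaround, assuming the jsonlite package; the file names and the tiny two-line sample standing in for the real multi-GB file are hypothetical:

```r
library(jsonlite)

# Hypothetical stand-in for the real jsonl file.
writeLines(c('{"id":1,"full_text":"hello"}',
             '{"id":2,"full_text":"world"}'),
           "tweets.jsonl")

# Step 1: stream the lines in WITHOUT specifying an encoding;
# forcing one is what triggered "invalid multibyte string".
df <- stream_in(file("tweets.jsonl"), verbose = FALSE)

# Step 2: clean the data here (placeholder).

# Step 3: export, declaring UTF-8 explicitly on the way out.
write.csv(df, "tweets_clean.csv", row.names = FALSE,
          fileEncoding = "UTF-8")
```

The key point is that the encoding is only declared on output, never on input.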
However, there are still some weird characters under "full_text", I think because of emojis.
These are Unicode emojis and I'm OK with that:
RT @ScottAnthonyUSA: <U+26A0><U+FE0F> IT SHOULD BE NOTED that the CDC initially had an embargo placed on CDC testimony. The TRUMP ADMINISTRATION LIFTED…
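For reference, <U+26A0><U+FE0F> is just how R prints characters it cannot represent in the current locale; the underlying data is intact. A quick check of those two code points (the escape sequence below is my assumption about what the tweet contains):

```r
# WARNING SIGN (U+26A0) followed by VARIATION SELECTOR-16 (U+FE0F),
# which together render as the emoji-style warning sign.
x <- "\u26A0\uFE0F"

utf8ToInt(x)   # 9888 65039, i.e. U+26A0 and U+FE0F
```

So the <U+26A0><U+FE0F> run is a display artifact of a non-UTF-8 locale, not corruption in the CSV itself.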
But what about this? iOS emoji?
RT @StocksUnhinged: $SPY $AAL $APT $EXPE $GOOG $DAL $UAL $BA $LAKE $YUM $CMG $HUM $CI #CDC expected to announce first US case of #Wuhan…
If you can give me a tweet id that will help me test.
1219771346768596992
This is the one with the dollar signs.
Hello, I have a pretty large dataset (> 2 TB) split into six files.
I assumed that UTF-8 was the text encoding of the jsonl files. However, there are some characters that apparently are not UTF-8, and this causes R to fail when I specify the encoding.
Not specifying the encoding results in a messy full_text output.
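One way to locate and repair the offending bytes before streaming, sketched under the assumption that the files are mostly UTF-8 with a few stray bytes (the sample file and names are hypothetical):

```r
# Build a two-line jsonl sample where line 2 contains a stray 0xFF byte,
# which is never valid in UTF-8.
raw_bytes <- c(charToRaw('{"full_text":"ok"}\n{"full_text":"b'),
               as.raw(0xFF),
               charToRaw('ad"}\n'))
writeBin(raw_bytes, "sample.jsonl")

lines <- readLines("sample.jsonl", warn = FALSE)

# Flag the lines that are not valid UTF-8.
bad <- which(!validUTF8(lines))

# Repair instead of failing: sub = "byte" replaces each invalid byte
# with a visible hex escape so the rows stay inspectable.
fixed <- iconv(lines, from = "UTF-8", to = "UTF-8", sub = "byte")

writeLines(fixed, "sample_utf8.jsonl", useBytes = TRUE)
```

Running the cleaned file through stream_in should then work without forcing an encoding, and the flagged line numbers tell you which records to inspect by hand.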