Closed alexhoneker closed 3 years ago
Thanks for the examples! I checked it out:
First, make sure you have the latest version:
pip install --upgrade twarc-csv
And run the CSV conversion again.
Technically, if you're working with UTF-8, you don't need to convert anything - Python should load the German and emoji characters just fine. I did get errors reported with that before, but I couldn't find any errors myself. What environment are you running these commands in? Python notebooks? On Windows or on Mac / Linux?
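As a quick sanity check (a minimal sketch on made-up text, not the twarc pipeline itself), German characters and emoji survive a plain CSV round trip in Python as long as everything stays UTF-8:

```python
import csv
import io

# io.StringIO stands in for a real UTF-8 encoded file on disk.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "text"])
writer.writerow(["1", "Reisewarnung für Kroatien 🇭🇷 ab dem Wochenende"])

# Read it back and confirm the umlaut and emoji come through intact.
buf.seek(0)
rows = list(csv.reader(buf))
print(rows[1][1])
```

With a real file you would pass `encoding="utf-8"` to `open()`; the display problems usually come from the terminal or OS locale, not from Python.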
I don't get the duplicates part - there should be no duplicate tweets in the CSV, as each tweet (keyed by ID) is included only once. If you don't want RTs at all, the best thing is to exclude them from the search using the -is:retweet operator.
--no-inline-referenced-tweets removes any tweets that are found in referenced_tweets - these include the original retweeted tweets. Unfortunately, the retweets nearly always have truncated text; this is a limitation of the API - the full text is available in the original tweet, which is in referenced_tweets. I could go in and add more granular filters for --no-inline-referenced-tweets, but that seemed to overcomplicate things.
That user seems to retweet themselves a lot, which may explain the "duplicates" - these are, in fact, all unique tweets. The script keeps one original tweet plus every subsequent retweet of it that it finds, and each of those retweets is a unique tweet with its own metadata.
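If you do want to drop the truncated RT rows after the fact, one rough approach (a sketch on made-up data; a real twarc-csv export has many more columns) is to filter out rows whose text carries the `RT @` prefix the API puts on retweets:

```python
import pandas as pd

# Toy frame standing in for a twarc-csv export; the "text" column
# name matches the real output, the rows are invented.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "Das Wichtigste ist, dass die Bevölkerung mitmacht",
        "RT @susanneraab_at: Die Einbindung von Expertinnen…",
        "Achtung! Reisewarnung für Kroatien",
    ],
})

# na=False keeps the mask boolean even if the column contains NaN,
# which avoids "bad operand type for unary ~: 'float'" on blank rows.
originals = df[~df["text"].str.startswith("RT @", na=False)]
print(len(originals))  # 2
```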
I tried:
twarc2 search "(from:volkspartei)" --archive --limit 100 OVP.jsonl
twarc2 search "(from:volkspartei) -is:retweet" --archive --limit 100 OVP_no_rt.jsonl
Converted to csv:
twarc2 csv OVP.jsonl OVP.csv
twarc2 csv OVP_no_rt.jsonl OVP_no_rt.csv
Then to load:
import pandas as pd
ovp = pd.read_csv("OVP.csv")
ovp_no_rt = pd.read_csv("OVP_no_rt.csv")
gives me:
>>> ovp.sample(20)['text']
416 „Das Wichtigste ist, dass die Bevölkerung mitm...
506 RT @susanneraab_at: Die Einbindung von Experti...
442 RT @k_edtstadler: Erneut wurde die Grazer Syna...
457 Achtung ❗️\nReisewarnung für Kroatien 🇭🇷 ab de...
...
>>> ovp_no_rt.sample(20)['text']
43 "Wir öffnen damit die Grenzen gegenüber allen ...
486 "Es gibt nur mehr 3 Gründe, das Haus zu verlas...
420 "In Österreich wird die Situation im Pflegeber...
265 “Vereine sind das Herz & der Motor unserer...
480 “Als Bundesregierung sind wir bemüht, alle Vor...
361 “Wir werden Wege finden müssen, das Ende diese...
138 „Bei der zweiten Maßnahme geht es um den Aufba...
...
(My terminal displays unicode)
Also, you can directly specify which columns you want when converting:
If you only want these columns:
OVP[['text','created_at','author.created_at']]
you can run:
twarc2 csv --output-columns "text,created_at,author.created_at" OVP.jsonl OVP.csv
Please reopen and let me know if there's still a problem!
Thank you so much for the quick answer, Igor! I thought I had solved the issue on Friday but I'm still having trouble.
First, let me clarify: by "duplicate tweets" I meant having both the original tweet and the truncated retweet. Also, I'm running these commands in Jupyter Notebook and I have Windows 10.
This is what I tried on Friday as an example and it worked:
Download Tweets:
!twarc2 search "(from:volkspartei)" --archive --limit 20 > OVP_20.jsonl
Convert JSONL file to CSV:
!twarc2 csv --json-encode-text OVP_20.jsonl OVP_20.csv
Convert to data frame to delete blank lines:
OVP_20 = pd.read_csv("OVP_20.csv")
OVP_20['text'] = OVP_20['text'].apply(json.loads)
OVP_20[['text','created_at','author.created_at']]
Filter "text" column starting with "RT" to exclude truncated retweets and keep only original tweet:
OVP_20 = OVP_20[~OVP_20.text.str.startswith("RT")]
OVP_20[['text']]
This gave me what I wanted (readable German characters and emojis, original retweets with full text). However, when I tried the exact same thing but with the full archive search (which is what I need), I get the following errors:
!twarc2 search "(from:volkspartei)" --archive > OVP.jsonl
!twarc2 csv --json-encode-text OVP.jsonl OVP.csv
OVP = pd.read_csv("OVP.csv")
OVP['text'] = OVP['text'].apply(json.loads)
OVP[['text','created_at','author.created_at']]
Here I get this error: "DtypeWarning: Columns (10,21,23,30,33,36,41,42,43,44,45,46,47,48,49,51,53,54,56,57,58,59,60,61,62,63,64,65,70,71,72,74,76) have mixed types.Specify dtype option on import or set low_memory=False. has_raised = await self.run_ast_nodes(code_ast.body, cell_name,"
If I don't include "OVP['text'] = OVP['text'].apply(json.loads)", the code runs, but the German characters do not display, and step #4 above cannot be run due to this error: TypeError: bad operand type for unary ~: 'float'
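For what it's worth, both symptoms can be worked around on the pandas side (a sketch on an in-memory CSV, not a claim about what twarc writes): `low_memory=False` addresses the mixed-type inference warning, and guarding `json.loads` against non-strings keeps blank rows from crashing the apply:

```python
import io
import json
import pandas as pd

# In-memory stand-in for a large twarc CSV produced with
# --json-encode-text, including one blank text cell.
csv_data = io.StringIO(
    'id,text\n'
    '1,"""Wir \\u00f6ffnen damit die Grenzen"""\n'
    '2,\n'
)

# low_memory=False makes pandas infer dtypes from the whole file,
# avoiding the DtypeWarning about mixed-type columns.
df = pd.read_csv(csv_data, low_memory=False)

# Only decode cells that are actually strings; NaN rows pass through.
df["text"] = df["text"].apply(lambda s: json.loads(s) if isinstance(s, str) else s)
print(df["text"][0])
```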
I'm sorry for this long reply, but I'm really lost here! Thank you so much again!
Ah ok, the dtype error is something else, I think - also an error I have to check.
Update:
First error was solved by changing a setting on Windows 10 (see https://scholarslab.github.io/learn-twarc/08-win-region-settings).
Second error (TypeError: bad operand type for unary ~: 'float') was solved by deleting blank rows (why there are blank rows in the first place, I'm not sure):
DataFrame = DataFrame.dropna(axis=0, subset=['text'])
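The same fix on toy data (assumed column names; the NaN row stands in for a blank line in the CSV):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "text": ["erste", np.nan, "dritte"]})

# Drop rows whose text cell is empty before any string operations,
# so boolean masks on "text" stay boolean instead of mixing in floats.
df = df.dropna(axis=0, subset=["text"])
print(len(df))  # 2
```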
Great! Sounds like it's resolved then. It shouldn't write blank lines in the CSV either - I'll upload the new version momentarily. Please reopen if there are more errors!
Hi!
I'm new to Python, so forgive my ignorance. I've been downloading tweets with twarc2. I was including "--no-inline-referenced-tweets" so I don't get duplicate RT entries when exporting to CSV. I just noticed that the RT lines do not include the full text of the RTs (whereas the original tweet does, if I include it).
My problem is that if I don't include "--no-inline-referenced-tweets", then the following code I found here to deal with foreign characters does not work: "df['text'] = df['text'].apply(json.loads)". Is there any way to get both the RT and original tweet lines (I can then delete the duplicates) and keep the character conversion with ".apply(json.loads)"?
Below is my code:
Download @volkspartei tweets
!twarc2 search "(from:volkspartei)" --archive > OVP.jsonl
Convert JSONL file to CSV + eliminate RT duplicates
!twarc2 csv --json-encode-text --no-inline-referenced-tweets OVP.jsonl OVP.csv
Convert to data frame to delete blank lines in CSV file
import pandas as pd
import json
OVP = pd.read_csv("OVP.csv")
Convert German characters and emojis
OVP['text'] = OVP['text'].apply(json.loads)
Check data
OVP[['text','created_at','author.created_at']]
Save back to CSV
OVP.to_csv("OVP_1.csv")
Thanks! Alex