After hydration, I end up with "Tweet IDs Read: 8,145,625 / Tweets Hydrated: 7,583,036", which is to be expected. However, when I opened the CSV in R, I only had 2,586,793 observations. Why is there such a large discrepancy? Does this mean over half the Tweets are not being converted?
Hmm, that's not good. Are you able to share the CSV privately with me at ehs@pobox.com for me to take a look? Also can you point me at the tweet ids?
Sure thing, however the CSV file is 1.5-ish GB, so I'll have to upload it to Google Drive first. Once this is done (in about an hour) I'll send you a link. The Tweet IDs were taken from the following database: https://zenodo.org/record/4726282. I collected every Tweet ID between 7-14 April as .txt files. I then concatenated the individual text files into one text file and ran it through the Hydrator. The JSONL is about 35GB, which is large compared to the CSV. I'm trying to look at ways to load the JSONL into R, but I'm very new to computational work and am struggling to figure out what to do. Thank you for your help!
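In case it helps, this is roughly how I combined the daily files (the paths here are just illustrative):

import glob

# Combine the daily Tweet ID files into a single file for the Hydrator.
# 'ids/*.txt' and 'all_ids.txt' are placeholder paths.
with open('all_ids.txt', 'w') as out:
    for path in sorted(glob.glob('ids/*.txt')):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    out.write(line + '\n')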
I can confirm that there are 2,586,794 rows in the CSV you shared. Could you share the tweet ID file and I will try with the Hydrator too. If you like I can also try hydrating with twarc, which can be a bit more reliable for large datasets.
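For reference, hydrating with twarc from Python looks something like this. This is a minimal sketch assuming twarc 1.x with credentials already set up via twarc configure; the file names are placeholders. The same thing can be done from the command line with: twarc hydrate all_ids.txt > tweets.jsonl

import json
from twarc import Twarc

# With no arguments twarc reads the API credentials saved by "twarc configure"
# (this sketch assumes twarc 1.x; all_ids.txt and tweets.jsonl are placeholders)
t = Twarc()

with open('tweets.jsonl', 'w') as out:
    # hydrate() takes an iterable of tweet ids and yields the full tweet JSON
    for tweet in t.hydrate(open('all_ids.txt')):
        out.write(json.dumps(tweet) + '\n')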
OK, Ed, I'll email the file to you shortly. Does this mean the JSON produced by Hydrator is also corrupt, or only the CSV? I'm hoping to use this dataset for my dissertation, so would appreciate it if you could run it in twarc and send me the CSV. Additionally, do you know why the Tweets were dropped during conversion?
Ok, let me know when you send the tweet ids.
I'm not sure if the JSON is corrupt, but it's possible. One way to check would be to run a little program over it and count the lines that contain a valid JSON object. If you have Python installed and your JSON file is called, for example, tweets.jsonl, you could run a program like this:
import json

count = 0
for line in open('tweets.jsonl'):
    try:
        json.loads(line)  # raises ValueError if the line is not valid JSON
        count += 1
    except ValueError:
        pass  # skip lines that did not parse

print(count)
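If the count it prints is lower than the total number of lines in the file (which you can get with wc -l tweets.jsonl), the difference is the number of corrupt lines.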
Hi Ed,
The tweet IDs should be in the link in my previous email. If it didn't send or open, I'll compress the file and attach it via email. Thank you for the Python script, I'll give it a try. I'm trying to run something similar in R, but I keep encountering memory or vector errors. I'll keep trying! Please let me know if you have any luck downloading the Tweet IDs and converting them on your end.
Best, Tom
I was able to download your hydrated JSON from Google Drive (thanks!). The good news is that it looks intact, with 7,583,036 valid JSON objects. I guess something must have gone wrong when the Hydrator tried to write the data. Perhaps it was interrupted? It would probably have taken a fair amount of time.
If you are interested you could convert the JSON to CSV using the json2csv.py utility. Since you are in a pinch with the research I could run this for you and send you the results.
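In case you want to try it yourself later, the core of what json2csv.py does looks roughly like this. A minimal sketch only: the real utility writes many more columns, and the three picked here are just for illustration.

import csv
import json

# Minimal illustration of JSONL-to-CSV conversion; the real json2csv.py
# utility writes many more columns than the three shown here.
with open('tweets.jsonl') as infile, open('tweets.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['id', 'created_at', 'text'])
    for line in infile:
        tweet = json.loads(line)
        writer.writerow([
            tweet['id_str'],
            tweet['created_at'],
            tweet.get('full_text', tweet.get('text', '')),
        ])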
Hi Ed,
Thank you so much for this. Surprisingly, the CSV conversion was pretty rapid; maybe that was the issue? If there's a way to send you an error log I'd be happy to support however I could. It's encouraging that the number of JSON objects matches the hydrated Tweets.
I would be most grateful if you'd be able to run it and send the CSV. Afterwards I can try to open it in R and manipulate it from there. When I'm not against a deadline, I'll try the py utility as I need to keep developing my skills. I've passed it on to my classmates, as I know a couple of them want to deal with JSON objects.
Once again I can't thank you enough for your support, it really means a lot!
Best, Tom
Ok, I will reply with a private email containing the link to the CSV. Let me know if you are able to read this with R. It will be a large DataFrame, so depending on your setup/resources it might make sense to subset just the data you need before loading it. The csvcut utility from csvkit might provide a nice way to do that.
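If you'd rather stay in Python than install csvkit, a streaming subset can be done with the csv module. A minimal sketch, where the column names below stand in for whichever ones you actually need:

import csv

# Stream a few columns into a smaller CSV so R has less to load.
# 'id', 'created_at' and 'text' are placeholders for the columns you need.
wanted = ['id', 'created_at', 'text']

with open('tweets.csv', newline='') as infile, open('subset.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    # extrasaction='ignore' drops any column not listed in wanted
    writer = csv.DictWriter(outfile, fieldnames=wanted, extrasaction='ignore')
    writer.writeheader()
    for row in reader:
        writer.writerow(row)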
It worked perfectly and the dataframe has fully loaded. Thanks for the assist. I'll take your advice and probably look to reduce the dataframe to make it easier to manipulate.
Since we have #56 and #51 to cover the problem of knowing how long CSV generation is taking, can we close this ticket?
Yes, thank you for your help.