gedankenstuecke / twitter-analyser

export data from twitter archive and visualize it
http://twarxiv.org
MIT License
25 stars · 11 forks

Add Support for new Archive Format #42

Closed igorbrigadir closed 5 years ago

igorbrigadir commented 5 years ago

With the new Twitter UI came new settings pages, and "Your Tweet Archive" is unfortunately no longer available. The only archive format now is the GDPR-compliant one from https://twitter.com/settings/your_twitter_data, which contains tweets.js instead of the CSV the script expects. (Previously in #39)

gedankenstuecke commented 5 years ago

Hey @igorbrigadir, thanks for reporting this! That's really a shame as the GDPR export contains a lot of private data that people might be less willing to share.

I guess the easiest way to solve this issue is by asking people to just upload the tweets.js file instead. Do you think that's a good solution?

igorbrigadir commented 5 years ago

Yes, I wouldn't trust uploading the entire file - maybe a warning message telling people exactly what's in there would be good too?

At the same time though, the archive also offers a lot of opportunity for extra analysis, if people are willing to provide the data - e.g. likes, friends and follows, lists. This needs more work though, because it looks like you only get IDs back.
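As a rough illustration of the IDs-only problem, extracting liked-tweet IDs might look like the sketch below. The `like.js` filename, the `window.YTD.like.part0` prefix, and the `{"like": {"tweetId": ...}}` shape are all assumptions modeled on the tweets file, not verified against an actual export:

```python
import json

# Hypothetical contents of like.js from the GDPR export; the exact shape
# (the "window.YTD.like.part0" prefix and the {"like": {...}} wrapper)
# is an assumption, not verified against a real archive.
raw = 'window.YTD.like.part0 = [{"like": {"tweetId": "123456"}}]'

# Strip the JS assignment up to the first '[' to get valid JSON
likes = json.loads(raw[raw.index('['):])

# Each entry only carries a tweet ID - no text, no metadata
liked_ids = [entry['like']['tweetId'] for entry in likes]
print(liked_ids)  # ['123456']
```

Resolving those IDs back into tweet text would require extra API calls, which is why this needs more work.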

gedankenstuecke commented 5 years ago

I've requested my data from Twitter and am waiting for them to send me the download link. Once I have it, it should be easier for me to understand what needs to be adapted.

gedankenstuecke commented 5 years ago

Ok, I got my data and can see the issue(s). Here are some notes for future me for when I get around to fixing this:

The tweet.js file that Twitter provides is, once again, not real JSON. To fix it in Python:

# tweet.js starts with "window.YTD.tweet.part0 = " (25 characters),
# which makes it invalid JSON; strip that prefix and save the result
tweet_string = open('tweet.js', 'r').readlines()
tweet_string = "".join([i.strip() for i in tweet_string])
tweet_string = tweet_string[25:]
open('tweet_fixed.js', 'w').write(tweet_string)

This fixes at least that problem. But then comes the second issue: because all the data is dumped into a single big file, it becomes hard to load all of it into memory at once. Luckily, ijson allows iterative parsing of big JSON files.

To do this:

import ijson

# Stream the top-level array items one at a time instead of
# loading the whole file into memory
objects = ijson.items(open('tweet_fixed.js', 'r'), 'item')

for o in objects:
    print(o)

That way you can iterate over the tweets, where each o is an individual tweet, and dump the data right into the arrays that we're using so far to build the overall pandas.DataFrame.
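Putting the two fixes together, a minimal end-to-end sketch looks like this. It uses an inline sample instead of a real tweet.js, plain json instead of ijson so the sample stays self-contained, and assumes each item exposes created_at/full_text directly (depending on the export version, items may instead be wrapped in a "tweet" key); the column lists are hypothetical names of mine:

```python
import json

# Simulated contents of tweet.js: Twitter prepends a JS assignment,
# which is why the raw file is not valid JSON
raw = ('window.YTD.tweet.part0 = '
      '[{"created_at": "Mon Jan 01 12:00:00 +0000 2018", '
      '"full_text": "hello world"}]')

# Strip everything up to the first '[' instead of a hard-coded offset,
# so the fix survives small changes to the prefix
tweets = json.loads(raw[raw.index('['):])

# Collect per-tweet fields into parallel lists, ready for
# pandas.DataFrame({'timestamp': timestamps, 'text': texts})
timestamps = [t['created_at'] for t in tweets]
texts = [t['full_text'] for t in tweets]

print(texts)  # ['hello world']
```

With ijson, the list comprehensions would become appends inside the `for o in objects:` loop, so only one tweet is in memory at a time.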

gedankenstuecke commented 5 years ago

This should be fixed with #43. Thanks for reporting it @igorbrigadir!