Closed: igorbrigadir closed this issue 5 years ago
Hey @igorbrigadir, thanks for reporting this! That's really a shame as the GDPR export contains a lot of private data that people might be less willing to share.
I guess the easiest way to solve this issue is by asking people to just upload the `tweets.js` file instead. Do you think that's a good solution?
Yes, I wouldn't trust uploading the entire file - maybe a warning message telling people exactly what's in there would be good too?
At the same time though, the archive also offers a lot of opportunity for extra analysis if people are willing to provide the data, e.g. likes, friends and follows, lists. This needs more work, though, because it looks like you only get IDs back for those.
I've requested my data from Twitter and am waiting for them to send me the download link. Once I have it, it should become easier for me to understand what needs to be adapted.
Ok, I got my data and can see the issue(s). Here are some notes for future me for when I get around to fixing this:
The `tweet.js` file that Twitter provides is once again not a real JSON file: it starts with a JavaScript assignment (`window.YTD.tweet.part0 = `) rather than plain JSON. To fix it in Python:

```python
# Read the file, join the stripped lines, and drop the 25-character
# "window.YTD.tweet.part0 = " JavaScript prefix
with open('tweet.js', 'r') as f:
    tweet_string = "".join(line.strip() for line in f)
tweet_string = tweet_string[25:]
```
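The stripped string can then be validated with the standard `json` module and written back out as `tweet_fixed.js` for the iterative parsing step. A minimal sketch, wrapped in a helper whose name and file paths are illustrative, not part of the script:

```python
import json

def fix_tweet_js(in_path='tweet.js', out_path='tweet_fixed.js'):
    """Strip the JS assignment prefix and save valid JSON (illustrative helper)."""
    with open(in_path, 'r') as f:
        tweet_string = "".join(line.strip() for line in f)
    tweet_string = tweet_string[25:]  # drop "window.YTD.tweet.part0 = "
    json.loads(tweet_string)          # raises ValueError if still not valid JSON
    with open(out_path, 'w') as f:
        f.write(tweet_string)
```

Parsing the result before writing it out gives an early failure if Twitter ever changes the prefix length.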
This fixes at least that problem. But then comes the second issue: by dumping all the data into a single big file, it becomes hard to parse all of that into memory. Luckily, `ijson` allows iterative parsing of big JSON files.
To do this:

```python
import ijson

with open('tweet_fixed.js', 'r') as f:
    for o in ijson.items(f, 'item'):
        print(o)
```
That way you can iterate over `o`, which is an individual tweet, and dump the data right into the arrays that we're using so far to create the overall `pandas.DataFrame`.
This should be fixed with #43. Thanks for reporting it @igorbrigadir!
With the new Twitter UI came new Settings pages. "Your Tweet Archive" is unfortunately no longer available; the only format for the archive is the different, GDPR-compliant one at https://twitter.com/settings/your_twitter_data, with `tweets.js` instead of the CSV the script expects. (Previously in #39)