DocNow / twarc-csv

A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.
MIT License

Expand Referenced Tweets #21

Closed: igorbrigadir closed this issue 3 years ago

igorbrigadir commented 3 years ago

Currently the referenced_tweets list is left as is, so the column in the CSV ends up like this:

[{"type": "replied_to", "id": "1380226330034372610"}]
[{"type": "quoted", "id": "1380226330034372610"}]
[{"type": "retweeted", "id": "1261081519566675969"}]

but we could expand this into separate columns:

referenced_tweets.replied_to
referenced_tweets.quoted
referenced_tweets.retweeted

By extension, the type column should be a list such as ["reply"], or ["retweet","reply","quote"] for a quote tweet that is a reply to someone and was then retweeted. The column should also be named __inferred_tweet_type, or something similar, to indicate where the field comes from.
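A minimal sketch of that expansion, not the plugin's actual code. The helper name `expand_referenced_tweets` and the `TYPE_LABELS` mapping from API reference types to the short labels above are assumptions made for illustration:

```python
import json

# Assumed mapping from referenced_tweets "type" values to the short labels
# suggested above; the plugin may use different names.
TYPE_LABELS = {"replied_to": "reply", "quoted": "quote", "retweeted": "retweet"}


def expand_referenced_tweets(tweet):
    """Return a flat row: one column per reference type plus an inferred type list."""
    row = {
        "referenced_tweets.replied_to": "",
        "referenced_tweets.quoted": "",
        "referenced_tweets.retweeted": "",
    }
    inferred = []
    for ref in tweet.get("referenced_tweets", []):
        # Each reference fills the column named after its type.
        row[f"referenced_tweets.{ref['type']}"] = ref["id"]
        inferred.append(TYPE_LABELS.get(ref["type"], ref["type"]))
    # Stored as a JSON string so the list survives as a single CSV cell.
    row["__inferred_tweet_type"] = json.dumps(inferred)
    return row


tweet = {
    "id": "1",  # placeholder id for the example
    "referenced_tweets": [{"type": "replied_to", "id": "1380226330034372610"}],
}
row = expand_referenced_tweets(tweet)
print(row)
```

A tweet with no referenced_tweets key would get three empty columns and an empty list under this sketch; how such tweets should be labelled is left open here.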

igorbrigadir commented 3 years ago

The CSV would end up looking like:

referenced_tweets.replied_to,referenced_tweets.quoted,referenced_tweets.retweeted,
,,1261081519566675969,
1380226330034372610,,,

etc
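For illustration, a layout like this could be produced with the standard library's csv.DictWriter. This is only a sketch: the `rows_to_csv` helper and the `COLUMNS` list are hypothetical, it writes just the three reference columns (so no trailing comma), and it is not how twarc-csv itself writes its output:

```python
import csv
import io

# Only the three reference columns from the example; a real export has many more.
COLUMNS = [
    "referenced_tweets.replied_to",
    "referenced_tweets.quoted",
    "referenced_tweets.retweeted",
]


def rows_to_csv(rows):
    """Serialize row dicts to CSV text, leaving missing reference columns empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"referenced_tweets.retweeted": "1261081519566675969"},
    {"referenced_tweets.replied_to": "1380226330034372610"},
]
csv_text = rows_to_csv(rows)
print(csv_text)
```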

igorbrigadir commented 3 years ago

Followup: https://twittercommunity.com/t/usersretweets-problem-with-json-normalize-to-flatten-the-nested-json/157713

igorbrigadir commented 3 years ago

This is done with ChainMap now:

            # reconstruct referenced_tweets object
            referenced_tweets = [
                {r["type"]: {"id": r["id"]}} for r in tweet["referenced_tweets"]
            ]
            # leave behind references, but not the full tweets
            # ChainMap flattens list into properties
            tweet["referenced_tweets"] = dict(ChainMap(*referenced_tweets))
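To try the snippet above in isolation, here is a self-contained version with the import it needs. The sample tweet is made up for the example, reusing the ids from earlier in this thread:

```python
from collections import ChainMap

# Made-up tweet carrying two references, for illustration only.
tweet = {
    "id": "1",
    "referenced_tweets": [
        {"type": "quoted", "id": "1380226330034372610"},
        {"type": "replied_to", "id": "1261081519566675969"},
    ],
}

# One single-key dict per reference: {"quoted": {"id": ...}}, {"replied_to": {"id": ...}}
referenced_tweets = [
    {r["type"]: {"id": r["id"]}} for r in tweet["referenced_tweets"]
]

# ChainMap presents the list of dicts as one mapping; dict() copies it into a plain dict.
flattened = dict(ChainMap(*referenced_tweets))
print(flattened)
```

With pandas.json_normalize or similar flattening, the nested keys would then come out as columns such as referenced_tweets.quoted.id. Note that ChainMap resolves a key from the first mapping that contains it, so if two references shared the same type only the first would survive.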