TheDataRideAlongs / ProjectDomino

Scaling COVID public behavior change and anti-misinformation
Apache License 2.0

WIP: Twint #69

bmorphism closed 4 years ago

bmorphism commented 4 years ago

This begins the integration of Twint data into the dataframe format required for storing tweets in Neo4j.

A few key differences from Twarc and other details:

@lmeyerov and I spent some time getting the fields from the Twint df to line up with what we were getting from Twarc, but work remains on integrating several additional fields (see https://github.com/TheDataRideAlongs/ProjectDomino/blob/twint/modules/Twint.py#L59)

Of note are: user_mentions, retweet_id, in_reply_to_status_id.
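As a rough illustration of the field alignment described above, here is a minimal sketch that renames Twint-style record keys to Twarc-style ones and fills the still-unmapped fields with `None`. The key names in `TWINT_TO_TWARC` and the helper `align_record` are assumptions made for illustration; they are not taken from modules/Twint.py.

```python
# Hypothetical mapping from Twint-style keys to Twarc-style keys.
# These names are assumed for illustration, not copied from the branch.
TWINT_TO_TWARC = {
    "tweet": "full_text",
    "nlikes": "favorite_count",
    "nretweets": "retweet_count",
}

# Fields the PR description lists as not yet integrated.
PENDING_FIELDS = ("user_mentions", "retweet_id", "in_reply_to_status_id")


def align_record(twint_record):
    """Return a copy of the record with Twarc-style keys.

    Keys without a mapping are passed through unchanged, and pending
    fields that are absent are added as None so downstream code can
    rely on them being present.
    """
    aligned = {}
    for key, value in twint_record.items():
        aligned[TWINT_TO_TWARC.get(key, key)] = value
    for field in PENDING_FIELDS:
        aligned.setdefault(field, None)
    return aligned


example = align_record({"id": 1, "tweet": "hello", "nlikes": 3})
```

Filling the pending fields with `None` keeps the record shape stable while the remaining mappings are worked out.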

Twint is designed to fetch tweets based on Since and Until timestamps (with granularity down to one second) and can operate as a streaming mechanism, whereas twarc can be kept for historic pulls by id.
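To make the Since/Until idea concrete, here is a small sketch that splits a time range into consecutive windows that a polling loop could hand to Twint one at a time. The helper `time_windows` and the timestamp format are assumptions for illustration; the sketch does not call Twint itself.

```python
from datetime import datetime, timedelta

# Assumed timestamp format for the Since/Until strings.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def time_windows(start, end, step):
    """Yield (since, until) string pairs covering [start, end).

    Windows are contiguous and non-overlapping; the last window is
    shortened so it never extends past `end`.
    """
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor.strftime(TIMESTAMP_FORMAT), upper.strftime(TIMESTAMP_FORMAT)
        cursor = upper


windows = list(
    time_windows(
        datetime(2020, 4, 1, 0, 0, 0),
        datetime(2020, 4, 1, 0, 1, 30),
        timedelta(seconds=60),
    )
)
```

Each pair could then be assigned to the Since and Until settings of a search, with the window size chosen to match how close to real time the job needs to run.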

lmeyerov commented 4 years ago

Let's keep working on the branch till we're ready to merge

@bechbd We need some deltas to neo4j:

In addition, twint largely fails at grabbing user profile data. We're probably better off with a separate Prefect job that independently hydrates the user ids we recently recorded (by recorded_created_at) with empty or partial profiles. We'll take a look at that in our next session.
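A possible shape for the selection step of such a hydration job is sketched below. Everything except `recorded_created_at` is assumed for illustration: the profile field names, the helper `ids_needing_hydration`, and the in-memory list standing in for a Neo4j query result.

```python
from datetime import datetime, timedelta

# Hypothetical profile fields; a missing value marks the profile as partial.
PROFILE_FIELDS = ("name", "location", "followers_count")


def ids_needing_hydration(users, now, max_age):
    """Return ids of recently recorded users with empty or partial profiles."""
    pending = []
    for user in users:
        recorded = user.get("recorded_created_at")
        # Skip users with no timestamp or recorded too long ago.
        if recorded is None or now - recorded > max_age:
            continue
        if any(user.get(field) is None for field in PROFILE_FIELDS):
            pending.append(user["id"])
    return pending


now = datetime(2020, 4, 2, 12, 0, 0)
users = [
    # Recent and empty: should be hydrated.
    {"id": 1, "recorded_created_at": now - timedelta(hours=1)},
    # Recent and complete: nothing to do.
    {"id": 2, "recorded_created_at": now - timedelta(hours=2),
     "name": "a", "location": "b", "followers_count": 10},
    # Partial but old: outside the recency window.
    {"id": 3, "recorded_created_at": now - timedelta(days=3), "name": "c"},
]
pending_ids = ids_needing_hydration(users, now, timedelta(days=1))
```

The returned ids could then be passed to whatever hydration call the job uses, keeping the Twint ingestion path free of profile lookups.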

bechbd commented 4 years ago

Added conversation_id and geo properties. The other properties already allow nulls.

lmeyerov commented 4 years ago

Merging as working for local (non-neo4j push) use of twint