Closed bmorphism closed 4 years ago
Let's keep working on the branch till we're ready to merge
@bechbd We need some deltas to neo4j: `conversation_id`, `geo`, `retweet_id`, maybe others.
In addition, twint largely fails at grabbing user profile data. We're probably better off doing a separate prefect job that independently hydrates recently `recorded_created_at` user ids that we found (empty/partial). We'll take a look at that in our next session.
Added `conversation_id` and `geo` properties. Other properties already allow nulls.
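To illustrate what "allow nulls" means for a record before it is written, here is a minimal sketch that pads a tweet record with `None` for any property it lacks. The `REQUIRED_FIELDS` list and the `pad_record` helper are hypothetical names used only for this example; they are not taken from the branch.

```python
from typing import Any, Dict, List

# Hypothetical property list for illustration; the real schema lives in the Neo4j code.
REQUIRED_FIELDS: List[str] = ["id", "conversation_id", "geo", "retweet_id"]


def pad_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` where every required field exists (None if absent)."""
    padded = dict(record)  # copy so the caller's dict is not mutated
    for field in REQUIRED_FIELDS:
        padded.setdefault(field, None)
    return padded


# A record that only carries an id and a conversation id.
row = pad_record({"id": "123", "conversation_id": "123"})
print(row)
```

Padding on the way in keeps the write path uniform: every record has the same keys, and missing values surface as explicit nulls rather than as missing columns.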
Merging as working for local (non-neo4j push) use of twint
This begins the integration of Twint data into the dataframe format required for storing tweets in Neo4j.
A few key differences from Twarc and other details:
`twint.run.Search()`, `twint.run.Followers()`, and `twint.run.Lookup()`
@lmeyerov and I spent some time getting the fields from the Twint df to line up with what we were getting from Twarc, but work remains on integrating several additional fields (see https://github.com/TheDataRideAlongs/ProjectDomino/blob/twint/modules/Twint.py#L59)
Of note are: `user_mentions`, `retweet_id`, `in_reply_to_status_id`.
Consequently, `twint` is designed to pull tweets based on Since and Until timestamps (with granularity down to a second) and can operate as a streaming mechanism, whereas `twarc` can be preserved for historic pulls by id.
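As a rough sketch of the Since/Until approach, the snippet below splits a time range into second-granularity windows and shows how each window might be passed to `twint.run.Search()`. The helper names `time_windows` and `search_window` are hypothetical and not part of this PR; the Twint config attributes (`Search`, `Since`, `Until`, `Pandas`, `Hide_output`) are assumed to behave as in the Twint releases current at the time of writing.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

# Timestamp format assumed for Twint's Since/Until settings.
TWINT_TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def time_windows(start: datetime, end: datetime,
                 step: timedelta) -> Iterator[Tuple[str, str]]:
    """Yield consecutive (since, until) timestamp strings covering [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # clamp the last window to `end`
        yield cursor.strftime(TWINT_TS_FORMAT), upper.strftime(TWINT_TS_FORMAT)
        cursor = upper


def search_window(query: str, since: str, until: str):
    """Hypothetical wrapper: run one Twint search and return its dataframe."""
    import twint  # imported lazily so the window helper works without Twint

    c = twint.Config()
    c.Search = query
    c.Since = since
    c.Until = until
    c.Pandas = True        # collect results into Twint's pandas storage
    c.Hide_output = True   # suppress per-tweet console output
    twint.run.Search(c)
    return twint.storage.panda.Tweets_df


# Three 30-second windows; a polling loop could call search_window() on each.
windows = list(time_windows(datetime(2020, 3, 1, 0, 0, 0),
                            datetime(2020, 3, 1, 0, 1, 30),
                            timedelta(seconds=30)))
for since, until in windows:
    print(since, "->", until)
```

Advancing the window on a schedule is one way to approximate streaming with Twint, while pulls by tweet id stay with Twarc.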