DocNow / twarc

A command line tool (and Python library) for archiving Twitter JSON
https://twarc-project.readthedocs.io
MIT License

v2 API support #372

Closed danielverd closed 3 years ago

danielverd commented 3 years ago

According to Twitter documentation, there is a new endpoint for academic researchers. Do any changes need to be made to the config file or to twarc itself in order to access the new API results? Or does twarc just read/scrape the data provided by whatever API endpoint your access token specifies? Thanks, and sorry for the potentially non-technical question.

https://developer.twitter.com/en/solutions/academic-research/products-for-researchers

edsu commented 3 years ago

The new academic search is the new v2 api endpoint, but with an account that they have designated as an actual academic researcher. You need to apply to have your account blessed in this way and then you simply use the v2 API.

We do not yet support the v2 api in twarc. But we have a call next Monday to discuss. Let me know if you would like to participate. I will use this issue as a way to track v2 api support if that's ok?

danielverd commented 3 years ago

You can absolutely repurpose this issue. As for the call, I wouldn't have much to contribute, but it'd be cool to just listen in if that's alright.

melaniewalsh commented 3 years ago

Just wanted to upvote twarc support for the V2 API! I think researchers will be interested in the new Academic Track, and it would be great if twarc could support them.

edsu commented 3 years ago

Hi @melaniewalsh! We have a call on Monday at 6pm EST to discuss v2 api support. If you are interested, drop me a note at ehs@pobox.com and I will add you.

melaniewalsh commented 3 years ago

Awesome, @edsu! I will be in touch.

edsu commented 3 years ago

@melaniewalsh let me know about the development work that Twitter have done on search-tweets-python-v2. Once installed, you can run search_tweets.py from the command line to search the v2 api. If you've played with the labs api or v2 api before you'll know that you have to request particular fields and expansions. I experimented with requesting all possible fields in order to get the maximum amount of information out of the API, and came up with this example command:

search_tweets.py \
    --query obama \
    --results-per-call 100 \
    --tweet-fields id,text,attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,in_reply_to_user_id,lang,possibly_sensitive,public_metrics,referenced_tweets,reply_settings,source,withheld \
    --user-fields id,name,username,created_at,description,entities,location,pinned_tweet_id,profile_image_url,protected,public_metrics,url,verified,withheld \
    --media-fields media_key,type,duration_ms,height,preview_image_url,public_metrics,width \
    --poll-fields id,options,duration_minutes,end_datetime,voting_status \
    --place-fields full_name,id,contained_within,country,country_code,geo,name,place_type \
    --expansions author_id,referenced_tweets.id,in_reply_to_user_id,attachments.media_keys,attachments.poll_ids,geo.place_id,entities.mentions.username,referenced_tweets.id.author_id \
    --filename-prefix obama \
    --debug

# I removed these fields so it wouldn't silently error out
# --tweet-fields non_public_metrics,organic_metrics,promoted_metrics
# --media-fields non_public_metrics,organic_metrics,

I'm having a bit of difficulty interpreting the results, since each line in the resulting output seems to be a different type of object? I'm curious what downstream tools consuming this output would look like.

I'm going to leave it running to see how this utility handles rate limits.

edsu commented 3 years ago

Ahh nice, it does seem to handle the rate limits with some kind of incremental back off?

ERROR:searchtweets.result_stream:Rate limit hit... Will retry...
ERROR:searchtweets.result_stream:Will retry in 36 seconds...
ERROR:searchtweets.result_stream:Rate limit hit... Will retry...
ERROR:searchtweets.result_stream:Will retry in 64 seconds...
ERROR:searchtweets.result_stream:Rate limit hit... Will retry...
ERROR:searchtweets.result_stream:Will retry in 100 seconds...
ERROR:searchtweets.result_stream:Rate limit hit... Will retry...
ERROR:searchtweets.result_stream:Will retry in 144 seconds...
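
Those delays are perfect squares (6², 8², 10², 12²), so it looks quadratic rather than exponential. A guess at the pattern, not searchtweets' actual code:

# delays grow as (2 * attempt + 4) ** 2: 36, 64, 100, 144 ...
for attempt in range(1, 5):
    delay = (2 * attempt + 4) ** 2
    print(f"Will retry in {delay} seconds...")
    # time.sleep(delay) would go here before re-issuing the request
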
edsu commented 3 years ago

It finished with this message:

Error parsing content as JSON.
INFO:searchtweets.result_stream:ending stream at 494208 tweets

The resulting file contains 504,138 lines. Since I set --results-per-call to 100, it looks like tweets come over as individual lines, 100 per call. Each set of 100 tweets is followed by a line with an includes object that has properties for media, places, tweets and users. That line is in turn followed by a meta line with properties for newest_id, next_token, oldest_id, and result_count.
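
Assuming that layout, a downstream tool could classify each line of the file by its keys. A minimal sketch (the filename is hypothetical, based on the --filename-prefix above):

import json

tweets, includes, meta = [], [], []
with open("obama_tweets.json") as f:            # hypothetical output file
    for line in f:
        obj = json.loads(line)
        if "result_count" in obj:               # the per-call meta line
            meta.append(obj)
        elif "users" in obj or "media" in obj:  # the includes line
            includes.append(obj)
        else:                                   # an individual tweet
            tweets.append(obj)

print(len(tweets), "tweets across", len(meta), "calls")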

It looks like the oldest tweet I received was created Wed Jan 27 00:35:54 +0000 2021 and the newest was Sun Jan 31 02:26:18 +0000 2021.

On trying to restart the collection I received this error ERROR:searchtweets.result_stream: HTTP Error code: 429 with this JSON body:

{
  "account_id":9999999999999999,
  "product_name":"standard-basic",
  "title":"UsageCapExceeded",
  "period":"Monthly",
  "scope":"Product",
  "detail":"Usage cap exceeded: Monthly product cap",
  "type":"https://api.twitter.com/2/problems/usage-capped"
}

I appear to have exhausted my quota for the month. Note, I do not have academic search turned on. This of course is vastly different from what could previously be collected with the v1.1 search API, where (with application auth) you could retrieve 4,320,000 tweets in 24 hours without exhausting a monthly quota. That works out to a 500,000 tweet per month cap compared with 129,600,000 per month, which is down by a factor of about 260!
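
The arithmetic behind that factor, for anyone checking: v1.1 search with app auth allowed 450 requests per 15-minute window at 100 tweets per request.

windows_per_day = 24 * 60 // 15             # 96 fifteen-minute windows
v1_per_day = 450 * 100 * windows_per_day    # 4,320,000 tweets per day
v1_per_month = v1_per_day * 30              # 129,600,000 tweets per month
v2_per_month = 500_000                      # standard-basic monthly cap
print(v1_per_month / v2_per_month)          # 259.2, roughly a 260x reduction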

One thing this experiment highlighted for me is that this tool only uses the search API. Wouldn't we want twarc to be able to at least get data in real time from the v2 filter stream, and possibly from other v2 endpoints as well? Perhaps subcommands for each of these endpoints could work?

These commands would do the work of figuring out what the fullest representation of the response is. This would need to change depending on whether app or user auth is being used. In addition, some actions like filter are no longer atomic: they require first setting up a filter job and then retrieving its results. Should this happen automatically behind the scenes, or should twarc require multiple interactions to create a filter job, list filter jobs, and activate filter jobs? A sketch of the two-step flow follows.
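
For reference, the v2 filtered stream is a two-step interaction: first register a rule, then connect to the stream. A rough requests sketch with a placeholder bearer token, not twarc code:

import requests

headers = {"Authorization": "Bearer XXXX"}  # placeholder token

# step 1: register a rule (the non-atomic "set up a filter job" part)
requests.post(
    "https://api.twitter.com/2/tweets/search/stream/rules",
    headers=headers,
    json={"add": [{"value": "obama", "tag": "example"}]},
)

# step 2: connect and read matching tweets as they arrive
with requests.get(
    "https://api.twitter.com/2/tweets/search/stream",
    headers=headers,
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())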

Of course these are simple additions, and in theory they could be layered into the current codebase. But we could also choose to leave twarc working with the v1.1 API and create a twarc2. If you've looked at the source code you can see there is quite a bit of cruft now, and it might be good to start afresh?

edsu commented 3 years ago

I added a strawman proposal for a twarc2 over here which we can discuss on the Feb 1 call.

melaniewalsh commented 3 years ago

The strawman proposal for twarc2 looks great, @edsu!

I think the "stitch" option will be crucial as that seems to be one of the biggest and most challenging changes to the API. Also I agree that you would want people to get real-time data through the filter stream.
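
To illustrate what stitching means in practice (re-attaching the shared includes objects to each tweet so every record stands alone), here is a toy sketch assuming a v2 response page with data and includes.users; this is not the proposal's actual code:

def stitch(page):
    # index the expanded user objects by id
    users = {u["id"]: u for u in page.get("includes", {}).get("users", [])}
    for tweet in page["data"]:
        # re-attach the author so each tweet is self-contained
        tweet["author"] = users.get(tweet.get("author_id"))
        yield tweet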

But wow I didn't realize that the monthly tweet cap has been curbed so much for basic access... The developer documentation says that they might introduce some elevated cap options for "other types of developers," but it's not clear what that means:

There is a Project-level Tweet cap limiting the number of Tweets you can retrieve from several Twitter API v2 endpoints. This is set to 500,000 Tweets per month for Standard Projects at the Basic access level, and 10,000,000 Tweets per month for Academic Projects... The existing Tweet caps are hard limits. In future releases, we will be launching elevated access options for the Twitter API v2 endpoints across every product track. This will create different options for academic researchers, businesses, and other types of developers. Learn how to stay informed of our plans and launches.

edsu commented 3 years ago

Noting that tweepy are working on v2 support, which is good to see, and which lets twarc focus on the things that it is good at, namely making it easy to retrieve data from the API and persist it as files.

https://github.com/tweepy/tweepy/pull/1535

edsu commented 3 years ago

v2.0.0 with Twitter v2 API support was just uploaded to PyPI!
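
For anyone arriving here later: with the new release installed (pip install --upgrade twarc) the v2 search is usable from Python roughly like this; check the docs linked above for the exact, current API:

from twarc.client2 import Twarc2

t = Twarc2(bearer_token="XXXX")          # your v2 bearer token
for page in t.search_recent("obama"):    # yields pages of results
    for tweet in page["data"]:
        print(tweet["id"], tweet["text"])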