DocNow / twarc

A command line tool (and Python library) for archiving Twitter JSON
https://twarc-project.readthedocs.io
MIT License

Random sample option #453

Open igorbrigadir opened 3 years ago

igorbrigadir commented 3 years ago

A --sample N option for twarc2 search that modifies retrieval to work around the API limitation of returning all tweets in reverse chronological order.

eg:

twarc2 search --archive --sample 10 "example query" output.json

To return a 10% sample of results.

Tweet IDs are Snowflake IDs, which encode a millisecond timestamp. Sampling on the millisecond field is how Twitter "samples" the 1% stream: it takes a fixed window of millisecond values, which we can emulate here too. By specifying date ranges as tweet IDs with since_id and until_id, it should be possible to construct the millisecond time ranges needed to effectively create a random sample of tweets for a query.
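A minimal sketch of what this could look like, assuming the standard Twitter Snowflake layout (41-bit millisecond timestamp above 22 bits of worker/sequence data, custom epoch 1288834974657 ms). Function names here are illustrative, not twarc API:

```python
# Convert between Snowflake IDs and millisecond timestamps, and rewrite a
# time range into per-second (since_id, until_id) windows that keep only a
# fixed slice of milliseconds out of each second.

TWITTER_EPOCH_MS = 1288834974657  # ms offset used by Twitter's Snowflake IDs

def snowflake_to_ms(tweet_id: int) -> int:
    # Top 41 bits are a millisecond timestamp, shifted past 22 low bits.
    return (tweet_id >> 22) + TWITTER_EPOCH_MS

def ms_to_snowflake(ms: int) -> int:
    # Smallest possible ID for a given millisecond (worker/sequence = 0).
    return (ms - TWITTER_EPOCH_MS) << 22

def sample_windows(start_ms: int, end_ms: int, percent: int = 10):
    """Yield (since_id, until_id) pairs covering `percent` ms of each second."""
    width = percent * 10  # e.g. 10% -> keep 100 ms out of every 1000
    sec = start_ms - (start_ms % 1000)
    while sec < end_ms:
        yield ms_to_snowflake(sec), ms_to_snowflake(sec + width)
        sec += 1000
```

Each yielded pair would then be issued as a separate search call with those since_id/until_id values.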

This won't require any new dependencies, since it's purely based on rewriting the original query parameters to use since_id and until_id and issuing lots of calls (while staying within the rate limit). That's why it's maybe not suitable for a plugin, but i'm open to suggestions on whether this belongs in twarc2 or a plugin.

For reference:

https://github.com/client9/snowflake2time

https://twitter.com/JurgenPfeffer/status/1191172359216078849

https://twittercommunity.com/t/generating-a-random-set-of-tweet-ids/150255

edsu commented 3 years ago

I like it! Maybe --sample .10 or is that too dense?

igorbrigadir commented 3 years ago

Yeah - i'd like it to accept a "named" sample strategy or an integer to simplify things.

Some important points on this - sampling by splitting up a query into multiple ones will work technically but practically there's a huge speed trade-off:

The rate limit of 1 request per second means that sampling in v2 will take 10 requests for every second of data. That may be fine for filling in gaps, but may not be viable for gathering a dataset that spans a long time. It should be fine for sampling a short time range.

SamHames commented 3 years ago

Maybe a dumb question: what is the point of recreating Twitter's sampling logic - what are the actual intended use cases?

I can see lots of use cases for sampling, but I struggle to see how they are satisfied by a slow recreation of the Twitter approach.

What would be useful to me is something like retrieving one small page of results starting from a randomly selected second within the hour. Each page gives you a point estimate for the tweet rate for that hour, and you could move through a wide time window quickly (approx. 1 week/15 minutes). This would let me narrow down time regions of interest and estimate tweet volumes before collecting everything.
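The random-second probing described above could be sketched like this (hypothetical helper names, standard Snowflake epoch assumed; each pair would feed one single-page search call):

```python
import random

TWITTER_EPOCH_MS = 1288834974657  # Twitter's Snowflake epoch

def ms_to_snowflake(ms: int) -> int:
    # Smallest possible ID for a given millisecond (worker/sequence = 0).
    return (ms - TWITTER_EPOCH_MS) << 22

def random_second_probes(start_ms: int, end_ms: int):
    """For each hour in [start_ms, end_ms), pick one random second and
    yield a (since_id, until_id) pair spanning just that second.

    One small page of results per pair gives a point estimate of the
    tweet rate for that hour."""
    hour_ms = 3600 * 1000
    hour = start_ms - (start_ms % hour_ms)
    while hour < end_ms:
        sec = hour + random.randrange(3600) * 1000
        yield ms_to_snowflake(sec), ms_to_snowflake(sec + 1000)
        hour += hour_ms
```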

igorbrigadir commented 3 years ago

Yeah this is the thing - given how slow it will be, its only real use is to reconstruct a sample stream if you had gaps due to outages, which is pretty limited i admit.

But, the general "sampling" implementation, will let you do exactly what you describe. (I want it to be flexible enough to specify those things - i'm open to ideas on how to make this user friendly)

ZacharyST commented 2 years ago

Chiming in on use cases. I'm using the full archive search and estimate my queries will return 75 million tweets; Twitter's academic quota is 10 million. I'd gladly pass a 10% sample parameter to get 7.5 million tweets.

igorbrigadir commented 2 years ago

Yeah, the trade-off will be the time it takes to download (the number of calls): you can make 1 call per second in the full archive search, and 1 call can retrieve a max of 500 tweets. To make a "sample" using this since_id/until_id strategy, in the worst case it will take roughly 10 times the number of calls to retrieve the same time range as it would normally take: instead of making 1 call to get 500 tweets, for example, it would take 10 calls, each covering one of the small ID ranges. I think https://twitter.com/JurgenPfeffer/status/1191172359216078849 explains it better (it's about the older Labs endpoints, but v2 works exactly the same way).
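A rough back-of-envelope of that trade-off, using the 75M-tweet scenario above (all numbers are illustrative: 500 tweets per call, 1 call per second, and the worst-case 10x call overhead):

```python
TWEETS_PER_CALL = 500  # max page size for full archive search
CALLS_PER_SEC = 1      # full archive search rate limit

def hours_to_download(n_tweets: int, call_multiplier: float = 1.0) -> float:
    """Rough wall-clock hours to page through n_tweets, given a
    worst-case multiplier on the number of calls needed."""
    calls = (n_tweets / TWEETS_PER_CALL) * call_multiplier
    return calls / CALLS_PER_SEC / 3600

# Full 75M-tweet collection: 150,000 calls, roughly 42 hours.
full = hours_to_download(75_000_000)

# A 10% sample (7.5M tweets) with the worst-case 10x call overhead:
# the same 150,000 calls, so no wall-clock saving in the worst case.
sampled = hours_to_download(7_500_000, call_multiplier=10)
```

The saving here is against the monthly tweet cap (7.5M tweets counted instead of 75M), not against download time.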