JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

Twitter search scraping sometimes stops early #37

Closed JustAnotherArchivist closed 5 years ago

JustAnotherArchivist commented 5 years ago

snscrape twitter-search 'ProJared since:2019-05-01' stopped much too early just now. No error or anything, it just stopped.

JustAnotherArchivist commented 5 years ago

This is not the same as #4. There are actually more pages, snscrape just did not retrieve them.

eyaler commented 3 years ago

@JustAnotherArchivist I installed the latest code from git. TwitterSearchScraper stopped early with no error. max-position is no longer in the code, but I am able to continue with max_id. Is this the intended way? Should I wrap TwitterSearchScraper in a while True loop? How can one discern between early stopping and the end of the search? (In my case I searched just by language, so I am sure it stopped early.)
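For reference, the resume loop described above can be sketched roughly like this. `make_scraper` is a hypothetical stand-in for constructing a `TwitterSearchScraper` from a query string, and the `max_id:` search operator is assumed to behave as in Twitter's standard search; this is a sketch of the workaround, not part of snscrape's API.

```python
def scrape_with_resume(make_scraper, query, target_count):
    """Restart a search with max_id set below the oldest tweet seen,
    so an early stop resumes where the last run left off.

    make_scraper: hypothetical callable taking a query string and
    returning an object whose .get_items() yields tweets with an .id.
    """
    tweets = []
    min_id = None
    while len(tweets) < target_count:
        # On a restart, constrain the query to tweets older than the
        # oldest one already collected (assumed max_id: semantics).
        q = query if min_id is None else f"{query} max_id:{min_id - 1}"
        got_any = False
        for tweet in make_scraper(q).get_items():
            got_any = True
            tweets.append(tweet)
            min_id = tweet.id
            if len(tweets) >= target_count:
                return tweets
        if not got_any:
            # Resuming returned nothing at all: treat as the true end.
            return tweets
    return tweets
```

Note the caveat from this thread: an empty resumed page cannot reliably be distinguished from another early stop, so "no results after resuming" is only a heuristic for the true end.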

JustAnotherArchivist commented 3 years ago

I never found a way to detect early stops reliably. There was no indication of it in Twitter's responses at the time, at least. And its non-reproducibility makes debugging very difficult...

Some ideas:

eyaler commented 3 years ago

thanks. what delay is recommended?

JustAnotherArchivist commented 3 years ago

Good question, no idea. Twitter's index breaking for hours isn't unprecedented (#123)... With simple retries, I'd probably wait a couple of minutes between attempts so that it doesn't slow things down too much overall. With a clear indication that the results are incomplete, I'd use an exponential backoff.
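The exponential-backoff approach mentioned above can be sketched as a small generic helper. Both `fetch` and `is_complete` are hypothetical callables standing in for one scrape attempt and a completeness check; snscrape itself exposes no such hook.

```python
import time

def retry_with_backoff(fetch, is_complete, max_attempts=5, base_delay=60.0):
    """Retry fetch() until is_complete(result) accepts the result,
    doubling the wait between attempts (exponential backoff).

    Returns the last result even if it never looked complete, so the
    caller can decide what to do with a persistently short scrape.
    """
    delay = base_delay
    result = None
    for attempt in range(max_attempts):
        result = fetch()
        if is_complete(result):
            return result
        time.sleep(delay)
        delay *= 2  # back off: 60s, 120s, 240s, ...
    return result
```

With `base_delay=60.0` this starts near the "couple of minutes" scale suggested above; the right values depend on how long Twitter's index stays broken, which (per #123) can be hours.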

JustAnotherArchivist commented 3 years ago

What a coincidence! I just ran into this again in a fairly reproducible way. A particular search of the form (a OR b) (min_faves:c OR min_replies:d OR min_retweets:e) since:f until:g (the exact filter conditions don't seem to matter; the date range in my tests was in early 2017) keeps breaking after a few hundred results for me. I can reproduce it on two machines on different continents with snscrape but was unable to trigger it in a browser. The breaking point varies.

When it happens, Twitter returns the same cursor as requested, and retrying with the same cursor tends to work after a few attempts. I don't see any differences whatsoever between those responses and ones at the true end, except for timing: the responses at the true end arrive very quickly while the ones at a false end are slower (but still faster than real responses with tweets). Of course that's not at all a reliable measure.

It gets worse though: the results vary as well. I.e. running the same query multiple times returns different tweets.

This might be a special case and not really the original issue, but it's a useful one anyway due to the reproducibility. I will add some code to retry when Twitter returns no tweets and the same cursor as requested.
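That retry logic can be sketched as follows. `get_page(cursor)` returning `(tweets, next_cursor)` is a hypothetical stand-in for one request in the scraper's pagination loop; the actual change would live inside snscrape's internals.

```python
def fetch_all(get_page, first_cursor, max_retries=3):
    """Page through search results, retrying whenever a response
    carries no tweets and echoes back the cursor we sent -- the
    'false end' signature described above.

    get_page: hypothetical callable, cursor -> (tweets, next_cursor),
    where next_cursor is None at the true end of the results.
    """
    cursor = first_cursor
    tweets = []
    retries = 0
    while True:
        page, next_cursor = get_page(cursor)
        if not page and next_cursor == cursor:
            # Empty page with an unchanged cursor: likely a false end,
            # so retry the same request a few times.
            retries += 1
            if retries > max_retries:
                return tweets  # give up; treat as the true end
            continue
        retries = 0
        tweets.extend(page)
        if next_cursor is None:
            return tweets
        cursor = next_cursor
```

Per the observations above, the same response shape also appears at the true end, so a retry cap is needed to avoid looping forever on a genuinely exhausted search.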

eyaler commented 3 years ago

Thanks! So you now only wait a few seconds? You say it takes a few attempts... maybe a longer wait would be more robust? Also, not sure if this is of any interest, but here is my code to fetch as many tweets as you want in a certain language (and deal with various issues). I was able to get 20 million Hebrew tweets at a rate of 400k tweets per hour. It may be of interest to others using this great library. https://github.com/eyaler/hspell-wordlists/blob/main/har_zahav.py

JustAnotherArchivist commented 3 years ago

The code does not wait at all between attempts at the moment. In my tests, that wasn't necessary. But there may be cases where it is, and I'm certainly curious to hear about them. Ultimately, it's a trade-off as short scrapes (i.e. ones with few results) would be needlessly blocked for a long time at the end, which isn't desirable.