JustAnotherArchivist closed this issue 5 years ago
This is not the same as #4. There are actually more pages, snscrape just did not retrieve them.
@JustAnotherArchivist I installed the latest code from git. TwitterSearchScraper stopped early with no error. max-position is no longer in the code. I am able to continue with max_id. Is this the intended way? Should I wrap TwitterSearchScraper in a while True loop? How can one discern between early stopping and the end of the search? (In my case I searched just by language, so I am sure it stopped early.)
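A minimal sketch of the resume loop the question describes, assuming snscrape's Python API (TwitterSearchScraper and its get_items() generator) and Twitter's max_id: search operator; the example query and the process() handler are placeholders, not part of snscrape:

    import snscrape.modules.twitter as sntwitter

    query = 'lang:he'  # placeholder: search by language, as in the question

    def process(tweet):  # placeholder handler; replace with real storage logic
        print(tweet.id, tweet.date)

    max_id = None
    while True:
        q = query if max_id is None else f'{query} max_id:{max_id}'
        got_any = False
        for tweet in sntwitter.TwitterSearchScraper(q).get_items():
            got_any = True
            max_id = tweet.id - 1  # results arrive newest-first, so this keeps shrinking
            process(tweet)
        if not got_any:
            break  # a full pass returned nothing: possibly, but not provably, the end

As the next comment notes, an empty pass cannot reliably be told apart from yet another early stop.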
I never found a way to detect early stops reliably. There was no indication of it in Twitter's responses at the time, at least. And its non-reproducibility makes debugging very difficult...
Some ideas:

- If the query limits the date range (e.g. with since:X) and the last result you have is from a later date Y, it should be possible to calculate a likelihood that there are indeed no tweets between X and Y. This would have to take into account that the rate of tweets can change over time though, sometimes drastically, otherwise you may get false positives.
- Scrape in chunks, e.g. one month at a time, and check whether the oldest tweet returned for each chunk is close to the chunk boundary, i.e. YYYY-MM-01 00:00 UTC. If it isn't, it probably broke, and I'd rerun that month. (A sketch of this check follows below.)
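A minimal sketch of the second idea, assuming snscrape's Python API (TwitterSearchScraper, get_items(), and Tweet objects exposing a UTC-aware date). The function name month_looks_complete and the one-day slack threshold are illustrative choices, not anything from snscrape itself:

    import datetime
    import snscrape.modules.twitter as sntwitter

    def month_looks_complete(query, year, month, slack=datetime.timedelta(days=1)):
        start = datetime.datetime(year, month, 1, tzinfo=datetime.timezone.utc)
        nxt = datetime.datetime(year + month // 12, month % 12 + 1, 1,
                                tzinfo=datetime.timezone.utc)
        q = f'{query} since:{start:%Y-%m-%d} until:{nxt:%Y-%m-%d}'
        oldest = None
        for tweet in sntwitter.TwitterSearchScraper(q).get_items():
            oldest = tweet.date  # newest-first, so the last item seen is the oldest
        # If the oldest tweet is far from the month's start, the scrape probably broke.
        return oldest is not None and oldest - start <= slack

The caveat from the first idea applies here too: a quiet month can legitimately have its oldest tweet well after the boundary, so a failed check is only a signal to rerun, not proof of breakage.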
Thanks. What delay is recommended?
Good question, no idea. Twitter's index breaking for hours isn't unprecedented (#123)... With simple retries, I'd probably wait a couple minutes between attempts so that it doesn't slow things down too much overall. With a clear indication that the results are incomplete, I'd use an exponential backoff.
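A sketch of both strategies in one helper, assuming a hypothetical run_scrape callable that returns something falsy when the scrape came back empty or looks incomplete; the defaults (five attempts, two minutes, doubling) are arbitrary:

    import time

    def scrape_with_retries(run_scrape, attempts=5, first_delay=120, backoff=2):
        delay = first_delay  # a couple of minutes between blind retries
        for _ in range(attempts):
            results = run_scrape()
            if results:
                return results
            time.sleep(delay)
            delay *= backoff  # grow the wait between successive attempts
        return None  # gave up; the caller decides whether this is the true end

With backoff=1 this degenerates to the flat couple-of-minutes retry; a value above 1 gives the exponential backoff suggested for the case where incompleteness can actually be detected.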
What a coincidence! I just ran into this again in a fairly reproducible way. A particular search of the form (a OR b) (min_faves:c OR min_replies:d OR min_retweets:e) since:f until:g
(the exact filter conditions don't seem to matter; the date range in my tests was in early 2017) keeps breaking after a few hundred results for me. I can reproduce it on two machines on different continents with snscrape but was unable to trigger it in a browser. The breaking point varies. When it happens, Twitter returns the same cursor as requested, and retrying with the same cursor tends to work after a few attempts. I don't see any differences whatsoever between those responses and ones at the true end, except for timing: the responses at the true end arrive very quickly while the ones at a false end are slower (but still faster than real responses with tweets). Of course that's not at all a reliable measure.
It gets worse though: the results vary as well, i.e. running the same query multiple times returns different tweets.
This might be a special case and not really the original issue, but it's a useful one anyway due to the reproducibility. I will add some code to retry when Twitter returns no tweets and the same cursor as requested.
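A hedged sketch of that retry condition, not snscrape's actual internals; fetch_page is a hypothetical callable mapping a cursor to a (tweets, next_cursor) pair, and the retry limit is arbitrary:

    def fetch_all(fetch_page, first_cursor, max_retries=3):
        cursor = first_cursor
        retries = 0
        while True:
            tweets, next_cursor = fetch_page(cursor)
            yield from tweets
            if not tweets and next_cursor == cursor:
                # Twitter echoed our cursor back with no tweets: possible false end.
                retries += 1
                if retries > max_retries:
                    return  # give up and treat it as the true end
                continue  # retry the same cursor
            retries = 0
            if next_cursor is None:
                return
            cursor = next_cursor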
Thanks! So you now only wait a few seconds? You say it takes a few attempts... maybe a longer wait would be more robust? Also, not sure if this is of any interest, but here is my code to fetch as many tweets as you want in a certain language (and deal with various issues). I was able to get 20 million Hebrew tweets at a rate of 400k tweets per hour. It may be of interest to others using this great library. https://github.com/eyaler/hspell-wordlists/blob/main/har_zahav.py
The code does not wait at all between attempts at the moment. In my tests, that wasn't necessary. But there may be cases where it is, and I'm certainly curious to hear about them. Ultimately, it's a trade-off as short scrapes (i.e. ones with few results) would be needlessly blocked for a long time at the end, which isn't desirable.
snscrape twitter-search 'ProJared since:2019-05-01'
stopped much too early just now. No error or anything, it just stopped.