Closed GelatoGenesis closed 2 years ago
Funny that you mention this, I just saw this happening for the first time an hour ago. Specifically, scraping ChrisWarcraft takes about 20 minutes to finish without results.
However, I don't think there's anything wrong here on snscrape's side. Twitter's web interface is broken. It sometimes says 'no results' even though there actually are results because it stops paginating. snscrape keeps going until there is no next page anymore. Sometimes, pages can be empty. In these cases, all pages are empty. But it's impossible (to my knowledge) to know that without iterating over all pages.
In short, another upstream bug at Twitter.
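To make the pagination behaviour described above concrete, here is a toy sketch of a cursor loop. It is not snscrape's actual code: `paginate`, `fetch_page` and `max_empty_pages` are hypothetical names, and `fetch_page(cursor)` is assumed to return an `(items, next_cursor)` pair. The optional cut-off after a run of empty pages is only a heuristic. Since genuinely empty pages can occur in the middle of a real result set, it may drop results.

```python
def paginate(fetch_page, max_empty_pages=None):
    """Yield items page by page until the server stops returning a cursor.

    fetch_page(cursor) is a hypothetical callable returning (items, next_cursor).
    If max_empty_pages is set, give up after that many consecutive empty pages.
    """
    cursor = None
    empty_streak = 0
    while True:
        items, cursor = fetch_page(cursor)
        if items:
            empty_streak = 0
            yield from items
        else:
            empty_streak += 1
            if max_empty_pages is not None and empty_streak >= max_empty_pages:
                # Heuristic stop: may cut off results that come after a gap.
                return
        if cursor is None:
            # No next page: the normal end of pagination.
            return
```

Without `max_empty_pages`, a server that keeps handing out fresh cursors for empty pages keeps this loop running, which would match the long, result-free scrapes reported in this issue.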
Ok, understandable. As an alternative, I'm looking for a way to add a timeout to the loop. My use case is finding tweets from the last 24 hours that match different queries, and I'm using snscrape as a library in my Python code:
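import time
import snscrape.modules.twitter as sntwitter

response = []
# Cutoff timestamp; 3600 s is one hour, use 24 * 3600 for a full day
min_timestamp = round(time.time()) - 3600
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(search_query).get_items()):
    print("in loop")
    timestamp = round(tweet.date.timestamp())
    if timestamp < min_timestamp:
        # Stop at the first tweet older than the cutoff
        break
    response.append(tweet)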
Problem is I never actually reach "in loop" as snscrape never yields because no tweets are found. I'm running multiple such loops in parallel and I've had times when they all run into those "nasty" handles which kind of crashed my app. If I could somehow protect myself from those long running queries with a timeout (say give up if you've scrolled for N seconds) then I'd be kind of safe.
Yeah, that's not possible directly. You could move the scrape to a separate thread or process and kill that after a timeout (note though that snscrape makes no guarantees about thread safety currently, and the Twitter module is indeed not thread-safe by default), or you could subclass the scraper and hack something together with its undocumented innards. But that's too niche to be implemented officially.
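A minimal sketch of the separate-process idea, using only the standard library. `collect_with_timeout` and `_worker` are hypothetical helper names, not part of snscrape. The sketch assumes that the yielded items can be pickled (needed to pass them back through the queue) and that `None` is never a real item, because it is used as the end marker.

```python
import multiprocessing as mp
import queue
import time


def _worker(make_items, args, out):
    # Runs in the child process: forward every item, then an end marker.
    for item in make_items(*args):
        out.put(item)
    out.put(None)


def collect_with_timeout(make_items, args=(), timeout=60.0, start_method=None):
    """Collect items from make_items(*args) in a child process.

    Returns (items, timed_out). The child is terminated once `timeout`
    seconds have passed, even if it never yielded anything.
    """
    ctx = mp.get_context(start_method)  # None selects the platform default
    out = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(make_items, args, out), daemon=True)
    proc.start()
    deadline = time.monotonic() + timeout
    items, timed_out = [], True
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            item = out.get(timeout=remaining)
        except queue.Empty:
            break
        if item is None:
            # End marker: the generator finished on its own.
            timed_out = False
            break
        items.append(item)
    if proc.is_alive():
        proc.terminate()
    proc.join()
    return items, timed_out
```

With snscrape, `make_items` could be a small top-level function that returns `sntwitter.TwitterSearchScraper(query).get_items()`, called as `collect_with_timeout(that_function, (search_query,), timeout=30)`. With the spawn or forkserver start methods, that function has to live at the top level of an importable module and the calling code belongs under `if __name__ == "__main__":`. If the child crashes, no end marker arrives and the call reports a timeout.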
@JustAnotherArchivist Last question about thread safety. What do you mean when you say the Twitter module is not thread-safe?
Let's say I call sntwitter.TwitterSearchScraper(search_query).get_items() from multiple threads concurrently with different queries. Is there a danger that those queries mess each other up?
@TarmoKalling Yes, the Twitter module has global state (the guest token manager) which is not handled in a thread-safe way. Anything can happen when the token expires. I'm not certain that's the only problem either; thread safety was never a goal during development so far.
@JustAnotherArchivist What consequences does this have if one were to parallelize Twitter scraping externally (like you propose in another issue)? You mention the problem being the GuestTokenManager, which aside from being global also saves the token to a file in ~/.cache. I see that file is managed by filelock, but I'm still wondering whether the issue persists, since multiple instances of Python would still share that file.
Thanks a ton for your time and great work on this project.
Multiple instances of Python should work fine IIRC, but I think the problem is that the GuestTokenManager is global which could introduce a race condition.
That's correct. The token file is exclusive to the CLI. Scrapes run directly from Python use the bare GuestTokenManager, which is not concurrency-safe, so you'd need to use a separate instance for every parallel scrape. There may be other issues when it comes to thread safety, too. The CLI is not affected by that.
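Since the CLI is not affected, one way to parallelize with a timeout is to launch one CLI process per query. The sketch below assumes the `snscrape` executable is on `PATH` and is invoked as `snscrape --jsonl twitter-search QUERY`. The helper names (`run_with_timeout`, `parse_jsonl`, `scrape_cli`, `scrape_many`) are made up for illustration. The threads here only wait on child processes, so snscrape's own code never runs concurrently inside one interpreter.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_with_timeout(cmd, timeout):
    """Run cmd and return (stdout_text, timed_out); the child is killed on timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return proc.stdout, False
    except subprocess.TimeoutExpired as exc:
        out = exc.stdout or ""
        if isinstance(out, bytes):
            # Partial output may arrive as bytes even in text mode.
            out = out.decode("utf-8", errors="replace")
        return out, True


def parse_jsonl(text):
    """Parse one JSON object per line, skipping blank or truncated lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # A killed process can leave an incomplete final line.
            continue
    return records


def scrape_cli(query, timeout=60.0):
    # Assumption: the snscrape CLI is installed and on PATH.
    out, timed_out = run_with_timeout(
        ["snscrape", "--jsonl", "twitter-search", query], timeout
    )
    return parse_jsonl(out), timed_out


def scrape_many(queries, timeout=60.0, workers=4):
    # One CLI process per query; each thread just waits for its process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda q: scrape_cli(q, timeout), queries)
        return dict(zip(queries, results))
```

Each value in the dictionary returned by `scrape_many` is a `(records, timed_out)` pair, so a query that ran into the timeout can be told apart from one that finished with no results.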
If you try this:
It takes awfully long (1 minute 20 seconds) for a search term that returns no results. If I go to Twitter and enter the same query, I get: No results for "from:tiagobatista20_"
But this does not apply to all non-existing handles. For example:
I am not too familiar with the assumptions that snscrape relies on, but I debugged your code a bit and got the feeling it has something to do with the cursor that Twitter's API pagination returns. In each loop iteration it keeps returning a new cursor, so the code assumes there's more to come. But nothing ever comes. Maybe it's because this handle used to exist and doesn't anymore, so the Twitter API behaves strangely? Maybe I'm totally off here.