JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

Scraping Tweets of certain handles via search takes very long time although the handle has no Tweets #584

Closed GelatoGenesis closed 2 years ago

GelatoGenesis commented 2 years ago

If you try this:

    time snscrape twitter-search from:tiagobatista20_
    snscrape twitter-search from:tiagobatista20_  2.22s user 0.36s system 3% cpu 1:20.43 total

It takes awfully long (1 min 20 seconds) for a search term that returns no results. If I go to Twitter and enter the same query, I get: No results for "from:tiagobatista20_"

But this does not apply to all non-existing handles. For example:

    time snscrape twitter-search from:non-ex-gibberish-handle
    snscrape twitter-search from:non-ex-gibberish-handle  0.17s user 0.03s system 12% cpu 1.584 total

I am not too familiar with the assumptions snscrape relies on, but I debugged a bit of your code and got the feeling it has something to do with the cursor that Twitter's API pagination returns. On each iteration it keeps returning a new cursor, so the code assumes there's more to come. But nothing ever comes. Maybe it's because this handle used to exist and doesn't anymore, so the Twitter API behaves strangely? Maybe I'm totally off here.
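The behaviour described above can be sketched with a toy cursor-pagination loop. This is hypothetical illustration code, not snscrape's actual implementation: as long as the server hands back a next cursor, the client cannot distinguish "more results coming" from "an endless run of empty pages".

```python
from typing import Iterator, Optional, Tuple

# Hypothetical stand-in for a paginated API: it serves 1000 empty pages,
# each with a fresh cursor, before finally omitting the cursor.
def fetch_page(cursor: Optional[str]) -> Tuple[list, Optional[str]]:
    n = int(cursor or 0)
    next_cursor = str(n + 1) if n < 1000 else None
    return [], next_cursor  # no items on any page

def paginate() -> Iterator[object]:
    cursor = None
    pages = 0
    while True:
        items, cursor = fetch_page(cursor)
        pages += 1
        yield from items          # nothing is ever yielded here
        if cursor is None:        # only a missing cursor ends the loop
            break
    print(f"fetched {pages} pages, 0 items")

list(paginate())  # walks every page before returning, yielding nothing
```

The loop is correct from the client's point of view; the wasted time comes entirely from the server continuing to hand out cursors for empty pages.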

JustAnotherArchivist commented 2 years ago

Funny that you mention this, I just saw this happening for the first time an hour ago. Specifically, scraping ChrisWarcraft takes about 20 minutes to finish without results.

However, I don't think there's anything wrong on snscrape's side here. Twitter's web interface is broken: it sometimes says 'no results' even though there actually are results, because it stops paginating. snscrape keeps going until there is no next page anymore. Individual pages can be empty, and in these cases, all pages are empty. But as far as I know, it's impossible to tell that without iterating over all of them.

In short, another upstream bug at Twitter.

GelatoGenesis commented 2 years ago

Ok, understandable. As an alternative, I'm looking for a way to add a timeout to the loop. My use case is collecting tweets from the last 24 hours that match different queries, and I'm using snscrape as a library in my Python code:

    import time
    import snscrape.modules.twitter as sntwitter

    min_timestamp = round(time.time()) - 24 * 3600  # cutoff: 24 hours ago
    response = []
    for tweet in sntwitter.TwitterSearchScraper(search_query).get_items():
        print("in loop")
        timestamp = round(tweet.date.timestamp())
        if timestamp < min_timestamp:  # results come newest-first, so stop here
            break
        response.append(tweet)

Problem is I never actually reach "in loop", because snscrape never yields when no tweets are found. I'm running several such loops in parallel, and at times they have all run into those "nasty" handles, which effectively crashed my app. If I could somehow protect myself from those long-running queries with a timeout (say, give up after scrolling for N seconds), I'd be reasonably safe.

JustAnotherArchivist commented 2 years ago

Yeah, that's not possible directly. You could move the scrape to a separate thread or process and kill that after a timeout (note though that snscrape makes no guarantees about thread safety currently, and the Twitter module is indeed not thread-safe by default), or you could subclass the scraper and hack something together with its undocumented innards. But that's too niche to be implemented officially.
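The process-and-kill approach can be sketched as follows, assuming the scrape is wrapped in a worker function. `slow_scrape` is a hypothetical stand-in for a `get_items()` call that never yields:

```python
import multiprocessing as mp
import time
from queue import Empty
from typing import Iterator

def slow_scrape() -> Iterator[int]:
    # Stand-in for a scraper stuck on empty pages: it never yields anything.
    while True:
        time.sleep(0.1)

def _worker(q: "mp.Queue") -> None:
    for item in slow_scrape():
        q.put(item)
    q.put(None)  # sentinel: the scrape finished on its own

def scrape_with_timeout(timeout: float) -> list:
    q: "mp.Queue" = mp.Queue()
    proc = mp.Process(target=_worker, args=(q,), daemon=True)
    proc.start()
    items = []
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            proc.terminate()  # give up on a scrape that scrolls too long
            break
        try:
            item = q.get(timeout=remaining)
        except Empty:
            continue  # re-check the deadline
        if item is None:
            break  # finished before the deadline
        items.append(item)
    proc.join()
    return items

if __name__ == "__main__":  # guard needed under spawn-based start methods
    print(scrape_with_timeout(0.5))  # gives up after ~0.5 s with no items
```

In real code, `_worker` would iterate over `TwitterSearchScraper(query).get_items()` instead of `slow_scrape()`. Killing the process with `terminate()` is reasonable here because the worker holds no state the parent needs; items collected before the deadline are preserved in the queue.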

GelatoGenesis commented 2 years ago

@JustAnotherArchivist Last question, about thread safety. What do you mean by the Twitter module not being thread-safe? Let's say I call `sntwitter.TwitterSearchScraper(search_query).get_items()` from multiple threads concurrently with different queries. Is there a danger that those queries mess each other up?

JustAnotherArchivist commented 2 years ago

@TarmoKalling Yes, the Twitter module has global state (the guest token manager) which is not handled in a thread-safe way. Anything can happen when the token expires. I'm not certain that's the only problem either; thread safety was never a goal during development so far.

bipsen commented 2 years ago

@JustAnotherArchivist What consequences does this have if one were to parallelize Twitter scraping externally (as you propose in another issue)? You mention the problem being the GuestTokenManager, which, aside from being global, also saves the token to a file in ~/.cache. I see that file is managed by filelock, but I'm still wondering whether the issue persists, as multiple instances of Python would still share that file.

Thanks a ton for your time and great work on this project.

TheTechRobo commented 2 years ago

Multiple instances of Python should work fine IIRC, but I think the problem is that the GuestTokenManager is global which could introduce a race condition.

JustAnotherArchivist commented 2 years ago

That's correct. The token file is exclusive to the CLI. Scrapes run directly from Python use the bare GuestTokenManager, which is not concurrency-safe, so you'd need a separate instance for every parallel scrape. There may be other thread-safety issues as well. The CLI is not affected by that.
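Following that advice, a simple way to stay safe is one process per scrape, so no interpreter state (including a module-global token manager) is shared at all. `run_query` below is a hypothetical stand-in; in real code it would construct a fresh `TwitterSearchScraper` inside each call:

```python
from concurrent.futures import ProcessPoolExecutor

def run_query(query: str) -> list:
    # Hypothetical stand-in: in real code, build a new scraper here
    # (one per call) so each process keeps its own token-manager state.
    return [f"{query}:result{i}" for i in range(2)]

def scrape_all(queries: list) -> dict:
    # One worker process per scrape: queries share no Python-level state.
    with ProcessPoolExecutor(max_workers=4) as pool:
        return dict(zip(queries, pool.map(run_query, queries)))

if __name__ == "__main__":
    print(scrape_all(["from:a", "from:b"]))
```

Thread-based parallelism would instead require constructing a separate GuestTokenManager per scraper, per the comment above; process isolation sidesteps the question entirely.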