JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

Is there a total scrape timeout functionality? #391

Closed: atakankupeli closed this issue 2 years ago

atakankupeli commented 2 years ago

I am using the Twitter search scraper in a time-critical system with a large number of concurrent scrapes. For my use case to work, scraper instances must stop scraping once a certain amount of time has passed (return an empty list, raise an exception, etc.). I played with the "retry" and "timeout" parameters in the base scraper, but they did not seem to work. Is this an existing feature? If not, any idea how I might implement it?

p.s. Amazing work on the project by the way, really appreciate what you guys do.

JustAnotherArchivist commented 2 years ago

Not directly. The timeout parameter on Scraper._request is for an individual request, not total execution time, and it's not currently modifiable from outside.

Here's an idea of how you could approach this cleanly (untested):

import snscrape.modules.twitter
import sys
import time

startTime = time.time()
scraper = snscrape.modules.twitter.TwitterSearchScraper('...')
for tweet in scraper.get_items():
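    # Stop after a total runtime of 300 seconds (5 minutes)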
    if time.time() - startTime > 300:
        sys.exit(1)
    print(tweet.json())

That would exit after about 5 minutes. In the worst case, a request starts just before the 5-minute mark and times out repeatedly, so I think the overrun shouldn't exceed the request timeout (10 seconds) multiplied by the number of retries. If you need a hard upper bound on the total runtime, you could simply subtract that worst-case overrun from the limit; it should be quite accurate then, although still not perfect.

An alternative would be to solve this from the outside: run snscrape in a separate thread or process and kill it when it exceeds the required runtime, for example, or run it as a background process in bash.
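If you go the separate-process route, a rough, untested sketch using the standard library's multiprocessing module might look like this (the query string '...' is a placeholder as above, and scrape_query is just an illustrative name):

import multiprocessing
import snscrape.modules.twitter

def scrape_query(query):
    # Worker process: run the scrape and print each item as JSON
    scraper = snscrape.modules.twitter.TwitterSearchScraper(query)
    for tweet in scraper.get_items():
        print(tweet.json())

if __name__ == '__main__':
    proc = multiprocessing.Process(target=scrape_query, args=('...',))
    proc.start()
    proc.join(timeout=300)  # wait at most 5 minutes for the scrape to finish
    if proc.is_alive():
        # Time budget exhausted: stop the worker immediately
        proc.terminate()
        proc.join()

Unlike the in-loop check above, terminate() stops the worker even in the middle of a request, so the deadline is enforced regardless of retries; the trade-off is that any in-flight item is lost.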

JustAnotherArchivist commented 2 years ago

Closing this as it's better solved from outside of snscrape.