JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

No Request Limits? Proxy needed? #76

Closed · WhiteLin3s closed this issue 4 years ago

WhiteLin3s commented 4 years ago

Hey there,

Two weeks ago, I set up a Twitter scraper that used Docker to run e.g. 10 workers/scrapers simultaneously with the GetOldTweets3 package, continuously rotating through proxies within the Tor network. I assigned single days with a query as tasks to my workers and modified the GOT3 package so that it keeps scraping/finishing a day even if a request limit occurs during the day/task, by forcing the scraper to retry with a new proxy from the network. Then GOT3 broke down.

Now I am trying to integrate your scraper into my Docker framework. I tried to understand the source code and got the idea that I could add the proxy parameter to the requests it performs myself. But then I got stuck, because I don't really understand what happens with the "TwitterSearchScraper" if it runs into a rate limit. I just wanted to watch the behaviour and let it scrape two days of Bitcoin during the hype at the end of 2017, which must have been tens of thousands of tweets, most certainly over 100,000. But it finished the task without running into any request limit, unlike GOT3, which hit one after scraping approx. 10,000-15,000 tweets in a given timeframe. Moreover, it was blazingly fast. So my questions: Does it ever run into rate limits? Is a proxy (or rotating proxies) therefore even needed when I scrape several years of a specific term? How does it behave if it runs into a limit? Does it just stop scraping then? If so, how could I force it to keep running with a new proxy?

I know that's many questions, but hope you find the time. Keep up the great work.

Thanks!

JustAnotherArchivist commented 4 years ago

I have not encountered any significant rate limit issues on Twitter. I had one related crash last month after 1.4 million results, but the resumption (using --cursor) worked fine for 7.9 million tweets, so it was probably just a temporary problem on Twitter's side. I also haven't had issues running several scrapes in parallel.

snscrape tries to handle rate limit errors, but sometimes it can't and crashes. This is the case in particular when you get your IP banned on Instagram. Resuming the scrape is not possible on all modules; for Twitter specifically, you can use --cursor (if you know the cursor, which should appear somewhere in the crash dump file) or add an until:2019-01-01 term to the search. Regarding using a proxy, see #74.
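
For illustration, a rough sketch of the until:-based resumption from the Python side (the query is just an example; the date attribute on the tweet objects and the newest-first ordering of results are assumptions here, so treat this as a sketch rather than a recipe):

import datetime
import snscrape.modules.twitter

query = 'bitcoin since:2017-11-01 until:2018-01-01'
last_date = None
try:
    for tweet in snscrape.modules.twitter.TwitterSearchScraper(query).get_items():
        last_date = tweet.date  # remember the last tweet seen before a potential crash
        print(tweet.id)
except Exception as e:
    print('scrape crashed:', e)

# Resume by narrowing the window. Using the day after the last seen tweet avoids a gap
# at the cost of re-fetching a few duplicates, since until: is exclusive.
if last_date is not None:
    resume_until = (last_date.date() + datetime.timedelta(days=1)).isoformat()
    resume_scraper = snscrape.modules.twitter.TwitterSearchScraper(f'bitcoin since:2017-11-01 until:{resume_until}')
    for tweet in resume_scraper.get_items():
        print(tweet.id)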

WhiteLin3s commented 4 years ago

Alright, thanks so far. My workers already only receive single days. There's a queue with all the single days of the timeframe that should be scraped; a worker gets assigned a day, scrapes the data for the query for that day, and is then able to get assigned a new day from the queue. It just shouldn't crash while scraping a single day. But still, if an exception gets raised, I just put the day back in the queue.
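
For clarity, the pattern is roughly this (a sketch; scrape_single_day is a hypothetical helper that wraps the actual scraping of one day):

import datetime
import queue

query = 'bitcoin'
start = datetime.date(2017, 11, 1)

day_queue = queue.Queue()
for offset in range(61):  # e.g. two months of single-day tasks
    day_queue.put(start + datetime.timedelta(days=offset))

def scrape_single_day(query, day):
    # hypothetical helper: run the scraper for this query restricted to this one day
    ...

def worker():
    while True:
        try:
            day = day_queue.get_nowait()
        except queue.Empty:
            return
        try:
            scrape_single_day(query, day)
        except Exception:
            day_queue.put(day)  # on failure, put the day back into the queue for a retry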

But my problem now is: I set everything up without using proxies, but I immediately kept running into "Unable to find guest token" errors. Why does that happen? What is it related to, and how do I fix it? And out of curiosity: where do I find the crash dump files?

JustAnotherArchivist commented 4 years ago

Interesting. The crash I mentioned was also that error, but I could never reproduce it, even with multiple scrapes at once. It might well be a rate limit on the token generator. How many scrapes are you running in parallel?

The crash dump file is mentioned in a CRITICAL log message on the crash. The dumps are put in the default directory for temporary files, which should be /tmp on most Unix-like systems.

WhiteLin3s commented 4 years ago

I started 10 Docker containers in parallel. They are probably producing too many requests at once. I will try to lower the number of workers.

I don't fully understand the token yet. Is it generated once when starting the scrape for a query, or multiple times when pagination is involved? If it is only generated once at the beginning, I could also try longer time intervals as tasks, e.g. weeks instead of days. That would mean fewer queries and therefore fewer initialisations.

JustAnotherArchivist commented 4 years ago

It's requested once at the beginning and then again from time to time during pagination. I think it's usually every couple hundred requests, but I never did any detailed analysis on it. A staggered start instead of all at once might also be sufficient.
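
A staggered start could be as simple as sleeping between worker launches, e.g. (a sketch; the worker body is a hypothetical placeholder for whatever each container/thread does):

import threading
import time

def worker():
    # hypothetical per-worker loop, e.g. pulling days from a queue and scraping them
    ...

threads = []
for _ in range(10):
    t = threading.Thread(target=worker)
    t.start()
    threads.append(t)
    time.sleep(30)  # stagger the starts so the workers don't all request a guest token at once

for t in threads:
    t.join()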

WhiteLin3s commented 4 years ago

Alright, thank you so far. I'm playing around with it. By the way, I noticed how you can reproduce the error, or at least I think so. Let's say you use a Python prompt and execute:

import snscrape.modules.twitter

scraper = snscrape.modules.twitter.TwitterSearchScraper("bitcoin")
for tweet in scraper.get_items():
    print(tweet.notExistingAttribute)

so that the scraper starts and then stops (this can probably also be triggered by running it with valid code and pressing Ctrl+C to interrupt while scraping). If you now try to start exactly this instance of the scraper again without creating a new one, e.g.:

for tweet in scraper.get_items():
    print(tweet.id)

it gives you the "Unable to find guest token" exception. So calling the same scraper instance multiple times seems to be problematic. I don't know yet how it behaves if the scrape was successful and complete. I will also check whether there is a problem in my setup that causes the same instance to be called multiple times.

JustAnotherArchivist commented 4 years ago

I see. Yeah, I never intended Scraper instances to be reused. It might work for some of them but is definitely unsupported. I recommend creating a new object instead. But if you really want to play with fire, in the case of the Twitter scrapers as of the current code, you could try scraper._unset_guest_token() before running the second loop.
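
In code, that means roughly the following (a sketch; _unset_guest_token is the internal, unsupported workaround mentioned above and may change or disappear):

import snscrape.modules.twitter

# Supported: create a fresh scraper object for every scrape.
for tweet in snscrape.modules.twitter.TwitterSearchScraper('bitcoin').get_items():
    print(tweet.id)

# Second scrape: a new instance instead of reusing the first one.
for tweet in snscrape.modules.twitter.TwitterSearchScraper('bitcoin').get_items():
    print(tweet.id)

# Unsupported workaround if an instance does get reused (internal method):
scraper = snscrape.modules.twitter.TwitterSearchScraper('bitcoin')
for tweet in scraper.get_items():
    print(tweet.id)
scraper._unset_guest_token()
for tweet in scraper.get_items():
    print(tweet.id)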

JustAnotherArchivist commented 4 years ago

I didn't recall this previously, but if you're using AWS, you won't be able to scrape Twitter and would always run into the 'Unable to find guest token' error. See #79.

lukaspistelak commented 4 years ago

import snscrape.modules.twitter

# q is the search query (defined elsewhere in my script)
try:
    tweets_objs = snscrape.modules.twitter.TwitterSearchScraper(query=q).get_items()
except Exception as e:
    print("error search ", e)

for tweet_obj in tweets_objs:
    user_info_full_name = tweet_obj.user.displayname

After approximately 1000 calls, I get this error:

    for tweet_obj in tweets_objs:                                        
  File "/usr/local/lib/python3.6/dist-packages/snscrape/modules/twitter.py", line 446, in get_items
    for obj in self._iter_api_data('https://api.twitter.com/2/search/adaptive.json', params, paginationParams):
  File "/usr/local/lib/python3.6/dist-packages/snscrape/modules/twitter.py", line 230, in _iter_api_data
    obj = self._get_api_data(endpoint, reqParams)
  File "/usr/local/lib/python3.6/dist-packages/snscrape/modules/twitter.py", line 210, in _get_api_data
    self._ensure_guest_token()
  File "/usr/local/lib/python3.6/dist-packages/snscrape/modules/twitter.py", line 191, in _ensure_guest_token
    raise snscrape.base.ScraperException('Unable to find guest token')
snscrape.base.ScraperException: Unable to find guest token

I don't understand why the error refers to the for loop over the objects?

JustAnotherArchivist commented 4 years ago

@lukaspistelak If that happens consistently, your proxy is probably getting blocked from Twitter. Also, get_items is a generator, which is why your try-except doesn't catch the error.
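
To actually catch it, the try/except has to wrap the iteration itself, roughly like this (a sketch using the names from the snippet above, with 'bitcoin' standing in for the query):

import snscrape.base
import snscrape.modules.twitter

scraper = snscrape.modules.twitter.TwitterSearchScraper(query='bitcoin')
try:
    for tweet_obj in scraper.get_items():
        user_info_full_name = tweet_obj.user.displayname
except snscrape.base.ScraperException as e:
    print("error search ", e)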

ppival commented 4 years ago

Hmm, I consistently get rate-limited with my Twitter searches using snscrape from the command line! Usually I only get one block like the one below, but sometimes as many as three before it finally gives up and drops me to the command line. Is there anything I can do to avoid this? I'm running from within a virtual machine, fwiw...

> snscrape twitter-hashtag 'abolishice since:2018-06-15 until:2018-07-15' > abolishice_2018b.txt
2020-09-30 11:32:16.074 WARNING snscrape.base Error retrieving https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweets=true&q=%23abolishice+since%3A2018-06-15+until%3A2018-07-15&tweet_search_mode=live&count=100&query_source=spelling_expansion_revert_click&cursor=scroll%3AthGAVUV0VFVBaKgLDB1fzgghwWgMC77cjz1aEcEjUAFQAlABEVuKs3FYCJehgHREVGQVVMVBUAFQAVARUAFQAA&pc=1&spelling_corrections=1&ext=ext%3DmediaStats%252ChighlightedLabel: rate-limited, retrying

JustAnotherArchivist commented 4 years ago

@ppival Right, I should've explained it better: there are rate limits, but they're not really relevant and snscrape works around them. What you're describing is exactly what's supposed to happen on large scrapes. snscrape encounters a rate limit and retries the same cursor after a bit. As long as you only get WARNING lines (with 'retrying' at the end), everything's fine. snscrape will crash loudly otherwise, so you'll notice that. If you enable --verbose, you can see in more detail what's happening.
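
When using the library directly from Python rather than the CLI, roughly the same detail can be obtained through the standard logging module, since snscrape logs through it (a sketch; the exact set of messages shown by --verbose may differ):

import logging
import snscrape.modules.twitter

# Show snscrape's INFO/DEBUG messages, including the retry handling around rate limits.
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(levelname)s %(name)s %(message)s')

query = '#abolishice since:2018-06-15 until:2018-07-15'
for tweet in snscrape.modules.twitter.TwitterSearchScraper(query).get_items():
    print(tweet.id)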

ppival commented 4 years ago

Oh that is SO much better! Any chance that last line could be included by default even in non-verbose mode?

Done, found 61223 results


Tachevaz commented 4 years ago

I've been running into the "Unable to find guest token" exception a lot today. In the past 6 days, I've been able to scrape tens of thousands of users' timelines in a single run without using proxies or anything else, but today I can't really make it past ~90 requests at a time. I tried changing my IP address and including sleep time between requests, but neither seems to work. Does anyone have suggestions for sending multiple user timeline requests without using Docker?

JustAnotherArchivist commented 4 years ago

@ppival snscrape follows the Rule of Silence. I might drop the level of the 'retrying' message from WARNING to INFO though. I also wouldn't be opposed to adding an option which enables printing the 'Done' message on stderr, but not by default.

@Tachevaz I can't reproduce this, even on an IP that is constantly accessing Twitter at a significant rate (not scraping though). A test scrape just now of about 93k tweets succeeded with 5 token extractions and no errors. I guess it must have something to do with the IP you're scraping from. It's interesting though that it blocks you only on the second(?) token request. 90 requests also seems a bit low; in my experience, it should be around 200 requests per token.