Open Meorge opened 2 years ago
Well, recently JustAnotherArchivist has added an amazing feature. snscrape now reuses guest tokens across sessions. This prevents rate-limiting from burning through too many guest tokens.
Personally I have not been banned, and I've been downloading thousands of tweets recursively...
I recommend creating an "Ignore File" and writing out the tweet data every 5 tweets or so, so that you don't have to redo all your progress if/when you get banned (2.2 million tweets is a lot). I did something similar in my recursive tweet downloading script.
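For anyone wanting to try the "Ignore File" idea, here is a minimal sketch of how it could look. This is not the script mentioned above: the function names, the JSON Lines output format, and the `(tweet_id, record)` input shape are all assumptions for illustration, and the code deliberately avoids calling snscrape so it can be fed from whatever iterator you use.

```python
import json
from pathlib import Path


def load_ignored(ignore_path):
    """Return the set of tweet IDs already saved in earlier runs (hypothetical helper)."""
    path = Path(ignore_path)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def save_with_checkpoints(tweets, out_path, ignore_path, every=5):
    """Append new tweets to out_path, flushing to disk every `every` tweets.

    `tweets` is assumed to be an iterable of (tweet_id, record) pairs, where
    record is a JSON-serialisable dict. Returns the number of tweets written.
    """
    ignored = load_ignored(ignore_path)
    batch = []
    written = 0

    def flush():
        nonlocal written
        if not batch:
            return
        # Write the data and the matching IDs together so a restart can skip them.
        with open(out_path, "a", encoding="utf-8") as out, \
                open(ignore_path, "a", encoding="utf-8") as ign:
            for tweet_id, record in batch:
                out.write(json.dumps(record) + "\n")
                ign.write(f"{tweet_id}\n")
        written += len(batch)
        batch.clear()

    for tweet_id, record in tweets:
        tweet_id = str(tweet_id)
        if tweet_id in ignored:
            continue  # already saved in a previous run (or earlier in this one)
        ignored.add(tweet_id)
        batch.append((tweet_id, record))
        if len(batch) >= every:
            flush()
    flush()  # write whatever is left at the end
    return written
```

If the run is interrupted, at most `every - 1` tweets are lost, and rerunning with the same two files skips everything already recorded in the ignore file.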
Based on my past experience, that should be fine. I've scraped many millions of tweets before in parallel without problems. I usually split such big runs up into monthly scrapes using a search query like keyword since:2022-01-01 until:2022-02-01 to fetch tweets from this January. Then I iterate over the months and finally check whether each monthly output file contains the expected results (e.g. whether the last result is close to midnight on the 1st).
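The monthly split described above could be generated along these lines. This is only a sketch: the helper names are made up for illustration, the one-hour tolerance in the sanity check is an arbitrary assumption, and the check assumes results come back newest first with timezone-aware UTC timestamps.

```python
from datetime import date, datetime, timedelta, timezone


def month_windows(start, end):
    """Yield (since, until) date pairs, one per calendar month, until `end` is covered.

    Windows are aligned to the 1st of each month, so the first window begins on
    the 1st of start's month and the last one may extend past `end`.
    """
    current = date(start.year, start.month, 1)
    while current < end:
        if current.month == 12:
            nxt = date(current.year + 1, 1, 1)
        else:
            nxt = date(current.year, current.month + 1, 1)
        yield current, nxt
        current = nxt


def monthly_queries(keyword, start, end):
    """Build one search query string per month in the since:/until: form."""
    return [
        f"{keyword} since:{since.isoformat()} until:{until.isoformat()}"
        for since, until in month_windows(start, end)
    ]


def reaches_window_start(last_result_time, since, tolerance=timedelta(hours=1)):
    """Check that the last (oldest) result is close to midnight UTC on `since`.

    `tolerance` is an assumed threshold; a sensible value depends on how
    frequently the keyword is tweeted.
    """
    window_start = datetime(since.year, since.month, since.day, tzinfo=timezone.utc)
    return timedelta(0) <= last_result_time - window_start <= tolerance
```

For example, `monthly_queries("keyword", date(2022, 1, 1), date(2022, 3, 1))` gives one query for January and one for February, each of which can be run as a separate scrape and checked independently.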
See also #307
Thank you for the info! Perhaps I'm in the minority on this, but it might be helpful to others to include this sort of anecdotal information somewhere, so that people have a better idea on how much they can expect to use snscrape before being in danger of getting banned/rate limited? (Apologies if it's already available somewhere and I just didn't see it!)
It's mentioned in some issues but not prominently. Documentation is WIP, and I agree it may be worth including some vague notes about it there.
I have been rate-limited by IP with a cooldown of half an hour to an hour; AFAICT it is not possible to get banned.
I've never been rate-limited before, at least that I've noticed.
My memory is hazy, but it took... maybe 10-100 concurrent threads.
Ok then that explains it, lol. I only have a few at a time
4 requests to https://api.twitter.com/2/search/adaptive.json? include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=from%3A____lifestyle&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe failed, giving up. Errors: blocked (403), blocked (403), blocked (403), blocked (403)
snscrape was working fine, but suddenly I am getting these errors. Can I know why, and how to fix it?
Thanks
At a high level, we are working to prevent these accounts from 1) scraping people’s public Twitter data to build AI models and 2) manipulating people and conversation on the platform in various ways.
https://business.twitter.com/en/blog/update-on-twitters-limited-usage.html
So even if it wasn't a problem before, it could be a problem now.
I'm currently trying to use snscrape to download Tweets from Twitter. According to my calculations, I should be getting around 2,200,000 Tweets in total by the time it finishes. I'm concerned about the possibility of getting IP banned from Twitter as a result of this. Is this something worth being concerned about, or should I not worry?
More generally:
This tool seems like a godsend, compared to the limits of the official Twitter API. Having a more solid understanding of the "safe zone" would make me feel more comfortable with using it. I know that maintainers can't guarantee anything about rate limits or IP bans, but if anyone has experience with where they begin to set in, knowing that would help a lot!