I ran this scrape to over 34000 results and could not reproduce the crash.
```
KeyError                                  Traceback (most recent call last)
Input In [31], in <cell line: 2>()
      5 search = k + ' near:"New York" within:50km since:2021-01-01 until:2022-09-30'
      6 for j in range(10):
----> 7     for i, tweet in enumerate(sntwitter.TwitterSearchScraper(search).get_items()):
      8         if i > 5000:
      9             break

File ~\anaconda3\lib\site-packages\snscrape\modules\twitter.py:1454, in TwitterSearchScraper.get_items(self)
   1451     del paginationParams['tweet_search_mode']
   1453 for obj in self._iter_api_data('https://api.twitter.com/2/search/adaptive.json', _TwitterAPIType.V2, params, paginationParams, cursor = self._cursor):
-> 1454     yield from self._v2_timeline_instructions_to_tweets(obj)

File ~\anaconda3\lib\site-packages\snscrape\modules\twitter.py:804, in _TwitterAPIScraper._v2_timeline_instructions_to_tweets(self, obj, includeConversationThreads)
    802 for entry in entries:
    803     if entry['entryId'].startswith('sq-I-t-') or entry['entryId'].startswith('tweet-'):
--> 804         yield from self._v2_instruction_tweet_entry_to_tweet(entry['entryId'], entry['content'], obj)
    805     elif includeConversationThreads and entry['entryId'].startswith('conversationThread-') and not entry['entryId'].endswith('-show_more_cursor'):
    806         for item in entry['content']['timelineModule']['items']:

File ~\anaconda3\lib\site-packages\snscrape\modules\twitter.py:827, in _TwitterAPIScraper._v2_instruction_tweet_entry_to_tweet(self, entryId, entry, obj)
    825 else:
    826     raise snscrape.base.ScraperException(f'Unable to handle entry {entryId!r}')
--> 827 yield self._tweet_to_tweet(tweet, obj)

File ~\anaconda3\lib\site-packages\snscrape\modules\twitter.py:1266, in _TwitterAPIScraper._tweet_to_tweet(self, tweet, obj)
   1265 def _tweet_to_tweet(self, tweet, obj):
-> 1266     user = self._user_to_user(obj['globalObjects']['users'][tweet['user_id_str']])
   1267     kwargs = {}
   1268     if 'retweeted_status_id_str' in tweet:

File ~\anaconda3\lib\site-packages\snscrape\modules\twitter.py:1367, in _TwitterAPIScraper._user_to_user(self, user, id_)
   1365 kwargs['profileImageUrl'] = user['profile_image_url_https']
   1366 kwargs['profileBannerUrl'] = user.get('profile_banner_url')
-> 1367 if 'ext' in user and (label := user['ext']['highlightedLabel']['r']['ok'].get('label')):
   1368     kwargs['label'] = self._user_label_to_user_label(label)
   1369 return User(**kwargs)

KeyError: 'highlightedLabel'
```
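For reference, the failing cell boils down to something like the following runnable sketch; the value of the keyword variable `k` and the collection list are assumptions, since the full cell isn't shown in the traceback:

```python
# Minimal sketch of the loop from the cell excerpt above. The value of
# `k` is a hypothetical placeholder; only the search filters and the
# 5000-item cap come from the traceback.
import snscrape.modules.twitter as sntwitter

k = 'flu'  # hypothetical keyword
search = k + ' near:"New York" within:50km since:2021-01-01 until:2022-09-30'

tweets = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(search).get_items()):
    if i > 5000:
        break
    tweets.append(tweet)
```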
I got the same error when trying to get tweets using keywords. The snscrape version is 0.4.3.20220107.dev64+g59abeaf and the Python version is 3.9.12.
I also don't get the same number of tweets every time I run the same query. Sometimes I get up to 15,000 results and sometimes only 50. In my latest attempt I got this error.
I am getting this same error. I have used snscrape before and this did not happen. Is this a Twitter issue?
Yeah, it sounds like a temporary issue with the data Twitter returns, namely that they omit the label sometimes (similar to #507). But until I get a reproducible example or a dump file, I can't fix this, I'm afraid. I haven't seen it in my own scrapes.
Here is one of the dump files: snscrape_locals_4h1p8euj.txt
Thank you! I see what the problem is. Will fix tomorrow.
For clarification, it's not a temporary issue with Twitter's data. Rather, any users that have some 'extension' but no label cause a crash. In the example above, it's the NFT avatar nonsense, but there might be others, too.
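In other words, the check at twitter.py:1367 assumes every 'ext' object contains a 'highlightedLabel' key. A hedged sketch of one defensive rewrite, not necessarily the actual committed fix, would chain .get() calls so a missing key yields None instead of a KeyError:

```python
# Hypothetical defensive version of the lookup that crashes at
# snscrape/modules/twitter.py:1367. extract_label() is an illustrative
# helper, not part of snscrape.
def extract_label(user: dict):
    ext = user.get('ext') or {}
    return (((ext.get('highlightedLabel') or {}).get('r') or {}).get('ok') or {}).get('label')

# A user with some extension but no label no longer raises KeyError
# (the payload shape below is made up for illustration):
print(extract_label({'ext': {'hasNftAvatar': {'r': {'ok': False}}}}))  # -> None
```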
@JustAnotherArchivist
How can we skip this crash? Thanks!
@koufopoulosf By waiting for the commit that fixes it. :-)
@JustAnotherArchivist Can't wait for this. Is there any quick fix maybe?
@koufopoulosf Please get in touch with an offer for your desired level of paid support then...
@JustAnotherArchivist Do you have an email?
@koufopoulosf Sure do, but like you, I don't want to post that publicly. You can find me on IRC and Reddit, see here.
To get back to the issue at hand: I have implemented a fix locally, but unfortunately I'm still unable to reproduce the bug and therefore test the fix properly. Maybe Twitter is doing A/B testing with this? Will look into it further on the weekend. If anyone has more details on how to reproduce this, please let me know.
Hey, yes! I just noticed as well! I don't have this issue at the moment, while previously all my scrapers crashed. Now it seems it just skips values: I've set it up to batch-insert every 5000 iterations, and it ends up storing a few less than x*5000 tweets (maybe it skips the ones with the issue?). I am not sure what's happening.
P.S. @JustAnotherArchivist I've sent you a message on Reddit :)
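For reference, the batching setup described above has roughly this shape; insert_batch() is a hypothetical stand-in for the real database write, and the query is a placeholder:

```python
# Sketch of a batched-insert loop around get_items(); insert_batch()
# is a hypothetical placeholder for a real database write.
import snscrape.modules.twitter as sntwitter

BATCH_SIZE = 5000

def insert_batch(rows):
    print(f'inserting {len(rows)} tweets')  # stand-in for the real insert

batch = []
for tweet in sntwitter.TwitterSearchScraper('bitcoin since:2022-01-01').get_items():
    batch.append(tweet)
    if len(batch) == BATCH_SIZE:
        insert_batch(batch)
        batch = []
if batch:
    insert_batch(batch)  # flush the final partial batch
```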
@JustAnotherArchivist If you add a counter, you will see the difference between the number of tweets this library was supposed to scrape and the number it actually scraped.
However, it didn't crash for some time yesterday, and now it crashes again...
The thing is, it crashes even though I am not even fetching the label value, which is the one causing the crash. Is it possible that it would work if I just removed everything related to the label value from the library's code?
That would be a quick fix, but there might be a lot of code that relies on it.
@TheTechRobo I think it's unavoidable... I wonder whether the lib would be faster if the unnecessary fetches were disabled. I'll give it a try.
@JustAnotherArchivist I'm looking at the code; I think I've found how to increase the performance :) I'm going to do some testing.
@koufopoulosf I have no idea how a counter is supposed to help. The Twitter module processes all tweets returned by Twitter. If anything gets skipped, a warning log message is emitted. If anything goes wrong, it blows up with an error. Counting being off does not help with debugging this in any way. Twitter does sometimes return fewer results than you requested, by the way. snscrape doesn't care and just continues through the pagination.
There are no unnecessary fetches. Twitter returns all that information at once. snscrape's requests are pretty much the ones made by Twitter's web interface as well, and that's intentional. Modifying the parameters could prevent this exception but is not desirable.
(Also, I didn't get a message from you on Reddit.)
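Since snscrape uses Python's standard logging module, the warning messages mentioned above only become visible once logging is configured, for example:

```python
# Enable standard logging so snscrape's warnings about skipped entries
# become visible on stderr.
import logging
logging.basicConfig(level=logging.WARNING)
```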
@JustAnotherArchivist I mentioned the counter because I used one with snscrape, and when it stopped scraping, the number of fetched results didn't match the counter. I don't know why. I'll test it again in a few hours.
Regarding reddit, I'm pretty sure that I've sent you one. Have you checked all of your messages?
Did you send it via chat or PMs? There's a difference, and chat is only available on new Reddit, which not everybody uses.
@koufopoulosf I don't understand what you were doing exactly with counters, but it doesn't sound right. In any case, it's unrelated to this crash. snscrape does not silently swallow errors and omit results.
I checked the messages on the account linked there, yes. Either you sent it to the wrong user, or you used the chat or whatever thing Reddit invented this week, not messages.
Ah, that would be my bad... I'm getting back to my PC in a while. I'll try all the messaging options.
@JustAnotherArchivist Could you please share a quick fix for this issue, i.e. which line I can safely remove to avoid it? I'm not fetching any label value (besides sourceLabel) and it still crashes... I thought Twitter was doing some testing, but the problem persists. I'm asking because you wrote that you have already implemented the fix locally; would it be possible to share it here? I would really appreciate it. Thanks.
edit: Also, I've noticed that:

- After running the script between two dates, it doesn't fetch the same number of tweets every time; the next run misses around 1 in every 1000 tweets.
- After the script breaks (because of this known issue) and is rerun, it doesn't break again on the same tweets, though the issue persists and it breaks again later on.
> After running the script between two dates, it doesn't fetch the same number of tweets every time; the next run misses around 1 in every 1000 tweets.
Probably on Twitter's end, unless they're doing something weird with the pagination again.
> I'm not fetching any label value
You can't control that. snscrape always retrieves and parses that data, whether you access/use it later or not, and it crashes at the parsing step (see the sketch at the end of this comment).
> I'm asking because you wrote that you have already implemented the fix locally; would it be possible to share it here?
I'll commit it in a second. The fix should be safe even though I can't really test it.
> Also, I've noticed that: ...
File that under 'Twitter's search is weird'.
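To illustrate the "crashes at the parsing step" point: the KeyError is raised inside the get_items() generator, and a generator that has raised is finished, so catching the exception on the caller's side only lets you stop cleanly with what was scraped so far, not skip the bad entry and resume. A sketch, with a placeholder query:

```python
# Catching the error preserves the tweets collected so far, but the
# generator cannot be resumed after it raises.
import snscrape.modules.twitter as sntwitter

tweets = []
try:
    for tweet in sntwitter.TwitterSearchScraper('bitcoin since:2022-01-01').get_items():
        tweets.append(tweet)
except KeyError as e:
    print(f'scrape aborted after {len(tweets)} tweets: {e!r}')
```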
@JustAnotherArchivist @TheTechRobo Thank you very much guys. I'll test the fix and I hope that it will stop the crashing. 🙏
@JustAnotherArchivist @TheTechRobo I tested the fix on multiple queries that previously crashed and I confirm that it indeed works. Thank you very much!
Thanks for your work @JustAnotherArchivist! FYI, your work is awesome and used by many researchers. I am using this for a public health study - I will include citations as per:
Author: JustAnotherArchivist
Title: snscrape: A social networking service scraper in Python
URL: https://github.com/JustAnotherArchivist/snscrape
Version: the version you used, e.g. v0.3.4, or the commit ID if you used code that doesn't have a version number, e.g. commit https://github.com/JustAnotherArchivist/snscrape/commit/97c8caea48b1d79f776f8a2aabcadab34d633b5c
Hello,
I get the above error after scraping around 13000 entries:
```python
sntwitter.TwitterSearchScraper('bitcoin since:2000-01-01 until:2022-07-01').get_items()
```
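For completeness, a runnable form of that snippet; only the query string comes from the report, while the import alias and the loop are assumptions (get_items() is a generator, so nothing is fetched until it is iterated):

```python
# The query string is from the report above; everything else is an
# assumption needed to make the snippet runnable.
import snscrape.modules.twitter as sntwitter

query = 'bitcoin since:2000-01-01 until:2022-07-01'
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    print(i, tweet.date, tweet.id)
    if i >= 100:  # small cap for demonstration
        break
```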