JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

KeyError: 'highlightedLabel' #559

Closed: renierts closed this issue 2 years ago

renierts commented 2 years ago
Traceback (most recent call last):
  File "/lustre/scratch2/ws/1/s2575425-twitter-sentiment-analysis/./sentiment.py", line 56, in <module>
    for i,tweet in tqdm(enumerate(sntwitter.TwitterSearchScraper('bitcoin since:2000-01-01 until:2022-07-01').get_items())):
  File "/lustre/scratch2/ws/1/s2575425-twitter-sentiment-analysis/.venv/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/lustre/scratch2/ws/1/s2575425-twitter-sentiment-analysis/.venv/lib/python3.10/site-packages/snscrape/modules/twitter.py", line 1454, in get_items
    yield from self._v2_timeline_instructions_to_tweets(obj)
  File "/lustre/scratch2/ws/1/s2575425-twitter-sentiment-analysis/.venv/lib/python3.10/site-packages/snscrape/modules/twitter.py", line 804, in _v2_timeline_instructions_to_tweets
    yield from self._v2_instruction_tweet_entry_to_tweet(entry['entryId'], entry['content'], obj)
  File "/lustre/scratch2/ws/1/s2575425-twitter-sentiment-analysis/.venv/lib/python3.10/site-packages/snscrape/modules/twitter.py", line 827, in _v2_instruction_tweet_entry_to_tweet
    yield self._tweet_to_tweet(tweet, obj)
  File "/lustre/scratch2/ws/1/s2575425-twitter-sentiment-analysis/.venv/lib/python3.10/site-packages/snscrape/modules/twitter.py", line 1266, in _tweet_to_tweet
    user = self._user_to_user(obj['globalObjects']['users'][tweet['user_id_str']])
  File "/lustre/scratch2/ws/1/s2575425-twitter-sentiment-analysis/.venv/lib/python3.10/site-packages/snscrape/modules/twitter.py", line 1367, in _user_to_user
    if 'ext' in user and (label := user['ext']['highlightedLabel']['r']['ok'].get('label')):
KeyError: 'highlightedLabel'

Hello,

I get the above error after scraping around 13000 entries: sntwitter.TwitterSearchScraper('bitcoin since:2000-01-01 until:2022-07-01').get_items()
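For reference, the failing loop from the traceback boils down to a short script like this (a minimal reproduction sketch; the query string is the one from the report):

import snscrape.modules.twitter as sntwitter
from tqdm import tqdm

query = 'bitcoin since:2000-01-01 until:2022-07-01'
# Reported to crash with KeyError: 'highlightedLabel' after roughly 13000 items.
for i, tweet in tqdm(enumerate(sntwitter.TwitterSearchScraper(query).get_items())):
    pass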

JustAnotherArchivist commented 2 years ago

I ran this scrape to over 34000 results and could not reproduce the crash.

vighnesh-balaji commented 2 years ago
KeyError                                  Traceback (most recent call last)
Input In [31], in <cell line: 2>()
      5 search = k + ' near:"New York" within:50km since:2021-01-01 until:2022-09-30'
      6 for j in range(10):
----> 7     for i, tweet in enumerate(sntwitter.TwitterSearchScraper(search).get_items()):
      8         if i > 5000:
      9             break

File ~\anaconda3\lib\site-packages\snscrape\modules\twitter.py:1454, in TwitterSearchScraper.get_items(self)
   1451     del paginationParams['tweet_search_mode']
   1453 for obj in self._iter_api_data('https://api.twitter.com/2/search/adaptive.json', _TwitterAPIType.V2, params, paginationParams, cursor = self._cursor):
-> 1454     yield from self._v2_timeline_instructions_to_tweets(obj)

File ~\anaconda3\lib\site-packages\snscrape\modules\twitter.py:804, in _TwitterAPIScraper._v2_timeline_instructions_to_tweets(self, obj, includeConversationThreads)
    802 for entry in entries:
    803     if entry['entryId'].startswith('sq-I-t-') or entry['entryId'].startswith('tweet-'):
--> 804         yield from self._v2_instruction_tweet_entry_to_tweet(entry['entryId'], entry['content'], obj)
    805     elif includeConversationThreads and entry['entryId'].startswith('conversationThread-') and not entry['entryId'].endswith('-show_more_cursor'):
    806         for item in entry['content']['timelineModule']['items']:

File ~\anaconda3\lib\site-packages\snscrape\modules\twitter.py:827, in _TwitterAPIScraper._v2_instruction_tweet_entry_to_tweet(self, entryId, entry, obj)
    825 else:
    826     raise snscrape.base.ScraperException(f'Unable to handle entry {entryId!r}')
--> 827 yield self._tweet_to_tweet(tweet, obj)

File ~\anaconda3\lib\site-packages\snscrape\modules\twitter.py:1266, in _TwitterAPIScraper._tweet_to_tweet(self, tweet, obj)
   1265 def _tweet_to_tweet(self, tweet, obj):
-> 1266     user = self._user_to_user(obj['globalObjects']['users'][tweet['user_id_str']])
   1267     kwargs = {}
   1268     if 'retweeted_status_id_str' in tweet:

File ~\anaconda3\lib\site-packages\snscrape\modules\twitter.py:1367, in _TwitterAPIScraper._user_to_user(self, user, id_)
   1365 kwargs['profileImageUrl'] = user['profile_image_url_https']
   1366 kwargs['profileBannerUrl'] = user.get('profile_banner_url')
-> 1367 if 'ext' in user and (label := user['ext']['highlightedLabel']['r']['ok'].get('label')):
   1368     kwargs['label'] = self._user_label_to_user_label(label)
   1369 return User(**kwargs)

KeyError: 'highlightedLabel'

I got the same error when trying to get tweets using keywords. The snscrape version is 0.4.3.20220107.dev64+g59abeaf and the Python version is 3.9.12.

I also don't get the same number of tweets every time I run the same query. Sometimes I get up to 15000 results and sometimes only 50. In my latest attempt I got this error.
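The cell in the traceback above is truncated; a self-contained reconstruction of the visible lines looks roughly like this (the keyword list and anything beyond the shown lines are assumptions):

import snscrape.modules.twitter as sntwitter

keywords = ['bitcoin']  # assumption: the original keyword list is not shown
for k in keywords:
    search = k + ' near:"New York" within:50km since:2021-01-01 until:2022-09-30'
    for j in range(10):
        for i, tweet in enumerate(sntwitter.TwitterSearchScraper(search).get_items()):
            if i > 5000:
                break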

kswanjitsu commented 2 years ago

I am getting this same error. I have used snscrape before and this did not happen. Is this a Twitter issue?

JustAnotherArchivist commented 2 years ago

Yeah, it sounds like a temporary issue with the data Twitter returns, namely that they sometimes omit the label (similar to #507). But until I get a reproducible example or a dump file, I can't fix this, I'm afraid. I haven't seen it in my own scrapes.

kswanjitsu commented 2 years ago

Here is one of the dump files: snscrape_locals_4h1p8euj.txt

JustAnotherArchivist commented 2 years ago

Thank you! I see what the problem is. Will fix tomorrow.

JustAnotherArchivist commented 2 years ago

For clarification, it's not a temporary issue with Twitter's data. Rather, any users that have some 'extension' but no label cause a crash. In the example above, it's the NFT avatar nonsense, but there might be others, too.
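In other words, the parser assumed that a user object with an 'ext' key always carries 'highlightedLabel' inside it. A hypothetical sketch of the two shapes (field names inside 'ext' other than 'highlightedLabel' are illustrative, not taken from Twitter's actual schema):

# Shape the parser expected: 'ext' always contains 'highlightedLabel'.
user_with_label = {
    'ext': {'highlightedLabel': {'r': {'ok': {'label': {'description': '...'}}}}},
}

# Shape that crashes: 'ext' is present (e.g. with NFT avatar metadata only),
# so user['ext']['highlightedLabel'] raises KeyError: 'highlightedLabel'.
user_without_label = {
    'ext': {'hasNftAvatar': {'r': {'ok': {'has_nft_avatar': True}}}},
}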

koufopoulosf commented 2 years ago

@JustAnotherArchivist

How can we skip this crash? Thanks!

JustAnotherArchivist commented 2 years ago

@koufopoulosf By waiting for the commit that fixes it. :-)

koufopoulosf commented 2 years ago

@JustAnotherArchivist Can't wait for this. Is there any quick fix maybe?

JustAnotherArchivist commented 2 years ago

@koufopoulosf Please get in touch with an offer for your desired level of paid support then...

koufopoulosf commented 2 years ago

@JustAnotherArchivist Do you have an email?

JustAnotherArchivist commented 2 years ago

@koufopoulosf Sure do, but like you, I don't want to post that publicly. You can find me on IRC and Reddit, see here.

JustAnotherArchivist commented 2 years ago

To get back to the issue at hand: I have implemented a fix locally, but unfortunately I'm still unable to reproduce the bug and therefore test the fix properly. Maybe Twitter is doing A/B testing with this? Will look into it further on the weekend. If anyone has more details on how to reproduce this, please let me know.

koufopoulosf commented 2 years ago

To get back to the issue at hand: I have implemented a fix locally, but unfortunately I'm still unable to reproduce the bug and therefore test the fix properly. Maybe Twitter is doing A/B testing with this? Will look into it further on the weekend. If anyone has more details on how to reproduce this, please let me know.

Hey, yes! I just noticed as well! I don't have this issue at the moment, while previously all my scrapers crashed. Now it seems to just skip values: I batch-insert every 5000 iterations, and it ends up storing a few less than x*5000 tweets (maybe it skips the ones with the issue?). I'm not sure what's happening.

koufopoulosf commented 2 years ago

p.s. @JustAnotherArchivist I've sent you a message on reddit :)

koufopoulosf commented 2 years ago

@JustAnotherArchivist If you add a counter, you will see the difference between the number of tweets this library was supposed to scrape and the number it actually scraped.

However, it didn't crash for some time yesterday, and now it crashes again...

The thing is, it crashes even though I am not even fetching the label value, which is the one that causes the crash. Would it work if I just removed everything related to the label value from the library's code?

TheTechRobo commented 2 years ago

That would be a quick fix, but there might be a lot of code that relies on it.

koufopoulosf commented 2 years ago

@TheTechRobo I think it's unavoidable... I wonder whether the lib would be faster if the unnecessary fetches were disabled. I'll give it a try.

koufopoulosf commented 2 years ago

@JustAnotherArchivist I'm looking at the code; I think I've found a way to increase performance :) I'm going to do some testing.

JustAnotherArchivist commented 2 years ago

@koufopoulosf I have no idea how a counter is supposed to help. The Twitter module processes all tweets returned by Twitter. If anything gets skipped, a warning log message is emitted. If anything goes wrong, it blows up with an error. Counting being off does not help with debugging this in any way. Twitter does sometimes return fewer results than you requested, by the way. snscrape doesn't care and just continues through the pagination.

There are no unnecessary fetches. Twitter returns all that information at once. snscrape's requests are pretty much the ones made by Twitter's web interface as well, and that's intentional. Modifying the parameters could prevent this exception but is not desirable.

(Also, I didn't get a message from you on Reddit.)
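As a side note, those warning log messages come through Python's standard logging module, so a basic configuration like this makes them visible (a sketch; the query is arbitrary):

import logging

import snscrape.modules.twitter as sntwitter

# snscrape's loggers are named after its modules (e.g. 'snscrape.modules.twitter'),
# so a root-level configuration at WARNING surfaces messages about skipped entries.
logging.basicConfig(level=logging.WARNING, format='%(levelname)s:%(name)s: %(message)s')

for tweet in sntwitter.TwitterSearchScraper('bitcoin since:2022-06-01 until:2022-07-01').get_items():
    pass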

koufopoulosf commented 2 years ago

@JustAnotherArchivist I mentioned the counter because I used a counter under snscrape, and when it stopped scraping, the number of fetched results did not match the counter. I don't know why. I'll test it again in a few hours.

Regarding reddit, I'm pretty sure that I've sent you one. Have you checked all of your messages?

TheTechRobo commented 2 years ago

Regarding reddit, I'm pretty sure that I've sent you one. Have you checked all of your messages?

Did you send it via chat or PMs? There's a difference, and chat is only available on new Reddit, which not everybody uses.

JustAnotherArchivist commented 2 years ago

@koufopoulosf I don't understand what you were doing exactly with counters, but it doesn't sound right. In any case, it's unrelated to this crash. snscrape does not silently swallow errors and omit results.

I checked the messages on the account linked there, yes. Either you sent it to the wrong user, or you used the chat or whatever thing Reddit invented this week, not messages.

koufopoulosf commented 2 years ago

Ah, that would be my bad... I'm getting back to my PC in a while. I'll try all the messaging options.

koufopoulosf commented 2 years ago

@JustAnotherArchivist Could you please share a quick fix for this issue? Which line can I safely remove to avoid it? I'm not fetching any label value (besides sourceLabel) and it still crashes... I thought Twitter was doing some testing, but the issue seems to persist. I'm asking because you wrote that you have already implemented the fix locally; would it be possible to share it here? I would really appreciate it. Thanks.

edit: Also I've noticed that:

  1. After running the script between two dates, it doesn't always fetch the same number of tweets. It misses around 1 in every 1000 tweets the next time I run the script.

  2. When the script breaks (because of this known issue) and I run it again, it doesn't break on the same tweets, though the issue persists and it breaks again later on.

TheTechRobo commented 2 years ago

After running the script between two dates, it doesn't always fetch the same number of tweets. It misses around 1 in every 1000 tweets the next time I run the script.

Probably on Twitter's end, unless they're doing something weird with the pagination again.

JustAnotherArchivist commented 2 years ago

I'm not fetching any label value

You can't control that. snscrape always retrieves and parses that data, whether you access/use it later or not. And it crashes at the parsing step.

I'm asking this because you wrote that you already have implemented this fix locally, would it be possible to share it here maybe?

I'll commit it in a second. The fix should be safe even though I can't really test it.

Also I've noticed that: ...

File that under 'Twitter's search is weird'.
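For readers following along, a minimal standalone sketch of the kind of defensive lookup involved, mirroring the crash site in _user_to_user from the tracebacks above (an illustration under those assumptions, not the actual committed change):

def extract_label(user):
    """Return the highlighted label, or None if any level of the path is missing."""
    ext = user.get('ext', {})
    return ext.get('highlightedLabel', {}).get('r', {}).get('ok', {}).get('label')

# 'ext' present but without 'highlightedLabel' (the crashing case) now yields
# None instead of raising KeyError.
assert extract_label({'ext': {'hasNftAvatar': {}}}) is None
assert extract_label({}) is None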

koufopoulosf commented 2 years ago

@JustAnotherArchivist @TheTechRobo Thank you very much guys. I'll test the fix and I hope that it will stop the crashing. 🙏

koufopoulosf commented 2 years ago

@JustAnotherArchivist @TheTechRobo I tested the fix on multiple queries that previously crashed and I confirm that it indeed works. Thank you very much!

kswanjitsu commented 2 years ago

Thanks for your work @JustAnotherArchivist! FYI, your work is awesome and used by many researchers. I am using this for a public health study and will include a citation as per:

Author: JustAnotherArchivist
Title: snscrape: A social networking service scraper in Python
URL: https://github.com/JustAnotherArchivist/snscrape
Version: the version you used, e.g. v0.3.4, or the commit ID if you used code that doesn't have a version number, e.g. commit https://github.com/JustAnotherArchivist/snscrape/commit/97c8caea48b1d79f776f8a2aabcadab34d633b5c