JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

KeyError for Tweet ID on Twitter scrapes #61

Closed by JustAnotherArchivist 3 years ago

JustAnotherArchivist commented 4 years ago

Reported by @ivan on IRC:

https://gist.github.com/ivan/b481a0cd141a459ab47566b3b890f3a4

```
# snscrape twitter-search COVID-19
2020-04-08 00:35:40.829  CRITICAL  snscrape.cli  Dumped stack and locals to /tmp/snscrape_locals_9hg1a3n7
Traceback (most recent call last):
  File "/nix/store/51x3j4jzkdxlj9qpp8sb15h49dn81k1a-python3.7-snscrape-0.3.1/bin/.snscrape-wrapped", line 9, in <module>
    sys.exit(main())
  File "/nix/store/51x3j4jzkdxlj9qpp8sb15h49dn81k1a-python3.7-snscrape-0.3.1/lib/python3.7/site-packages/snscrape/cli.py", line 224, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File "/nix/store/51x3j4jzkdxlj9qpp8sb15h49dn81k1a-python3.7-snscrape-0.3.1/lib/python3.7/site-packages/snscrape/modules/twitter.py", line 183, in get_items
    tweet = obj['globalObjects']['tweets'][entry['content']['item']['content']['tweet']['id']]
KeyError: '1247595173061804038'
```
Relevant part of the response data's `timeline.instructions[0].addEntries.entries`:

```
{
  "entryId": "sq-I-t-1247595173103767552",
  "sortIndex": "-1518030",
  "content": {
    "item": {
      "content": {
        "tweet": {
          "id": "1247595173103767552",
          "displayType": "Tweet",
          "highlights": {
            "textHighlights": [
              { "startIndex": 73, "endIndex": 80 }
            ]
          }
        }
      },
      "clientEventInfo": {
        "component": "result",
        "element": "tweet",
        "details": {
          "timelinesDetails": {
            "controllerData": ""
          }
        }
      }
    }
  }
},
{
  "entryId": "sq-I-t-1247595173061804038",
  "sortIndex": "-1518040",
  "content": {
    "item": {
      "content": {
        "tweet": {
          "id": "1247595173061804038",
          "displayType": "Tweet",
          "highlights": {
            "textHighlights": [
              { "startIndex": 203, "endIndex": 210 }
            ]
          }
        }
      },
      "clientEventInfo": {
        "component": "result",
        "element": "tweet",
        "details": {
          "timelinesDetails": {
            "controllerData": ""
          }
        }
      }
    }
  }
},
{
  "entryId": "sq-I-t-1247595172952780801",
  "sortIndex": "-1518050",
  "content": {
    "item": {
      "content": {
        "tweet": {
          "id": "1247595172952780801",
          "displayType": "Tweet",
          "highlights": {
            "textHighlights": [
              { "startIndex": 259, "endIndex": 267 }
            ]
          }
        }
      },
      "clientEventInfo": {
        "component": "result",
        "element": "tweet",
        "details": {
          "timelinesDetails": {
            "controllerData": ""
          }
        }
      }
    }
  }
},
```

That Tweet also wasn't in the previous output, i.e. it was not a search result itself. By the time I investigated this (10-15 minutes after the crash occurred), it was already deleted or otherwise unavailable. My guess is that it was quoted by another search result and contained in a previous response's `globalObjects` for that reason. If anyone finds a reproducible and preferably small test case for this, please let me know. Until then, this can't really be fixed.

JustAnotherArchivist commented 3 years ago

After implementing #5, I found an easily reproducible example of this bug: `snscrape twitter-profile textilfes` crashes with `KeyError: '1288815875856924672'`. For that tweet, Twitter shows 'This Tweet is from an account that no longer exists.' The tweet data is not included in a previous `globalObjects`. The profile page seems to just not display anything for that tweet; it should appear between 1288816073232519170 and 1288815476953448448. The same thing is true for tweet 1286310057601368065, which came from an account that has since been suspended.

I'll simply ignore these and log a warning.
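
For illustration, the "ignore and log a warning" approach could look roughly like the sketch below. This is not the actual change made in snscrape. The function name `iter_available_tweets` is hypothetical, and the response layout is assumed from the traceback and the `entries` excerpt earlier in this issue.

```python
import logging

logger = logging.getLogger(__name__)


def iter_available_tweets(obj):
    """Yield tweet objects referenced by timeline entries.

    Hypothetical helper: `obj` is assumed to be a parsed API response with
    the `globalObjects.tweets` and `timeline.instructions` layout shown in
    this issue. Entries whose tweet ID is missing from `globalObjects` are
    skipped with a warning instead of raising KeyError.
    """
    tweets = obj['globalObjects']['tweets']
    for instruction in obj['timeline']['instructions']:
        entries = instruction.get('addEntries', {}).get('entries', [])
        for entry in entries:
            try:
                tweet_id = entry['content']['item']['content']['tweet']['id']
            except KeyError:
                # Not a tweet entry (assumed: e.g. a cursor); nothing to yield.
                continue
            tweet = tweets.get(tweet_id)
            if tweet is None:
                logger.warning('Skipping tweet %s: not present in globalObjects', tweet_id)
                continue
            yield tweet
```

With this shape, a tweet from a deleted or suspended account produces one warning line, and iteration continues with the next entry.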