bellingcat / cisticola

Coordinates scrapers and interfaces with database

Twitter scraper crashes in certain cases #30

Open loganwilliams opened 2 years ago

loganwilliams commented 2 years ago
2022-04-02 13:43:48.780 | ERROR    | cisticola.scraper.base:scrape_channels:398 - An error has been caught in function 'scrape_channels', process 'MainProcess' (63504), thread 'MainThread' (139778754606912):
Traceback (most recent call last):

  File "/home/loganw/cisticola/app.py", line 136, in <module>
    scrape_channels(args)
    │               └ Namespace(command='scrape-channels', gsheet=None, media=False)
    └ <function scrape_channels at 0x7f20b9d52a60>

  File "/home/loganw/cisticola/app.py", line 100, in scrape_channels
    controller.scrape_all_channels(archive_media = args.media)
    │          │                                   │    └ False
    │          │                                   └ Namespace(command='scrape-channels', gsheet=None, media=False)
    │          └ <function ScraperController.scrape_all_channels at 0x7f20bb9fd280>
    └ <cisticola.scraper.base.ScraperController object at 0x7f20b9d1bcd0>

  File "/home/loganw/cisticola/cisticola/scraper/base.py", line 332, in scrape_all_channels
    return self.scrape_channels(channels, archive_media=archive_media)
           │    │               │                       └ False
           │    │               └ [Channel(name='Qanonfighters', platform_id='1412770923', category='qanon', platform='Telegram', url='https://ttttt.me/qanonfi...
           │    └ <function ScraperController.scrape_channels at 0x7f20bb9fd3a0>
           └ <cisticola.scraper.base.ScraperController object at 0x7f20b9d1bcd0>

> File "/home/loganw/cisticola/cisticola/scraper/base.py", line 398, in scrape_channels
    for post in posts:
        │       └ <generator object TwitterScraper.get_posts at 0x7f20b78a1eb0>
        └ ScraperResult(scraper='TwitterScraper 0.0.1', platform='Twitter', channel=196, platform_id='1493714591309873158', date=dateti...

  File "/home/loganw/cisticola/cisticola/scraper/twitter.py", line 26, in get_posts
    for tweet in scraper.get_items():
        │        │       └ <function TwitterProfileScraper.get_items at 0x7f20ba836790>
        │        └ <snscrape.modules.twitter.TwitterProfileScraper object at 0x7f20b788d4f0>
        └ Tweet(url='https://twitter.com/rdh_des/status/1493714591309873158', date=datetime.datetime(2022, 2, 15, 22, 31, 25, tzinfo=da...

  File "/home/loganw/.local/share/virtualenvs/cisticola-BRujq-3x/lib/python3.9/site-packages/snscrape/modules/twitter.py", line 933, in get_items
    yield from self._graphql_timeline_instructions_to_tweets(instructions)
               │    │                                        └ [{'type': 'TimelineClearCache'}, {'type': 'TimelineAddEntries', 'entries': [{'entryId': 'tweet-1493714591309873158', 'sortInd...
               │    └ <function _TwitterAPIScraper._graphql_timeline_instructions_to_tweets at 0x7f20ba839e50>
               └ <snscrape.modules.twitter.TwitterProfileScraper object at 0x7f20b788d4f0>
  File "/home/loganw/.local/share/virtualenvs/cisticola-BRujq-3x/lib/python3.9/site-packages/snscrape/modules/twitter.py", line 678, in _graphql_timeline_instructions_to_tweets
    yield self._graphql_timeline_tweet_item_result_to_tweet(entry['content']['itemContent']['tweet_results']['result'])
          │    │                                            └ {'entryId': 'tweet-1470476515750076422', 'sortIndex': '1470476515750076422', 'content': {'entryType': 'TimelineTimelineItem',...
          │    └ <function _TwitterAPIScraper._graphql_timeline_tweet_item_result_to_tweet at 0x7f20ba839dc0>
          └ <snscrape.modules.twitter.TwitterProfileScraper object at 0x7f20b788d4f0>
  File "/home/loganw/.local/share/virtualenvs/cisticola-BRujq-3x/lib/python3.9/site-packages/snscrape/modules/twitter.py", line 668, in _graphql_timeline_tweet_item_result_to_tweet
    kwargs['card'] = self._make_card(result['card'], _TwitterAPIType.GRAPHQL)
    │                │    │          │               │               └ <_TwitterAPIType.GRAPHQL: 1>
    │                │    │          │               └ <enum '_TwitterAPIType'>
    │                │    │          └ {'__typename': 'Tweet', 'rest_id': '1470476515750076422', 'core': {'user_results': {'result': {'__typename': 'User', 'id': 'V...
    │                │    └ <function _TwitterAPIScraper._make_card at 0x7f20ba839ca0>
    │                └ <snscrape.modules.twitter.TwitterProfileScraper object at 0x7f20b788d4f0>
    └ {}
  File "/home/loganw/.local/share/virtualenvs/cisticola-BRujq-3x/lib/python3.9/site-packages/snscrape/modules/twitter.py", line 630, in _make_card
    return Card(**cardKwargs)
           │      └ {'url': 'https://t.co/ep5EvwSAdA'}
           └ <class 'snscrape.modules.twitter.Card'>

TypeError: __init__() missing 1 required positional argument: 'title'

loganwilliams commented 2 years ago
2022-04-02 13:43:43.000 | ERROR    | cisticola.scraper.base:scrape_channels:398 - An error has been caught in function 'scrape_channels', process 'MainProcess' (63504), thread 'MainThread' (139778754606912):
Traceback (most recent call last):

  File "/home/loganw/cisticola/app.py", line 136, in <module>
    scrape_channels(args)
    │               └ Namespace(command='scrape-channels', gsheet=None, media=False)
    └ <function scrape_channels at 0x7f20b9d52a60>

  File "/home/loganw/cisticola/app.py", line 100, in scrape_channels
    controller.scrape_all_channels(archive_media = args.media)
    │          │                                   │    └ False
    │          │                                   └ Namespace(command='scrape-channels', gsheet=None, media=False)
    │          └ <function ScraperController.scrape_all_channels at 0x7f20bb9fd280>
    └ <cisticola.scraper.base.ScraperController object at 0x7f20b9d1bcd0>

  File "/home/loganw/cisticola/cisticola/scraper/base.py", line 332, in scrape_all_channels
    return self.scrape_channels(channels, archive_media=archive_media)
           │    │               │                       └ False
           │    │               └ [Channel(name='Qanonfighters', platform_id='1412770923', category='qanon', platform='Telegram', url='https://ttttt.me/qanonfi...
           │    └ <function ScraperController.scrape_channels at 0x7f20bb9fd3a0>
           └ <cisticola.scraper.base.ScraperController object at 0x7f20b9d1bcd0>

> File "/home/loganw/cisticola/cisticola/scraper/base.py", line 398, in scrape_channels
    for post in posts:
        │       └ <generator object TwitterScraper.get_posts at 0x7f20b71f2c10>
        └ ScraperResult(scraper='TwitterScraper 0.0.1', platform='Twitter', channel=189, platform_id='1389019453803933697', date=dateti...

  File "/home/loganw/cisticola/cisticola/scraper/twitter.py", line 26, in get_posts
    for tweet in scraper.get_items():
        │        │       └ <function TwitterProfileScraper.get_items at 0x7f20ba836790>
        │        └ <snscrape.modules.twitter.TwitterProfileScraper object at 0x7f20b74a4d90>
        └ Tweet(url='https://twitter.com/MohamedDiallo_/status/1389019453803933697', date=datetime.datetime(2021, 5, 3, 0, 50, 19, tzin...

  File "/home/loganw/.local/share/virtualenvs/cisticola-BRujq-3x/lib/python3.9/site-packages/snscrape/modules/twitter.py", line 933, in get_items
    yield from self._graphql_timeline_instructions_to_tweets(instructions)
               │    │                                        └ [{'type': 'TimelineClearCache'}, {'type': 'TimelineAddEntries', 'entries': [{'entryId': 'tweet-1509192130408980498', 'sortInd...
               │    └ <function _TwitterAPIScraper._graphql_timeline_instructions_to_tweets at 0x7f20ba839e50>
               └ <snscrape.modules.twitter.TwitterProfileScraper object at 0x7f20b74a4d90>
  File "/home/loganw/.local/share/virtualenvs/cisticola-BRujq-3x/lib/python3.9/site-packages/snscrape/modules/twitter.py", line 678, in _graphql_timeline_instructions_to_tweets
    yield self._graphql_timeline_tweet_item_result_to_tweet(entry['content']['itemContent']['tweet_results']['result'])
          │    │                                            └ {'entryId': 'tweet-1509192130408980498', 'sortIndex': '1509192130408980498', 'content': {'entryType': 'TimelineTimelineItem',...
          │    └ <function _TwitterAPIScraper._graphql_timeline_tweet_item_result_to_tweet at 0x7f20ba839dc0>
          └ <snscrape.modules.twitter.TwitterProfileScraper object at 0x7f20b74a4d90>
  File "/home/loganw/.local/share/virtualenvs/cisticola-BRujq-3x/lib/python3.9/site-packages/snscrape/modules/twitter.py", line 668, in _graphql_timeline_tweet_item_result_to_tweet
    kwargs['card'] = self._make_card(result['card'], _TwitterAPIType.GRAPHQL)
    │                │    │          │               │               └ <_TwitterAPIType.GRAPHQL: 1>
    │                │    │          │               └ <enum '_TwitterAPIType'>
    │                │    │          └ {'__typename': 'Tweet', 'rest_id': '1509192130408980498', 'core': {'user_results': {'result': {'__typename': 'User', 'id': 'V...
    │                │    └ <function _TwitterAPIScraper._make_card at 0x7f20ba839ca0>
    │                └ <snscrape.modules.twitter.TwitterProfileScraper object at 0x7f20b74a4d90>
    └ {}
  File "/home/loganw/.local/share/virtualenvs/cisticola-BRujq-3x/lib/python3.9/site-packages/snscrape/modules/twitter.py", line 630, in _make_card
    return Card(**cardKwargs)
           │      └ {'url': 'https://twitter.com'}
           └ <class 'snscrape.modules.twitter.Card'>

TypeError: __init__() missing 1 required positional argument: 'title'
2022-04-02 13:43:43.067 | INFO     | cisticola.scraper.base:scrape_channels:407 - TwitterScraper 0.0.1 found 1 new posts from Channel(name='Mohamed Diallo Live', platform_id=None, category='qanon_content', platform='Twitter', url='https://twitter.com/MohamedDiallo_', screenname='MohamedDiallo_', country='FR(?)', influencer='Mohammed Diallo', public=True, chat=False, notes=None, source='researcher')

trislee commented 2 years ago

This bug is discussed in https://github.com/JustAnotherArchivist/snscrape/issues/407
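
For anyone reading the tracebacks above: both fail at `Card(**cardKwargs)` because the card data contains only a `url` and no `title`. The sketch below reproduces that failure mode with a stand-in class. The `Card` here is hypothetical (its fields are an assumption, not snscrape's actual definition), and `make_card` is only one possible way to guard against the missing key, not the upstream fix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Card:
    # Hypothetical stand-in for snscrape's Card; the field list is an assumption.
    title: str
    url: str
    description: Optional[str] = None


def fails_like_traceback(card_kwargs: dict) -> bool:
    """Return True if building a Card from these kwargs raises TypeError."""
    try:
        Card(**card_kwargs)
    except TypeError:
        return True
    return False


def make_card(card_kwargs: dict) -> Optional[Card]:
    """Build a Card, or return None when the required title is absent."""
    if "title" not in card_kwargs:
        return None
    return Card(**card_kwargs)


if __name__ == "__main__":
    # Same shape as the kwargs shown in the first traceback.
    print(fails_like_traceback({"url": "https://t.co/ep5EvwSAdA"}))
    print(make_card({"url": "https://twitter.com"}))
```

Returning `None` drops the card but keeps the tweet, which may or may not be the behavior the upstream maintainers choose.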

loganwilliams commented 2 years ago

Great, I'll wait for the upstream changes from JustAnotherArchivist.
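
In the meantime, a rough sketch of how a caller could keep the tweets yielded before the failure instead of losing the whole channel. The helper name `iter_until_error` and its `label` parameter are hypothetical and not part of cisticola or snscrape. Note that a generator cannot be resumed once it has raised, so this only salvages the items produced up to that point.

```python
import logging
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def iter_until_error(items: Iterable[T], label: str = "") -> Iterator[T]:
    """Yield items until the source raises, then log the error and stop."""
    iterator = iter(items)
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            return
        except Exception:
            # The underlying generator is finished after raising; stop here.
            logger.exception("Scraper for %s stopped early", label)
            return
        yield item


def flaky_source() -> Iterator[int]:
    """Stand-in for scraper.get_items(): two items, then the TypeError."""
    yield 1
    yield 2
    raise TypeError("__init__() missing 1 required positional argument: 'title'")


if __name__ == "__main__":
    print(list(iter_until_error(flaky_source(), label="demo")))
```

In `get_posts` this would wrap the `scraper.get_items()` call, assuming that losing the remainder of a timeline is acceptable until snscrape is fixed.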