bellingcat / cisticola

Coordinates scrapers and interfaces with database

Rumble scraper crashes before it finishes scraping channel #28

Closed by loganwilliams 2 years ago

loganwilliams commented 2 years ago
2022-04-02 13:33:02.042 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/c/deqodeurs?page=1 succeeded on attempt: 0/5
2022-04-02 13:33:02.586 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vz9y71-le-naufrage-invitable....html succeeded on attempt: 0/5
2022-04-02 13:33:03.122 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vyx0yj-tentations....html succeeded on attempt: 0/5
2022-04-02 13:33:03.728 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vyf4v9-faux-drapeaux-et-fraudes-lectorales....html succeeded on attempt: 0/5
2022-04-02 13:33:04.268 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vy4k4l-on-gagne-mais....html succeeded on attempt: 0/5
2022-04-02 13:33:04.825 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vxrib9-cest-rel-et-imminent-.html succeeded on attempt: 0/5
2022-04-02 13:33:05.370 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vxg6lz-quand-ils-arrivent-tout-change-.html succeeded on attempt: 0/5
2022-04-02 13:33:05.904 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vx3crb-le-feu-est-allum-.html succeeded on attempt: 0/5
2022-04-02 13:33:06.433 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vwwupr-dqodage-du-poste-to1528-du-08032022.html succeeded on attempt: 0/5
2022-04-02 13:33:06.985 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vwqqjx-biblique-.html succeeded on attempt: 0/5
2022-04-02 13:33:07.515 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vwcvrj-une-nouvelle-cit-et-la-folie-de-mars....html succeeded on attempt: 0/5
2022-04-02 13:33:08.041 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vw59cp-tout-devient-clair-.html succeeded on attempt: 0/5
2022-04-02 13:33:08.582 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vw0set-ltat-profond-en-panique-totale-concernant-lukraine-.html succeeded on attempt: 0/5
2022-04-02 13:33:09.114 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vw02qn-une-socit-parallle-contre-le-totalitarisme-comment-crer-un-monde-libre.html succeeded on attempt: 0/5
2022-04-02 13:33:09.696 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vvy1n2-petit-tour-de-dsinformation-autour-de-lukraine-26-02-22.html succeeded on attempt: 0/5
2022-04-02 13:33:10.226 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vvu7tt-le-retour-de-q-.html succeeded on attempt: 0/5
2022-04-02 13:33:10.780 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vvigq7-la-transition-commence.html succeeded on attempt: 0/5
2022-04-02 13:33:11.332 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vv8npt-dvolution-en-cours....html succeeded on attempt: 0/5
2022-04-02 13:33:11.891 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vv03yz-la-famille-se-runit-et-le-vent-tourne....html succeeded on attempt: 0/5
2022-04-02 13:33:12.438 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vuop2f-le-deuil-dun-monde-ancien-et-mort....html succeeded on attempt: 0/5
2022-04-02 13:33:13.002 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vuh0se-fou-rire-lors-du-dernier-live-08-02-2022.html succeeded on attempt: 0/5
2022-04-02 13:33:13.564 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vudj3g-quattends-tu-.html succeeded on attempt: 0/5
2022-04-02 13:33:14.117 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vu11tf-phase-5...-impact-.html succeeded on attempt: 0/5
2022-04-02 13:33:14.656 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vtsbtv-le-dilemme....html succeeded on attempt: 0/5
2022-04-02 13:33:15.212 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vtgtqx-les-nuages....html succeeded on attempt: 0/5
2022-04-02 13:33:15.757 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vt7e74-invitable.html succeeded on attempt: 0/5
2022-04-02 13:33:16.280 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/c/deqodeurs?page=2 succeeded on attempt: 0/5
2022-04-02 13:33:16.816 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vt0hhz-anatomie-dune-tempte.html succeeded on attempt: 0/5
2022-04-02 13:33:17.370 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vsrz8n-la-tempte-arrive-.html succeeded on attempt: 0/5
2022-04-02 13:33:17.928 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vsj2c7-chec-et-mat-.html succeeded on attempt: 0/5
2022-04-02 13:33:18.627 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vsffd5-a-message-to-steve-bannon-from-french-patriots.html succeeded on attempt: 0/5
2022-04-02 13:33:19.199 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vsd101-on-a-remport-cette-bataille-.html succeeded on attempt: 0/5
2022-04-02 13:33:19.733 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vs48vz-cest-un-pige-.html succeeded on attempt: 0/5
2022-04-02 13:33:20.288 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vrwlv7-psychose-de-masse-.html succeeded on attempt: 0/5
2022-04-02 13:33:20.845 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vrq1k9-rveillon-2022.html succeeded on attempt: 0/5
2022-04-02 13:33:21.436 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vrp32v-les-10-commandements-pour-2022.html succeeded on attempt: 0/5
2022-04-02 13:33:21.978 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vrj1o7-actu-questions-et-rponses.html succeeded on attempt: 0/5
2022-04-02 13:33:22.524 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vrc2cz-nol-en-famille.html succeeded on attempt: 0/5
2022-04-02 13:33:23.087 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vr8xhx-urgent-larme-amricaine-teste-un-vaccin-universel-pour-combattre-le-vaxxin-.html succeeded on attempt: 0/5
2022-04-02 13:33:23.656 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vr63ph-celui-qui-a-des-yeux....html succeeded on attempt: 0/5
2022-04-02 13:33:24.215 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/vqxu26-lappel-des-cieux.html succeeded on attempt: 0/5
2022-04-02 13:33:24.246 | ERROR    | cisticola.scraper.base:scrape_channels:398 - An error has been caught in function 'scrape_channels', process 'MainProcess' (63504), thread 'MainThread' (139778754606912):
Traceback (most recent call last):

  File "/home/loganw/cisticola/app.py", line 136, in <module>
    scrape_channels(args)
    │               └ Namespace(command='scrape-channels', gsheet=None, media=False)
    └ <function scrape_channels at 0x7f20b9d52a60>

  File "/home/loganw/cisticola/app.py", line 100, in scrape_channels
    controller.scrape_all_channels(archive_media = args.media)
    │          │                                   │    └ False
    │          │                                   └ Namespace(command='scrape-channels', gsheet=None, media=False)
    │          └ <function ScraperController.scrape_all_channels at 0x7f20bb9fd280>
    └ <cisticola.scraper.base.ScraperController object at 0x7f20b9d1bcd0>

  File "/home/loganw/cisticola/cisticola/scraper/base.py", line 332, in scrape_all_channels
    return self.scrape_channels(channels, archive_media=archive_media)
           │    │               │                       └ False
           │    │               └ [Channel(name='Qanonfighters', platform_id='1412770923', category='qanon', platform='Telegram', url='https://ttttt.me/qanonfi...
           │    └ <function ScraperController.scrape_channels at 0x7f20bb9fd3a0>
           └ <cisticola.scraper.base.ScraperController object at 0x7f20b9d1bcd0>

> File "/home/loganw/cisticola/cisticola/scraper/base.py", line 398, in scrape_channels
    for post in posts:
        │       └ <generator object RumbleScraper.get_posts at 0x7f20b78f35f0>
        └ ScraperResult(scraper='RumbleScraper 0.0.1', platform='Rumble', channel=68, platform_id='vobny0', date=datetime.datetime(2021...

  File "/home/loganw/cisticola/cisticola/scraper/rumble.py", line 23, in get_posts
    for post in scraper:
        │       └ <generator object get_channel_videos at 0x7f20b78f3190>
        └ {'title': "L'Appel Des Cieux", 'thumbnail': 'https://sp.rmbl.ws/s8/1/U/b/N/S/UbNSc.oq1b.2-small-LAppel-Des-Cieux.jpg', 'link'...

  File "/home/loganw/cisticola/cisticola/scraper/rumble.py", line 127, in get_channel_videos
    yield process_video(video)
          │             └ <li class="video-listing-entry"><article class="video-item"><h3 class="video-item--title">FactCheckers = FakeNews !!!</h3><a ...
          └ <function process_video at 0x7f20ba8a9790>

  File "/home/loganw/cisticola/cisticola/scraper/rumble.py", line 101, in process_video
    'views' : video.find('span', {'class' : 'video-item--views'})['data-value'],
              │     └ <function Tag.find at 0x7f20bb3719d0>
              └ <li class="video-listing-entry"><article class="video-item"><h3 class="video-item--title">FactCheckers = FakeNews !!!</h3><a ...

TypeError: 'NoneType' object is not subscriptable
2022-04-02 13:33:24.270 | INFO     | cisticola.scraper.base:scrape_channels:407 - RumbleScraper 0.0.1 found 39 new posts from Channel(name='Les DéQodeurs', platform_id=None, category='explicit_qanon', platform='Rumble', url='https://rumble.com/c/deqodeurs', screenname=None, country='FR', influencer='Léonard Solji', public=False, chat=False, notes='Les DeQodeurs', source='researcher')
loganwilliams commented 2 years ago
2022-04-02 13:43:07.625 | DEBUG    | cisticola.utils:make_request:41 - Request for url: https://rumble.com/user/ChloeFra?page=1 succeeded on attempt: 0/5
2022-04-02 13:43:07.643 | ERROR    | cisticola.scraper.base:scrape_channels:398 - An error has been caught in function 'scrape_channels', process 'MainProcess' (63504), thread 'MainThread' (139778754606912):
Traceback (most recent call last):

  File "/home/loganw/cisticola/app.py", line 136, in <module>
    scrape_channels(args)
    │               └ Namespace(command='scrape-channels', gsheet=None, media=False)
    └ <function scrape_channels at 0x7f20b9d52a60>

  File "/home/loganw/cisticola/app.py", line 100, in scrape_channels
    controller.scrape_all_channels(archive_media = args.media)
    │          │                                   │    └ False
    │          │                                   └ Namespace(command='scrape-channels', gsheet=None, media=False)
    │          └ <function ScraperController.scrape_all_channels at 0x7f20bb9fd280>
    └ <cisticola.scraper.base.ScraperController object at 0x7f20b9d1bcd0>

  File "/home/loganw/cisticola/cisticola/scraper/base.py", line 332, in scrape_all_channels
    return self.scrape_channels(channels, archive_media=archive_media)
           │    │               │                       └ False
           │    │               └ [Channel(name='Qanonfighters', platform_id='1412770923', category='qanon', platform='Telegram', url='https://ttttt.me/qanonfi...
           │    └ <function ScraperController.scrape_channels at 0x7f20bb9fd3a0>
           └ <cisticola.scraper.base.ScraperController object at 0x7f20b9d1bcd0>

> File "/home/loganw/cisticola/cisticola/scraper/base.py", line 398, in scrape_channels
    for post in posts:
        │       └ <generator object RumbleScraper.get_posts at 0x7f20b7524040>
        └ ScraperResult(scraper='TwitterScraper 0.0.1', platform='Twitter', channel=178, platform_id='1497407385425612804', date=dateti...

  File "/home/loganw/cisticola/cisticola/scraper/rumble.py", line 23, in get_posts
    for post in scraper:
                └ <generator object get_channel_videos at 0x7f20b7983e40>

  File "/home/loganw/cisticola/cisticola/scraper/rumble.py", line 127, in get_channel_videos
    yield process_video(video)
          │             └ <li class="video-listing-entry"><article class="video-item"><h3 class="video-item--title">lNFO en QUESTlONS #94 avec Raphaël ...
          └ <function process_video at 0x7f20ba8a9790>

  File "/home/loganw/cisticola/cisticola/scraper/rumble.py", line 101, in process_video
    'views' : video.find('span', {'class' : 'video-item--views'})['data-value'],
              │     └ <function Tag.find at 0x7f20bb3719d0>
              └ <li class="video-listing-entry"><article class="video-item"><h3 class="video-item--title">lNFO en QUESTlONS #94 avec Raphaël ...

TypeError: 'NoneType' object is not subscriptable
2022-04-02 13:43:07.665 | INFO     | cisticola.scraper.base:scrape_channels:407 - RumbleScraper 0.0.1 found 0 new posts from Channel(name='ChloeFra', platform_id=None, category='conspiracy', platform='Rumble', url='https://rumble.com/user/ChloeFra', screenname=None, country='FR', influencer='Chloé Frammery', public=True, chat=False, notes=None, source='researcher')
trislee commented 2 years ago

Fixed in https://github.com/bellingcat/cisticola/commit/b0a52e5ad7d600645951247a6e91a724f83cd677
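
For anyone reading the tracebacks above: both crashes come from line 101 of `rumble.py`, where `video.find('span', {'class' : 'video-item--views'})` apparently returns `None` for a listing entry without a views span, and indexing `None` with `['data-value']` raises the `TypeError`. The linked commit is not reproduced here, so the following is only a minimal sketch of one way to guard that kind of lookup, not necessarily what the fix does. The helper name `attr_or_none` and the `StubTag` class are hypothetical; `StubTag` only imitates the BeautifulSoup-style `find()` and `get()` methods so the example runs without extra dependencies.

```python
from typing import Any, Optional


def attr_or_none(parent: Any, name: str, css_class: str, key: str) -> Optional[str]:
    """Return the attribute `key` of the first matching child, or None.

    `parent` is assumed to expose a BeautifulSoup-style find(name, attrs).
    """
    element = parent.find(name, {"class": css_class})
    if element is None:
        # find() returns None when nothing matches; subscripting that None is
        # what raises "TypeError: 'NoneType' object is not subscriptable".
        return None
    # get() returns None instead of raising when the attribute is missing.
    return element.get(key)


class StubTag:
    """Hypothetical stand-in for a parsed element, used only for this demo."""

    def __init__(self, children=None, attrs=None):
        self.children = children or {}
        self.attrs = attrs or {}

    def find(self, name, attrs=None):
        # Look up a child by (tag name, class), returning None if absent.
        return self.children.get((name, (attrs or {}).get("class")))

    def get(self, key, default=None):
        return self.attrs.get(key, default)


# One entry with a views span, one without (like the entries in the tracebacks).
with_views = StubTag(
    children={("span", "video-item--views"): StubTag(attrs={"data-value": "1234"})}
)
without_views = StubTag()

views_present = attr_or_none(with_views, "span", "video-item--views", "data-value")
views_missing = attr_or_none(without_views, "span", "video-item--views", "data-value")
print(views_present)
print(views_missing)
```

With a guard like this, a video without a view count would yield `'views': None` and the generator would keep going to the remaining videos on the channel page.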