JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.42k stars, 706 forks

VK user scraper fails with an AssertionError #340

Closed by TheTechRobo 2 years ago

TheTechRobo commented 2 years ago
➜  scihub snscrape --jsonl --progress vkontakte-user sci_hub > sci_hub.vk.txt
2022-01-03 19:39:30.311  WARNING  snscrape.modules.vkontakte  Skipping post without link: '<div class="copy_quote"><div class="copy_post_header">\n<a class="copy_post_image" data-parent-post-id="-36928352_32104" data-post-id="8797034_5881" href="/barbara_barbaris">\n<img alt="Barbara Barbaris'
Scraping, 100 results so far
2022-01-03 19:39:36.573  CRITICAL  snscrape._cli  Dumped stack and locals to /tmp/snscrape_locals_i29iynkf
Traceback (most recent call last):
  File "/home/thetechrobo/.local/bin/snscrape", line 8, in <module>
    sys.exit(main())
  File "/home/thetechrobo/.local/lib/python3.9/site-packages/snscrape/_cli.py", line 308, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File "/home/thetechrobo/.local/lib/python3.9/site-packages/snscrape/modules/vkontakte.py", line 297, in get_items
    yield from _process_soup(soup)
  File "/home/thetechrobo/.local/lib/python3.9/site-packages/snscrape/modules/vkontakte.py", line 266, in _process_soup
    for item in self._soup_to_items(soup):
  File "/home/thetechrobo/.local/lib/python3.9/site-packages/snscrape/modules/vkontakte.py", line 224, in _soup_to_items
    yield self._post_div_to_item(post)
  File "/home/thetechrobo/.local/lib/python3.9/site-packages/snscrape/modules/vkontakte.py", line 211, in _post_div_to_item
    quotedPost = self._post_div_to_item(quoteDiv, isCopy = True) if (quoteDiv := post.find('div', class_ = 'copy_quote')) else None
  File "/home/thetechrobo/.local/lib/python3.9/site-packages/snscrape/modules/vkontakte.py", line 154, in _post_div_to_item
    assert (url.startswith('https://vk.com/wall') or (isCopy and (url.startswith('https://vk.com/video') or url.startswith('https://vk.com/photo')))) and '_' in url and url[-1] != '_' and url.rsplit('_', 1)[1].strip('0123456789') == ''
AssertionError

snscrape_locals_i29iynkf

JustAnotherArchivist commented 2 years ago

Thanks. The offending post is a quoted comment. I didn't even know that was possible...
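
For readers trying to follow the traceback, here is a minimal sketch of the condition that the failing `assert` checks, pulled out into a standalone function. The function name `is_expected_post_url` and the `is_copy` parameter name are made up for illustration and are not part of snscrape; only the boolean expression is taken from the traceback above. The last URL in the usage lines is a purely hypothetical shape for a comment link (the real offending URL is not shown in this thread), chosen only to demonstrate how a non-numeric suffix trips the check.

```python
def is_expected_post_url(url: str, is_copy: bool = False) -> bool:
    """Mirror of the expression asserted in _post_div_to_item (per the traceback).

    Hypothetical helper for illustration only; snscrape itself uses a bare assert.
    """
    # Regular posts must be wall links; quoted (copied) posts may also be
    # video or photo links.
    allowed_prefix = url.startswith('https://vk.com/wall') or (
        is_copy
        and (url.startswith('https://vk.com/video') or url.startswith('https://vk.com/photo'))
    )
    # The URL must contain an underscore, must not end with one, and everything
    # after the last underscore must consist of digits only.
    has_numeric_suffix = (
        '_' in url
        and url[-1] != '_'
        and url.rsplit('_', 1)[1].strip('0123456789') == ''
    )
    return allowed_prefix and has_numeric_suffix


# A plain wall post URL passes.
print(is_expected_post_url('https://vk.com/wall-36928352_32104'))
# A video URL is only accepted for quoted posts.
print(is_expected_post_url('https://vk.com/video-1_2'))
print(is_expected_post_url('https://vk.com/video-1_2', is_copy=True))
# Hypothetical comment-style URL: the text after the last underscore is not
# purely numeric, so the check fails.
print(is_expected_post_url('https://vk.com/wall-36928352_32104?reply=1'))
```

Under these assumptions, any quoted item whose link does not end in `_<digits>` would raise the `AssertionError` seen above, which is consistent with a quoted comment being the trigger.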