JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

AttributeError when scraping facebook user #36

Closed (jodizzle closed 5 years ago)

jodizzle commented 5 years ago

Similar to #35:

2019-05-01 22:25:38.717  INFO  snscrape.modules.facebook  Retrieving next page
2019-05-01 22:25:38.717  INFO  snscrape.base  Retrieving https://www.facebook.com/pages_reaction_units/more/?page_id=9094598058&cursor=%7B%22timeline_cursor%22%3A%22timeline_unit%3A1%3A00000000001319942759%3A04611686018427387904%3A09223372036854775704%3A04611686018427387904%22%2C%22timeline_section_cursor%22%3A%7B%22profile_id%22%3A9094598058%2C%22start%22%3A1325404800%2C%22end%22%3A1357027199%2C%22query_type%22%3A8%2C%22filter%22%3A1%7D%2C%22has_next_page%22%3Atrue%7D&surface=www_pages_home&unit_count=8&__a=1
2019-05-01 22:25:38.717  DEBUG  snscrape.base  ... with headers: {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36', 'Accept-Language': 'en-US,en;q=0.5'}
2019-05-01 22:25:39.275  DEBUG  urllib3.connectionpool  https://www.facebook.com:443 "GET /pages_reaction_units/more/?page_id=9094598058&cursor=%7B%22timeline_cursor%22%3A%22timeline_unit%3A1%3A00000000001319942759%3A04611686018427387904%3A09223372036854775704%3A04611686018427387904%22%2C%22timeline_section_cursor%22%3A%7B%22profile_id%22%3A9094598058%2C%22start%22%3A1325404800%2C%22end%22%3A1357027199%2C%22query_type%22%3A8%2C%22filter%22%3A1%7D%2C%22has_next_page%22%3Atrue%7D&surface=www_pages_home&unit_count=8&__a=1 HTTP/1.1" 200 None
2019-05-01 22:25:39.279  DEBUG  snscrape.base  https://www.facebook.com/pages_reaction_units/more/?page_id=9094598058&cursor=%7B%22timeline_cursor%22%3A%22timeline_unit%3A1%3A00000000001319942759%3A04611686018427387904%3A09223372036854775704%3A04611686018427387904%22%2C%22timeline_section_cursor%22%3A%7B%22profile_id%22%3A9094598058%2C%22start%22%3A1325404800%2C%22end%22%3A1357027199%2C%22query_type%22%3A8%2C%22filter%22%3A1%7D%2C%22has_next_page%22%3Atrue%7D&surface=www_pages_home&unit_count=8&__a=1 retrieved successfully
Traceback (most recent call last):
  File "/home/user/env/bin/snscrape", line 11, in <module>
    load_entry_point('snscrape==0.1.3', 'console_scripts', 'snscrape')()
  File "/home/user/env/lib/python3.7/site-packages/snscrape-0.1.3-py3.7.egg/snscrape/cli.py", line 83, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File "/home/user/env/lib/python3.7/site-packages/snscrape-0.1.3-py3.7.egg/snscrape/modules/facebook.py", line 112, in get_items
    yield from self._soup_to_items(soup, baseUrl)
  File "/home/user/env/lib/python3.7/site-packages/snscrape-0.1.3-py3.7.egg/snscrape/modules/facebook.py", line 60, in _soup_to_items
    href = entryA.get('href')
AttributeError: 'NoneType' object has no attribute 'get'

The facebook user is 'leopoldolopezoficial'. This one also seems to be reproducible.

jodizzle commented 5 years ago

I should mention that even though the snscrape version in the above log isn't technically 2.0, the commit being used is https://github.com/JustAnotherArchivist/snscrape/commit/64693f74bbca4183b29b664e0a1935b4fdbd2f2a, right before the version bump. Same for #35.

JustAnotherArchivist commented 5 years ago

Another example: Nacionalista-Party-197002442379. Reproducible right now with the current master.

JustAnotherArchivist commented 5 years ago

This is caused by timeline entries which don't have a link. Specifically, they're "media sets", an ancient feature for sharing multiple pictures at once. Not all media sets cause it though, only some. Since they're quite rare, I'll just skip them (with a warning). If anyone desires proper support for them, please open a separate issue.
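For illustration only, here is a minimal sketch of the "skip with a warning" approach. It is not the actual change to `facebook.py`: it uses the standard library's `xml.etree.ElementTree` on a made-up snippet so that it runs on its own, and the names `iter_entry_links`, `SAMPLE`, and the `<timeline>`/`<entry>` markup are all hypothetical. The only part taken from the traceback is the failure mode itself: the lookup for the entry's link returns `None`, so calling `.get('href')` on it raises `AttributeError`.

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger(__name__)


def iter_entry_links(root):
    """Yield the href of each entry, skipping entries that have no link.

    Hypothetical helper: the element and attribute names are assumptions
    made for this sketch, not Facebook's real markup.
    """
    for entry in root.iter('entry'):
        link = entry.find('a')  # None when the entry has no <a> child
        if link is None:
            # Without this guard, link.get('href') would raise
            # AttributeError: 'NoneType' object has no attribute 'get'
            logger.warning('Skipping entry without a link: %s', entry.get('id'))
            continue
        href = link.get('href')
        if href is not None:
            yield href


# Made-up timeline: entry 2 stands in for a link-less "media set".
SAMPLE = """
<timeline>
  <entry id="1"><a href="/page/posts/1">post</a></entry>
  <entry id="2"><span>media set without a link</span></entry>
  <entry id="3"><a href="/page/posts/3">post</a></entry>
</timeline>
""".strip()

links = list(iter_entry_links(ET.fromstring(SAMPLE)))
```

In this sketch the link-less entry is logged and dropped, and iteration continues with the remaining entries instead of aborting the whole scrape.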