Traceback issue - Githubissues

eborbath commented 5 years ago

Hi,

Thanks for your work on the crawler. I keep running into an error when trying to scrape a page. The script runs, but it misses many posts, mostly because of this error, I think. I also tried the original repo and ran into the same problem.

Do you have any idea what is going wrong and how to fix it?

2019-04-27 19:42:33 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/ufi/reaction/profile/browser/?ft_ent_identifier=499126151404&refid=17&_ft_=top_level_post_id.499126196404%3Atl_objid.499126196404%3Apage_id.287770891404%3Aphoto_attachments_list.%5B499126196404%2C499126221404%2C499126241404%2C499126266404%5D%3Aphoto_id.499126196404%3Astory_location.4%3Astory_attachment_style.new_album%3Apage_insights.%7B%22287770891404%22%3A%7B%22role%22%3A1%2C%22page_id%22%3A287770891404%2C%22post_context%22%3A%7B%22story_fbid%22%3A%5B499129721404%2C499126151404%5D%2C%22publish_time%22%3A1290008806%2C%22story_name%22%3A%22EntPhotoNodeBasedEdgeStory%22%2C%22object_fbtype%22%3A22%7D%2C%22actor_id%22%3A287770891404%2C%22psn%22%3A%22EntPhotoNodeBasedEdgeStory%22%2C%22sl%22%3A4%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22targets%22%3A%5B%7B%22page_id%22%3A287770891404%2C%22actor_id%22%3A287770891404%2C%22role%22%3A1%2C%22post_id%22%3A499129721404%2C%22share_id%22%3A0%7D%2C%7B%22page_id%22%3A287770891404%2C%22actor_id%22%3A287770891404%2C%22role%22%3A1%2C%22post_id%22%3A499126151404%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.287770891404%3A306061129499414%3A43%3A1262332800%3A1293868799%3A-8996129610783703609&__tn__=%2AW-R#footer_action_list> (referer: https://mbasic.facebook.com/JobbikMagyarorszagertMozgalom?sectionLoadingID=m_timeline_loading_div_1293868799_1262332800_8_timeline_unit%3A1%3A00000000001290163115%3A04611686018427387904%3A09223372036854775778%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001290163115%3A04611686018427387904%3A09223372036854775778%3A04611686018427387904&timeend=1293868799&timestart=1262332800&tm=AQB7osGp0uU2aPmc&refid=17)
Traceback (most recent call last):
  File "c:\program files\python\python37\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\endre\Documents\GitHub\fbcrawl2\fbcrawl\spiders\fbcrawl.py", line 189, in parse_post
    reactions = response.urljoin(reactions[0].extract())
  File "c:\program files\python\python37\lib\site-packages\parsel\selector.py", line 61, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
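
For context, the last frames of the traceback show `reactions[0]` being indexed in `parse_post` when the selector matched nothing, which is what raises `IndexError: list index out of range`. The following is a minimal sketch of guarding against that case, not the project's actual fix. It uses only the standard library so it runs on its own; the function name `reactions_url` and the sample inputs are hypothetical. In the spider itself, `response.urljoin` would do the joining, and a `SelectorList`'s `extract_first()` returns `None` on an empty match instead of raising.

```python
from typing import Optional, Sequence
from urllib.parse import urljoin


def reactions_url(page_url: str, hrefs: Sequence[str]) -> Optional[str]:
    """Return the absolute reactions URL, or None when no link was found.

    `hrefs` stands in for the list of extracted href strings; an empty
    list corresponds to a selector that matched nothing.
    """
    if not hrefs:
        # Indexing hrefs[0] here would raise IndexError, as in the traceback.
        return None
    return urljoin(page_url, hrefs[0])


# Hypothetical inputs, for illustration only.
page = "https://mbasic.facebook.com/story.php"
print(reactions_url(page, ["/ufi/reaction/profile/browser/?ft_ent_identifier=1"]))
print(reactions_url(page, []))
```

With a guard like this, a post without a reactions link can be skipped or yielded without reaction data rather than aborting the callback for that response.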

ademjemaa commented 5 years ago

Hey, did you use the reactions spider or the post spider? In the reactions spider, if the link becomes too long it doesn't crawl anymore; I'm working on a solution for that (changing the fetch from 10 to 200). If it's the post spider, I'll look into it.

eborbath commented 5 years ago

Hey! I used the post spider. I actually also opened an issue here, so if you want, you can close this one and perhaps contribute there? Sorry for cross-posting.

ademjemaa commented 5 years ago

No problem about the cross-posting. I don't directly follow the fbcrawl project anymore, but I'll look into it.

eborbath commented 5 years ago

Hi! This is now fixed here. I'm closing the issue.

ademjemaa / fbcrawl

Traceback issue #1