DannyBen / snapcrawl

Crawl a website and take screenshots
MIT License
57 stars 12 forks

Multiple Root Passes with # hrefs #39

Closed andrewexton373 closed 3 years ago

andrewexton373 commented 3 years ago

I'm trying to capture images from the site http://ransomit.com

snapcrawl seems to get confused about the structure of the website. I believe the issue is caused by links with href='#' in the HTML of the site above.

andrew:snaps/ $ snapcrawl ransomit.com log_level=0 depth=4 width=1024
DEBUG : verifying phantomjs is present
DEBUG : verifying imagemagick is present
DEBUG : initializing cli
DEBUG : initializing crawler with http://ransomit.com
DEBUG : config {"depth"=>4, "width"=>1024, "height"=>0, "cache_life"=>86400, "cache_dir"=>"cache", "snaps_dir"=>"snaps", "name_template"=>"%%{url}", "url_whitelist"=>nil, "url_blacklist"=>nil, "css_selector"=>nil, "log_level"=>0, "log_color"=>"auto", "skip_ssl_verification"=>false, "screenshot_delay"=>nil}
DEBUG : processing queue: 1 remaining
 INFO : processing http://ransomit.com, depth: 0
 INFO : screenshot for / already exists
DEBUG : processing queue: 1 remaining
 INFO : processing http://ransomit.com#, depth: 1
 INFO : screenshot for / already exists
DEBUG : processing queue: 1 remaining
 INFO : processing http://ransomit.com#, depth: 2
 INFO : screenshot for / already exists
DEBUG : processing queue: 1 remaining
 INFO : processing http://ransomit.com#, depth: 3
 INFO : screenshot for / already exists
DEBUG : processing queue: 1 remaining
 INFO : processing http://ransomit.com#, depth: 4
 INFO : screenshot for / already exists

Are there any steps I can take to successfully crawl and screenshot all levels of this site? Maybe utilizing the url_blacklist feature? Also, is this a bug, or an expected result? I'm not completely sure.

DannyBen commented 3 years ago

It should not get confused by the hash part of the URL. I need to check it further.
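
As an illustration of why the hash part should not matter, here is a minimal sketch using Ruby's standard `uri` library. It is not snapcrawl's actual code. The helper name `strip_fragment` and the `/about#team` path are hypothetical. It shows that a link to `#` points at the same page as the URL without it, so a crawler can treat both as one page.

```ruby
require "uri"

# Hypothetical helper (not part of snapcrawl): remove the fragment so that
# "http://ransomit.com#" and "http://ransomit.com" compare as the same page.
def strip_fragment(url)
  uri = URI.parse(url)
  uri.fragment = nil # assigning nil drops the "#..." part entirely
  uri.to_s
end

puts strip_fragment("http://ransomit.com#")
puts strip_fragment("http://ransomit.com/about#team") # "/about#team" is a made-up path
```

With this kind of normalization, every `http://ransomit.com#` entry in the log above maps back to the root page that was already captured.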

DannyBen commented 3 years ago

Well - found the issue.

Since your command did not specify a protocol, snapcrawl assumed the site is http://ransomit.com. Links to https://ransomit.com are then considered external and are therefore not crawled.
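
The effect can be sketched with Ruby's standard `uri` library. This is an assumption about how such a check might look, inferred from the behavior described here. It is not snapcrawl's actual implementation. The method names and the `/services` path are hypothetical.

```ruby
require "uri"

# Strict check (assumed behavior): a link is internal only when both the
# scheme and the host match the starting URL.
def same_site_strict?(base_url, link_url)
  base = URI.parse(base_url)
  link = URI.parse(link_url)
  base.scheme == link.scheme && base.host == link.host
end

# Looser, hypothetical variant: compare hosts only, so that http and https
# links to the same domain count as the same site.
def same_host?(base_url, link_url)
  URI.parse(base_url).host == URI.parse(link_url).host
end

link = "https://ransomit.com/services" # made-up path for illustration

puts same_site_strict?("http://ransomit.com", link)  # base assumed to be http
puts same_site_strict?("https://ransomit.com", link) # base given as https
puts same_host?("http://ransomit.com", link)
```

Under the strict check, starting from the http URL makes every https link look external, which matches the log where only the `#` link was followed. Starting from the https URL avoids the mismatch.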

Just run it with snapcrawl https://ransomit.com depth=2 width=1024 and you should be fine.

I will see if I can improve that unexpected behavior.

andrewexton373 commented 3 years ago

Ahhh, that makes sense. Appreciate the help!