DannyBen / snapcrawl

Crawl a website and take screenshots
MIT License

Screenshots not saving to default or specified folder locations #4

Closed SamuelMTDavies closed 5 years ago

SamuelMTDavies commented 5 years ago

Hi @DannyBen,

I have tried to use your tool to crawl a site -

Everything is working as it should as far as I can tell. But the screenshots are not saving. I have tried specifying various paths and ran the command as a super user but to no avail.

Command line: snapcrawl go https://www.website.co -d4

Result: Cycles through pages with the following for each picture.

Snap! Snapping picture... done Crawl: Page was cached. Reading subsequent URLs from cache

But nothing is saving as expected. I have looked in my caches but still no luck.

Have I misinterpreted the instructions or..? If I can get it working you will save me inordinate amounts of time.

Many Thanks,

Sam

DannyBen commented 5 years ago

Weird. I will take a look at it later today and reply with the findings. Can you please provide the exact command you are running, including the website you are trying to snap if possible?

SamuelMTDavies commented 5 years ago

I have tried all of the following with and without sudo (I tried some others but I think I got the syntax wrong or they weren't as comprehensive):

sudo snapcrawl go https://www.uown.co -d4 -f/Users/sam/snaps
sudo snapcrawl go https://www.uown.co -d4 -fsnaps
sudo snapcrawl go https://www.uown.co -d3

DannyBen commented 5 years ago

Ok, so a couple of thoughts so far:

  1. Are you on Windows or Linux? I haven't tested this at all on Windows
  2. snapcrawl requires phantomjs - if you are on linux, you don't have to do anything, since it installs it automatically, but if you are on windows, it could be that this automatic installation did not work.
  3. Trying your URL (or in fact, any https URL) on my machine, fails with an error
  4. snapcrawl depends on the screencap gem, which seems to no longer be maintained.
  5. I have tried downloading https screenshots directly with screencap, and I get the same error
  6. I found another ruby screen capture gem that I can implement in snapcrawl, and it works, but it will take me some time (not sure about your timeline and urgency concerns).

Your thoughts?

SamuelMTDavies commented 5 years ago

Hi Dan,

Just jumping to it.

  1. I'm on Mac OSX
  2. I installed phantomjs
  3. I tried it just as uown.co - but it didn't capture any subsequent levels of the site (maybe due to dns redirects?)
  4. / 5. I thought it may be a deprecation issue. Such a shame.
  6. I have to screengrab all pages of our site for some compliance obligations (~150 pages), which I need by EOP Friday. There are other options, but they all require large amounts of manual effort on my part, so your tool suited my needs perfectly. I understand that this is potentially something you have built as a project, so I understand if you can't/don't want to commit to updating it on any timescale, let alone a short turnaround.

If that is the case I understand and will use some of the other tools available online. I appreciate your time and effort replying.

DannyBen commented 5 years ago

There is a chance I can get something working by tomorrow early morning / noon. Maybe (not promising) even today.

I plan on fixing it either way - right now it just does not work.

Off-topic Edit: Not sure how it is on a mac, but in most circumstances I see, running a gem should not require sudo

SamuelMTDavies commented 5 years ago

You may be correct about the sudo - I just did so in case there was some finickety write permissions issue (I'm not very clued up on Ruby or coding beyond intermediate level stuff for that matter) so I thought it worth a try.

If you manage something by Friday I would be forever grateful. I will keep an eye out on this thread.

DannyBen commented 5 years ago

Stay tuned - I hope to have something ready today. I already integrated the new gem dependency, captured https successfully - next I test your site specifically, and do some polish (since some features will be removed), and release a gem for you to try.

DannyBen commented 5 years ago

Ok - if you want to test it, follow these steps:

  1. Create an empty folder and cd to it
  2. Create a new file called Gemfile
  3. Paste the content below into it
  4. Run bundle install
# Gemfile
source "https://rubygems.org"
git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }
gem "snapcrawl", github: 'DannyBen/snapcrawl', branch: 'webshot'

I tested it with your site, it captures, but I am not sure it looks good - you need to remember that these "headless browsers" that are used for these captures, are equivalent to old browsers.

Also, ignore the weird output that it might print, it is the webshot gem doing it, I will sort this out later.

Lastly, make sure you have the right versions of everything:

phantomjs --version - should be 2.x
snapcrawl --version - should be 0.2.4rc1
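The two version checks above can also be scripted. The following is only a sketch: `parse_version` and `version_ok?` are hypothetical helper names, not part of snapcrawl, and the exact text each tool prints for `--version` is an assumption. It relies on `Gem::Version` and `Gem::Requirement` from RubyGems, which treat `0.2.4rc1` as a prerelease that sorts after `0.2.3` and before `0.2.4`.

```ruby
require "rubygems"

# Hypothetical helper: pull the first version-looking token (for example
# "2.1.1" or "0.2.4rc1") out of the text printed by a `--version` command.
def parse_version(output)
  token = output[/\d+\.\d+[0-9A-Za-z.]*/]
  return nil unless token

  token = token.sub(/\.+\z/, "") # drop a trailing sentence dot, if any
  Gem::Version.correct?(token) ? Gem::Version.new(token) : nil
end

# Hypothetical helper: true when the parsed version satisfies a RubyGems
# requirement string such as "~> 2.0" or ">= 0.2.4rc1".
def version_ok?(output, requirement)
  version = parse_version(output)
  !version.nil? && Gem::Requirement.new(requirement).satisfied_by?(version)
end

# Possible usage (not executed here, since it needs both tools installed):
#   version_ok?(`phantomjs --version`, "~> 2.0")
#   version_ok?(`snapcrawl --version`, ">= 0.2.4rc1")
```

Passing the command output in as a string keeps the helpers usable even when one of the tools is missing from the machine.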

For starters, just run this command to capture the homepage:

snapcrawl go uown.co

SamuelMTDavies commented 5 years ago

Ok so I followed your instructions and got the following console print out. It created a snaps folder but still nothing in it.

-----> Visit: http://uown.co Snap! Snapping picture... done Crawl! Extracting links...
/System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/open-uri.rb:225:in `open_loop': redirection forbidden: http://uown.co -> https://www.uown.co/ (RuntimeError)
    from /System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/open-uri.rb:151:in `open_uri'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/open-uri.rb:717:in `open'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/open-uri.rb:35:in `open'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.3/lib/snapcrawl/crawler.rb:120:in `extract_urls_from!'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.3/lib/snapcrawl/crawler.rb:112:in `extract_urls_from'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.3/lib/snapcrawl/crawler.rb:74:in `block in crawl_and_snap'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.3/lib/snapcrawl/crawler.rb:65:in `each'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.3/lib/snapcrawl/crawler.rb:65:in `crawl_and_snap'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.3/lib/snapcrawl/crawler.rb:59:in `block in crawl'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.3/lib/snapcrawl/crawler.rb:58:in `times'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.3/lib/snapcrawl/crawler.rb:58:in `crawl'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.3/lib/snapcrawl/crawler.rb:34:in `execute'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.3/lib/snapcrawl/crawler.rb:26:in `handle'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.3/bin/snapcrawl:6:in `<top (required)>'
    from /usr/local/bin/snapcrawl:22:in `load'
    from /usr/local/bin/snapcrawl:22:in `<main>'

I then checked my snapcrawl version (I assumed the bundle install was updating it) but I'm still on 0.2.3 - What command do I need to run to update it? - gem update hasn't made any changes to my version.

DannyBen commented 5 years ago

Hmm...

Alright - forget the Gemfile solution (delete this file). I have just published the gem to RubyGems, so you can simply run this:

gem install snapcrawl --version 0.2.4rc1

And then check the snapcrawl version (snapcrawl --version) before proceeding.

SamuelMTDavies commented 5 years ago

Still no images but I'm on the right snapcrawl version.

Console suggests that I'm missing a dependency? Cliver or phantomjs? It's a little cryptic

Snapping picture... /Library/Ruby/Gems/2.3.0/gems/cliver-0.3.2/lib/cliver/dependency.rb:143:in `raise_not_found!': Could not find an executable ["phantomjs"] on your path. (Cliver::Dependency::NotFound)
    from /Library/Ruby/Gems/2.3.0/gems/cliver-0.3.2/lib/cliver/dependency.rb:116:in `detect!'
    from /Library/Ruby/Gems/2.3.0/gems/cliver-0.3.2/lib/cliver.rb:24:in `detect!'
    from /Library/Ruby/Gems/2.3.0/gems/poltergeist-1.12.0/lib/capybara/poltergeist/client.rb:48:in `initialize'
    from /Library/Ruby/Gems/2.3.0/gems/poltergeist-1.12.0/lib/capybara/poltergeist/client.rb:14:in `new'
    from /Library/Ruby/Gems/2.3.0/gems/poltergeist-1.12.0/lib/capybara/poltergeist/client.rb:14:in `start'
    from /Library/Ruby/Gems/2.3.0/gems/poltergeist-1.12.0/lib/capybara/poltergeist/driver.rb:44:in `client'
    from /Library/Ruby/Gems/2.3.0/gems/poltergeist-1.12.0/lib/capybara/poltergeist/driver.rb:25:in `browser'
    from /Library/Ruby/Gems/2.3.0/gems/poltergeist-1.12.0/lib/capybara/poltergeist/driver.rb:207:in `resize'
    from /Library/Ruby/Gems/2.3.0/gems/webshot-0.1.0/lib/webshot/screenshot.rb:15:in `initialize'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/singleton.rb:142:in `new'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/singleton.rb:142:in `block in instance'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/singleton.rb:140:in `synchronize'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/singleton.rb:140:in `instance'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:233:in `webshot'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:104:in `snap!'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:85:in `snap'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:72:in `block in crawl_and_snap'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:65:in `each'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:65:in `crawl_and_snap'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:59:in `block in crawl'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:58:in `times'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:58:in `crawl'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:34:in `execute'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:26:in `handle'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/bin/snapcrawl:6:in `<top (required)>'
    from /usr/local/bin/snapcrawl:22:in `load'
    from /usr/local/bin/snapcrawl:22:in `<main>'

DannyBen commented 5 years ago

Yeah, not necessarily cryptic, just a messy Ruby backtrace :)

This is good - see the first or second line: Could not find an executable ["phantomjs"] on your path

Download it manually and place it in your path: http://phantomjs.org/download.html

Should just be a single binary called "phantomjs"
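To confirm that the binary is really visible on the path before running snapcrawl again, a lookup along these lines may help. It is a minimal sketch, not how Cliver performs its detection; `which` is a hypothetical helper name, and it only checks that a regular, executable file with the given name exists in one of the PATH directories.

```ruby
# Hypothetical helper: return the full path of `name` if an executable file
# with that name exists in one of the PATH directories, or nil otherwise.
# The `path` argument defaults to the current PATH and exists mainly so the
# lookup can be tried against an arbitrary directory list.
def which(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Possible usage:
#   puts(which("phantomjs") || "phantomjs is not on the PATH")
```

If this returns nil for "phantomjs", the download either landed outside the PATH directories or is missing its executable bit.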

I will try to see why it's not installing automatically.

SamuelMTDavies commented 5 years ago

Ok, so I added phantomjs to my path (/usr/local/bin) - the operation got much further but still failed on a missing dependency. Not sure which gem has a dependency on mini-magick?

Snap!  Snapping picture... null
accepted
null
cookiesAccepted
/Library/Ruby/Gems/2.3.0/gems/mini_magick-4.3.6/lib/mini_magick/image.rb:200:in `rescue in validate!': ImageMagick/GraphicsMagick is not installed (MiniMagick::Invalid)
    from /Library/Ruby/Gems/2.3.0/gems/mini_magick-4.3.6/lib/mini_magick/image.rb:197:in `validate!'
    from /Library/Ruby/Gems/2.3.0/gems/mini_magick-4.3.6/lib/mini_magick/image.rb:113:in `block in create'
    from /Library/Ruby/Gems/2.3.0/gems/mini_magick-4.3.6/lib/mini_magick/image.rb:112:in `tap'
    from /Library/Ruby/Gems/2.3.0/gems/mini_magick-4.3.6/lib/mini_magick/image.rb:112:in `create'
    from /Library/Ruby/Gems/2.3.0/gems/mini_magick-4.3.6/lib/mini_magick/image.rb:34:in `read'
    from /Library/Ruby/Gems/2.3.0/gems/mini_magick-4.3.6/lib/mini_magick/image.rb:90:in `block in open'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/open-uri.rb:37:in `open'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/open-uri.rb:37:in `open'
    from /Library/Ruby/Gems/2.3.0/gems/mini_magick-4.3.6/lib/mini_magick/image.rb:89:in `open'
    from /Library/Ruby/Gems/2.3.0/gems/webshot-0.1.0/lib/webshot/screenshot.rb:73:in `capture'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:104:in `snap!'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:85:in `snap'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:72:in `block in crawl_and_snap'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:65:in `each'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:65:in `crawl_and_snap'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:59:in `block in crawl'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:58:in `times'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:58:in `crawl'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:34:in `execute'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/lib/snapcrawl/crawler.rb:26:in `handle'
    from /Library/Ruby/Gems/2.3.0/gems/snapcrawl-0.2.4rc1/bin/snapcrawl:6:in `<top (required)>'
    from /usr/local/bin/snapcrawl:22:in `load'
    from /usr/local/bin/snapcrawl:22:in `<main>'

DannyBen commented 5 years ago

I was afraid of that... the screenshot library uses a bunch of dependencies with zero-to-minimal maintenance status...

According to this StackOverflow answer, running brew install graphicsmagick should help. Can you try?

DannyBen commented 5 years ago

Also - please install the latest snapcrawl:

gem install snapcrawl --version 0.2.4rc4

Changes in this version:

I hope this works.

On my machine, it seems to be working nicely, and the captures are actually usable (with the exception maybe of the homepage, which has this dynamic scroll animation - you will have to capture it manually I guess)

This is the output you should expect: [screenshot: out]

SamuelMTDavies commented 5 years ago

We have images!

Note: I can't run it as uown.co; I have to use https://www.uown.co because of the redirects. Maybe this is something about my Mac's config?

Sams-MacBook-Pro:~ sam$ snapcrawl go uown.co -d2

-----> Visit: http://uown.co Snap! Snapping picture... done Crawl! Extracting links...

RuntimeError redirection forbidden: http://uown.co -> https://www.uown.co/

I am about to shoot out but I will try a full -d4 run later on today

DannyBen commented 5 years ago

Cool. About the redirects, maybe since it is a different phantomjs build, it behaves differently on a mac.
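For reference, the backtrace posted earlier in this thread shows the "redirection forbidden" error being raised from open-uri.rb during link extraction, so it may come from Ruby's open-uri on that machine rather than from the phantomjs build. The sketch below shows one possible workaround under that assumption: retry with the redirect target when the redirect only upgrades to https. `redirect_target` and `fetch_following_redirects` are hypothetical names, not snapcrawl code, and the fetching itself is passed in as a block so the sketch does not depend on a particular open-uri version.

```ruby
require "uri"

# Message format taken from the error shown in this thread:
#   redirection forbidden: http://uown.co -> https://www.uown.co/
REDIRECT_FORBIDDEN = /\Aredirection forbidden: (\S+) -> (\S+)/

# Hypothetical helper: extract the redirect target from the error message,
# or return nil when the message is about something else.
def redirect_target(message)
  match = REDIRECT_FORBIDDEN.match(message)
  match && match[2]
end

# Hypothetical helper: call the given block with the URL. If it raises a
# "redirection forbidden" RuntimeError whose target is an https URL, retry
# with that target, at most `limit` times. Any other error is re-raised.
def fetch_following_redirects(url, limit: 3, &fetch)
  fetch.call(url)
rescue RuntimeError => e
  target = redirect_target(e.message)
  raise unless target && limit > 0 && URI(target).scheme == "https"

  fetch_following_redirects(target, limit: limit - 1, &fetch)
end

# Possible usage with open-uri (assumes network access; on Ruby 2.3 the call
# would be `open(u)` instead of `URI.open(u)`):
#   require "open-uri"
#   html = fetch_following_redirects("http://uown.co") { |u| URI.open(u).read }
```

Only following redirects that land on https keeps open-uri's protection against being sent from a secure URL to an insecure one.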

I am releasing it as a final 0.2.4 version.

SamuelMTDavies commented 5 years ago

@DannyBen Success. It crawled all the pages and took great screenshots. The only exception, as you said, was the homepage, where the scrolling JavaScript causes it to look blank.

DannyBen commented 5 years ago

Excellent, glad we could sort this out. I am closing this ticket, but feel free to comment if there is anything else.