buren / wayback_archiver

Ruby gem to send URLs to the Wayback Machine
https://rubygems.org/gems/wayback_archiver
MIT License
57 stars 11 forks

error while collecting URLs during crawl #7

Closed: milliken closed this issue 7 years ago

milliken commented 7 years ago

wayback_archiver stops collecting URLs and throws the error below. The site I'm crawling is pretty big; I'm not sure if that's a factor or not.

Error output below:

[200, OK] https://www.bls.gov/guide/geography/projections.htm
[200, OK] https://www.bls.gov/schedule/2017/01_sched.htm
/Users/larry/.rvm/gems/ruby-2.4.0/gems/site_mapper-0.0.13/lib/site_mapper/crawler.rb:60:in `rescue in collect_urls': uninitialized constant SiteMapper::Crawler::IRB (NameError)
    from /Users/larry/.rvm/gems/ruby-2.4.0/gems/site_mapper-0.0.13/lib/site_mapper/crawler.rb:49:in `collect_urls'
    from /Users/larry/.rvm/gems/ruby-2.4.0/gems/site_mapper-0.0.13/lib/site_mapper/crawler.rb:35:in `collect_urls'
    from /Users/larry/.rvm/gems/ruby-2.4.0/gems/site_mapper-0.0.13/lib/site_mapper.rb:35:in `map'
    from /Users/larry/.rvm/gems/ruby-2.4.0/gems/wayback_archiver-0.0.11/lib/wayback_archiver/url_collector.rb:22:in `crawl'
    from /Users/larry/.rvm/gems/ruby-2.4.0/gems/wayback_archiver-0.0.11/lib/wayback_archiver.rb:32:in `archive'
    from /Users/larry/.rvm/gems/ruby-2.4.0/gems/wayback_archiver-0.0.11/bin/wayback_archiver:9:in `<top (required)>'
    from /Users/larry/.rvm/gems/ruby-2.4.0/bin/wayback_archiver:22:in `load'
    from /Users/larry/.rvm/gems/ruby-2.4.0/bin/wayback_archiver:22:in `<main>'
    from /Users/larry/.rvm/gems/ruby-2.4.0/bin/ruby_executable_hooks:15:in `eval'
    from /Users/larry/.rvm/gems/ruby-2.4.0/bin/ruby_executable_hooks:15:in `<main>'
buren commented 7 years ago

Thanks for reporting @milliken! Sorry for not getting back to you earlier.

wayback_archiver uses another gem, site_mapper, and that's where the crawling is done.

I wonder if site_mapper/crawler.rb#L60 could be to blame. That's where I try to catch certain exceptions (which I really should do there...).
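
Judging from the traceback, the NameError comes from the rescue clause itself: it references the IRB constant, which is only defined once "irb" has been required, so the rescue blows up and masks whatever error the crawler originally hit. A minimal sketch of that failure mode (not the actual site_mapper code):

    def collect_urls
      raise "network error while crawling"  # stand-in for whatever the crawler actually hit
    rescue StandardError
      IRB.start  # NameError: uninitialized constant IRB, unless "irb" was required first
    end

    collect_urls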

For some time I've wanted to replace site_mapper with spidr, a much more mature and stable gem. I even have a half-finished branch for it: https://github.com/buren/wayback_archiver/compare/spidr.
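
For reference, crawling with spidr looks roughly like this (an illustrative sketch, not the code on that branch; the site URL is just an example):

    require "spidr"

    # Collect every URL found while crawling a site (illustrative sketch,
    # not the code on the spidr branch).
    urls = []
    Spidr.site("https://www.bls.gov/") do |spider|
      spider.every_url { |url| urls << url.to_s }
    end
    puts "Found #{urls.size} URLs"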

@milliken if this is something that blocks you, I'd be happy to work with you to find a workaround (or just finish the spidr branch). If not, I still want to move to spidr ;)

milliken commented 7 years ago

Thanks for responding @buren. I was hoping to use it before some US federal socioeconomic data that hadn't been recently crawled by the Wayback Machine disappeared from websites. So for that purpose, a workaround would be great. I'm not much of a Ruby expert but would be happy to help test changes. Spidr does sound like the way to go in the long term, and I'd be happy to help test that too!

buren commented 7 years ago

@milliken I have a working implementation using spidr (rather than site_mapper); see https://github.com/buren/wayback_archiver/pull/8.

While testing it I ran into a problem where one of the sites I crawled had a bad link in the HTML, causing an exception in spidr. I posted an issue in spidr's GitHub repo: https://github.com/postmodern/spidr/issues/57.
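
For context, this is the kind of thing a malformed href can trigger when a crawler tries to parse it (illustrative only, not the exact failure inside spidr):

    require "uri"

    # A malformed href raises when parsed with Ruby's URI
    # (illustrative only; not the exact exception seen inside spidr):
    URI.parse("http://example.com/bad path")
    # => URI::InvalidURIError: bad URI(is not URI?): "http://example.com/bad path"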

If you want to test it out, that would be awesome 🙇

I'll report back here if I find anything else ;)

milliken commented 7 years ago

Great. That’s quick. How can I install it to try the branch?

buren commented 7 years ago

@milliken I'll post instructions in a bit, currently running/testing the archiver on https://www.bls.gov :)
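
In the meantime, the usual way to try an unreleased branch is to point a Gemfile at it; a sketch (the branch name is assumed from the comparison link earlier in this thread):

    # Gemfile
    source "https://rubygems.org"

    gem "wayback_archiver", git: "https://github.com/buren/wayback_archiver.git", branch: "spidr"

Then run bundle install and invoke the executable with bundle exec wayback_archiver.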

Also, I think I'll try to increase archiving performance; it's quite painful to archive a large site. I'll try to introduce a thread pool or something for speed 🚀

buren commented 7 years ago

tl;dr: @milliken I've released a new version (v0.12) of wayback_archiver to RubyGems. You can install the new version by running $ gem update wayback_archiver or with bundle update if you're using a Gemfile.

The new release contains two changes:

  1. Replacing site_mapper with spidr #8
  2. Concurrency for the crawler #9
    • The collection of URLs is still done in a single thread; once that finishes, 10 threads are used to send those URLs to the Wayback Machine (see the sketch after this list). (I'd really like to be smarter about this and push each URL as it's found rather than waiting for all URLs to be collected and then posting them.)
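
Roughly, the posting step follows this pattern (an illustrative sketch, not the gem's internal code; the Save Page Now endpoint and the archive_all helper name are assumptions):

    require "net/http"
    require "uri"

    # Sketch of the pattern described above: URLs are collected first, then
    # posted to the Wayback Machine from a fixed pool of 10 worker threads.
    THREAD_COUNT = 10

    def archive_all(urls)
      queue = Queue.new
      urls.each { |url| queue << url }

      workers = Array.new(THREAD_COUNT) do
        Thread.new do
          loop do
            url = begin
              queue.pop(true) # non-blocking pop; raises ThreadError when the queue is empty
            rescue ThreadError
              break
            end
            # "Save Page Now" style request (endpoint shown for illustration).
            Net::HTTP.get_response(URI("https://web.archive.org/save/#{url}"))
          end
        end
      end
      workers.each(&:join)
    end

    archive_all(["https://www.bls.gov/", "https://www.bls.gov/schedule/2017/01_sched.htm"])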

If you have any problems with the new release, post another issue :)