Closed (milliken closed this issue 7 years ago)
Thanks for reporting @milliken! Sorry for not getting back earlier.
wayback_archiver uses another gem, site_mapper, and that's where the crawling is done.
I wonder if site_mapper/crawler.rb#L60 could be to blame. That's where I try to catch certain exceptions (which I really should do there..).
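To make the concern concrete, here is a rough sketch of catching only a narrow set of errors per URL so that one bad page doesn't stop the whole crawl, while real bugs still surface. This is not the code in site_mapper/crawler.rb; `crawl_each` and `RECOVERABLE_ERRORS` are hypothetical names, and the list of error classes is an assumption.

```ruby
require 'net/http'
require 'uri'

# Hypothetical list of errors that should skip a single URL
# instead of aborting the crawl (an assumption, adjust as needed).
RECOVERABLE_ERRORS = [
  SocketError,
  Timeout::Error,
  Errno::ECONNREFUSED,
  Errno::ECONNRESET,
  URI::InvalidURIError
].freeze

# Hypothetical helper: yields each URL to the given block.
# Recoverable errors are recorded and the loop continues;
# any other exception propagates to the caller.
# Returns a hash of url => exception for the URLs that failed.
def crawl_each(urls)
  failures = {}
  urls.each do |url|
    begin
      yield(url)
    rescue *RECOVERABLE_ERRORS => e
      failures[url] = e
    end
  end
  failures
end
```

The returned hash makes it possible to print which URLs were skipped and why, rather than having the crawl stop without explanation.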
For some time I have wanted to replace site_mapper with spidr, which is a much more mature and stable gem. I even have a half-finished branch for it: https://github.com/buren/wayback_archiver/compare/spidr.
@milliken if this is something that blocks you, I'd be happy to work with you to find a workaround (or just finish the spidr branch). If not, I still want to move to spidr ;)
Thanks for responding @buren. I was hoping to use it before some US federal socioeconomic data that hadn't been recently crawled by Wayback disappeared from websites. So for that purpose, a workaround would be great. I'm not much of a Ruby expert but would be happy to help test changes. Spidr does sound like the way to go in the long term and I'd be happy to help test that too!
@milliken I have a working implementation using spidr (rather than site_mapper), see https://github.com/buren/wayback_archiver/pull/8.
When testing it I ran into a problem where one of the sites that I crawled had a bad link in the HTML, causing an exception in spidr. I posted an issue on spidr's GitHub repo here: https://github.com/postmodern/spidr/issues/57.
If you want to test it out, that would be awesome 🙇
I'll report back here if I find anything else ;)
Great. That’s quick. How can I install it to try the branch?
@milliken I'll post instructions in a bit, currently running/testing the archiver on https://www.bls.gov :)
Also, I think I will try to increase archiving performance; it's quite painful to archive a large site. I'll try to introduce a thread pool or something for speed 🚀
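As a rough idea of what such a thread pool could look like, here is a minimal sketch using only Ruby's standard library. It is not the implementation that ended up in wayback_archiver; `archive_concurrently` and its `concurrency:` parameter are hypothetical, and the block stands in for whatever submits a URL to the Wayback Machine.

```ruby
require 'thread'

# Hypothetical helper, not part of wayback_archiver's API.
# Runs the given block once per URL on a fixed number of worker
# threads and returns a hash of url => block result. If the block
# raises a StandardError, the exception is stored as that URL's result.
def archive_concurrently(urls, concurrency: 5)
  queue = Queue.new
  urls.each { |url| queue << url }

  results = {}
  lock = Mutex.new

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break # queue drained, this worker is done
        end

        outcome = begin
          yield(url)
        rescue StandardError => e
          e
        end

        # Hash writes from several threads are guarded by a mutex.
        lock.synchronize { results[url] = outcome }
      end
    end
  end

  workers.each(&:join)
  results
end
```

It could be called as `archive_concurrently(urls, concurrency: 10) { |url| ... }`. A fixed pool fed from a queue keeps the number of simultaneous requests bounded, which seems preferable to spawning one thread per URL on a large site.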
tl;dr: @milliken I've released a new version (v0.12) of wayback_archiver to RubyGems. You can install the new version by running $ gem update wayback_archiver or with bundle update if you're using a Gemfile.
The new release contains 2 changes, one of which replaces site_mapper with spidr (#8).
If you have any problems with the new release, post another issue :)
wayback_archiver stops collecting URLs and throws the error below. The site I'm crawling is pretty big, not sure if that is a factor or not.
Error output below: