buren / wayback_archiver

Ruby gem to send URLs to Wayback Machine
https://rubygems.org/gems/wayback_archiver
MIT License

`check_path': path conflicts with opaque (URI::InvalidURIError) #25

Closed bartman081523 closed 5 years ago

bartman081523 commented 5 years ago

Traceback (most recent call last):
28: from /home/user/.gem/ruby/2.5.0/bin/wayback_archiver:23:in `<main>'
27: from /home/user/.gem/ruby/2.5.0/bin/wayback_archiver:23:in `load'
26: from /home/user/.gem/ruby/2.6.0/gems/wayback_archiver-1.2.1/bin/wayback_archiver:81:in `<top (required)>'
25: from /home/user/.gem/ruby/2.6.0/gems/wayback_archiver-1.2.1/bin/wayback_archiver:81:in `each'
24: from /home/user/.gem/ruby/2.6.0/gems/wayback_archiver-1.2.1/bin/wayback_archiver:82:in `block in <top (required)>'
23: from /home/user/.gem/ruby/2.6.0/gems/wayback_archiver-1.2.1/lib/wayback_archiver.rb:50:in `archive'
22: from /home/user/.gem/ruby/2.6.0/gems/wayback_archiver-1.2.1/lib/wayback_archiver.rb:91:in `crawl'
21: from /home/user/.gem/ruby/2.6.0/gems/wayback_archiver-1.2.1/lib/wayback_archiver/archive.rb:75:in `crawl'
20: from /home/user/.gem/ruby/2.6.0/gems/wayback_archiver-1.2.1/lib/wayback_archiver/url_collector.rb:37:in `crawl'
19: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/spidr.rb:53:in `site'
18: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/agent.rb:274:in `site'
17: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/agent.rb:355:in `start_at'
16: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/agent.rb:373:in `run'
15: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/agent.rb:665:in `visit_page'
14: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/agent.rb:599:in `get_page'
13: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/agent.rb:788:in `prepare_request'
12: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/agent.rb:605:in `block in get_page'
11: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/agent.rb:679:in `block in visit_page'
10: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:238:in `each_url'
9: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:188:in `each_link'
8: from /home/user/.gem/ruby/2.6.0/gems/nokogiri-1.10.1/lib/nokogiri/xml/node_set.rb:237:in `each'
7: from /home/user/.gem/ruby/2.6.0/gems/nokogiri-1.10.1/lib/nokogiri/xml/node_set.rb:237:in `upto'
6: from /home/user/.gem/ruby/2.6.0/gems/nokogiri-1.10.1/lib/nokogiri/xml/node_set.rb:238:in `block in each'
5: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:189:in `block in each_link'
4: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:182:in `block in each_link'
3: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:239:in `block in each_url'
2: from /home/user/.gem/ruby/2.6.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:283:in `to_absolute'
1: from /usr/lib/ruby/2.6.0/uri/generic.rb:807:in `path='
/usr/lib/ruby/2.6.0/uri/generic.rb:753:in `check_path': path conflicts with opaque (URI::InvalidURIError)
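
For context, the final frames point at `URI::Generic#path=`, which refuses to assign a path to a URI that already has an opaque part. The following is a minimal sketch of that behavior using only Ruby's standard `uri` library. The `mailto:` link and the helper name `path_assignment_error` are illustrative assumptions; this is not spidr's actual code path, and which link on the crawled site triggered the error is not known from the report.

```ruby
require "uri"

# Hypothetical helper (not part of spidr or wayback_archiver):
# tries to assign a path to the parsed link and returns the error
# message if URI rejects the assignment, or nil if it succeeds.
def path_assignment_error(link)
  uri = URI.parse(link)
  uri.path = "/index.html"
  nil
rescue URI::InvalidURIError => e
  e.message
end

# An opaque URI such as mailto: has no hierarchical path to replace.
puts path_assignment_error("mailto:someone@example.com").inspect

# A hierarchical URI accepts a new path without complaint.
puts path_assignment_error("https://example.com/a").inspect
```

A crawler that normalizes every link it finds would need to skip or guard opaque URIs before touching `path`.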

buren commented 5 years ago

Thank you for your report.

Unfortunately, this is a problem in spidr; see https://github.com/postmodern/spidr/issues/66. The issue has been closed and a fix has been merged, but the author has not yet released a new version to RubyGems, and a published gem cannot depend on the GitHub master branch of another gem.

I've been thinking about pushing my own patched version of spidr to RubyGems, but haven't opted for that yet. I might, though. Perhaps we could first open an issue in the original GitHub repo politely asking Postmodern to release a new version.

bartman081523 commented 5 years ago

Thank you for your concern. I also found the fix for spidr here: https://github.com/postmodern/spidr/commit/ae885272619f74c69d43ec77852f158768c6d804

bartman081523 commented 5 years ago

You could bundle the git version of spidr with Bundler (see https://bundler.io/v1.12/git.html) by adding this to the .gemspec: gem 'spidr', :git => 'https://github.com/postmodern/spidr.git'

buren commented 5 years ago

Yeah, I know that you can specify that in a Gemfile; however, what we need here is to add it to wayback_archiver.gemspec, and .gemspec files do not support that.

From https://bundler.io/v1.12/git.html

Because RubyGems lacks the ability to handle gems from git [...]

See https://stackoverflow.com/questions/6499410/ruby-gemspec-dependency-is-possible-have-a-git-branch-dependency.
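
To illustrate the limitation, a gemspec dependency is declared with a gem name and version requirements only. The sketch below builds a throwaway specification in memory; the gem name `example_gem`, its metadata, and the `~> 0.6` constraint are assumptions for illustration, not the contents of wayback_archiver.gemspec.

```ruby
require "rubygems"

# Hypothetical specification, built in memory for illustration only.
spec = Gem::Specification.new do |s|
  s.name    = "example_gem"
  s.version = "0.0.1"
  s.summary = "Illustration only"
  s.authors = ["Example Author"]

  # A runtime dependency is a name plus version requirements.
  # It is resolved against released gems, not a git repository.
  s.add_dependency "spidr", "~> 0.6"
end

dep = spec.dependencies.first
puts "#{dep.name} (#{dep.requirement})"
```

Because the dependency records only a name and a version range, a fix that exists only on a git branch is not reachable this way until it is released.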


ℹ️ Workaround

Explicitly add spidr to your Gemfile:

gem 'spidr', github: 'postmodern/spidr'
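
For context, a complete Gemfile using this workaround might look like the sketch below. The `source` line and the unpinned `wayback_archiver` entry are assumptions about a typical project, not something taken from this thread.

```ruby
# Gemfile (sketch; adjust to your project)
source "https://rubygems.org"

gem "wayback_archiver"

# Pull spidr from GitHub so the merged fix is used instead of the
# latest released version.
gem "spidr", github: "postmodern/spidr"
```

After editing the Gemfile, run `bundle install` and start the tool through Bundler (for example `bundle exec wayback_archiver`) so the git version of spidr is the one loaded. The longer form suggested earlier in the thread, `gem 'spidr', :git => 'https://github.com/postmodern/spidr.git'`, should be equivalent.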