buren / wayback_archiver

Ruby gem to send URLs to Wayback Machine
https://rubygems.org/gems/wayback_archiver
MIT License

Error in running the script #46

Closed xplosionmind closed 2 years ago

xplosionmind commented 2 years ago

I have a Cron Job which runs this gem once a week:

My crontab -e:

# m h  dom mon dow   command
0 1 * * 1 /usr/local/bin/wayback_archiver https://tommi.space/pages-to-archive --crawl --limit=100 --verbose --log=$HOME/wayback_archiver.log && echo "\n$(date) wayback_archiver success!" >> $HOME/wayback_archiver.log

I get this error:

/usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:135:in `require': cannot load such file -- robots (LoadError)
    from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:135:in `rescue in require'
    from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:39:in `require'
    from /var/lib/gems/2.5.0/gems/wayback_archiver-1.4.0/lib/wayback_archiver/url_collector.rb:2:in `<top (required)>'
    from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
    from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
    from /var/lib/gems/2.5.0/gems/wayback_archiver-1.4.0/lib/wayback_archiver.rb:4:in `<top (required)>'
    from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
    from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
    from /var/lib/gems/2.5.0/gems/wayback_archiver-1.4.0/bin/wayback_archiver:4:in `<top (required)>'
    from /usr/local/bin/wayback_archiver:23:in `load'
    from /usr/local/bin/wayback_archiver:23:in `<main>'

What does it mean?
Could you help me fix it?

Thank you very much!

bartman081523 commented 2 years ago

I have two things to consider here:

1st: The command succeeds on my Ruby 3 install, so consider upgrading as a fix.

~/.local/share/gem/ruby/3.0.0/bin/wayback_archiver https://tommi.space/pages-to-archive --crawl --limit=100 --verbose
I, [2021-11-02T06:33:40.358061 #8179] INFO -- WaybackArchiver: Crawling https://tommi.space/pages-to-archive
I, [2021-11-02T06:33:40.358120 #8179] INFO -- WaybackArchiver: Request are sent with up to 1 parallel threads
D, [2021-11-02T06:33:41.150332 #8179] DEBUG -- WaybackArchiver: Found: https://tommi.space/pages-to-archive
D, [2021-11-02T06:33:41.150502 #8179] DEBUG -- WaybackArchiver: Requesting https://web.archive.org/save/https://tommi.space/pages-to-archive
D, [2021-11-02T06:33:54.899527 #8179] DEBUG -- WaybackArchiver: [302, FOUND] Requested https://web.archive.org/save/https://tommi.space/pages-to-archive
I, [2021-11-02T06:33:54.899654 #8179] INFO -- WaybackArchiver: Posted [302, FOUND] https://tommi.space/pages-to-archive
I, [2021-11-02T06:33:54.902015 #8179] INFO -- WaybackArchiver: Crawling of https://tommi.space/pages-to-archive finished, found 1 URL(s)
I, [2021-11-02T06:33:54.902097 #8179] INFO -- WaybackArchiver: 1 URL(s) posted to Wayback Machine

2nd: As you can see, your command archives only 1 URL: https://tommi.space/pages-to-archive The reason is that the URLs on that page are not links; they are just plain text.

For Wayback Archiver to crawl the pages you want to archive, you would have to turn those URLs into actual HTML links. Do you know what I mean?

2.1: Instead of an HTML page with HTML links, you could also use a plain-text file that contains only the URLs, one per line and without any HTML, and do this:

for line in $(wget -O- -q https://tommi.space/archive.txt); do wayback_archiver --crawl $line ; done

This loops over every line in the plain-text file and passes each URL to Wayback Archiver as an argument. I think this would be much more stable.
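A slightly more defensive variant of that loop, as a sketch only: it reads the list line by line, skips blank lines, and quotes each URL so the shell does not glob characters such as ? or *. The names archive_urls and ARCHIVE_CMD are hypothetical and not part of wayback_archiver; ARCHIVE_CMD exists only to allow a dry run.

```shell
#!/bin/sh
# Read one URL per line from stdin and hand each one to the archiver.
# ARCHIVE_CMD is a hypothetical override for dry runs; by default the
# command from this thread (wayback_archiver --crawl) is used.
archive_urls() {
    while IFS= read -r url || [ -n "$url" ]; do
        # Skip blank lines.
        [ -z "$url" ] && continue
        # Quote the URL so the shell does not expand ? or * in it.
        # stdin is redirected so the command cannot swallow the URL list.
        ${ARCHIVE_CMD:-wayback_archiver --crawl} "$url" </dev/null
    done
}

# Usage (left commented out so loading this file has no side effects):
# wget -O- -q https://tommi.space/archive.txt | archive_urls
```

Setting ARCHIVE_CMD="echo would-archive" prints what would be sent without contacting the Wayback Machine.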

Best wishes, chlorophyll-zz

xplosionmind commented 2 years ago

First of all, thank you very much for your comprehensive reply.

the command is successful on my ruby 3 install: Maybe consider to upgrade, as a fix.

Unfortunately, both with RVM and rbenv I am unable to install Ruby 3.0.x… do you have any ideas why? I am running Debian 10 (and YunoHost) on a VPS.



2.1nd Instead of a html page with html links, you could also have a plain text file, with only the urls in the rows, without html and do this:

for line in $(wget -O- -q https://tommi.space/archive.txt); do wayback_archiver --crawl $line ; done

I did so. Here, thanks a lot!

bartman081523 commented 2 years ago

First of all, thank you very much for your comprehensive reply.

Thank you too for your gratitude. You are welcome.

Unfortunately, both with RVM and rbenv I am unable to install Ruby 3.0.x… do you have any ideas why? I am running Debian 10 (and YunoHost) on a VPS.

I am sorry for the slightly wrong information. There is no need to install Ruby 3.0 with RVM. The fixes for the "LoadError" are also in the Ruby v2.7+2 Debian package, but you do have to upgrade from Ruby 2.5.0.

According to the Debian package search (https://packages.debian.org/search?keywords=ruby), Ruby is available in v2.7+2 on Debian (stable).

The program should run as intended with Ruby 2.7. I am sure the errors that you posted initially come from an outdated Ruby 2.5.0 version. Once you have upgraded to the newest Ruby version available, the errors should be gone.

Update your Debian packages:

sudo apt update
sudo apt upgrade -y

Now the newest ruby in Debian should be available on your system.
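To confirm which Ruby the system actually ended up with, a small check along these lines may help. It is only a sketch: version_ge is a hypothetical helper name, and it assumes a sort that supports -V (as GNU coreutils on Debian does).

```shell
#!/bin/sh
# version_ge A B: succeed when version A is at least version B.
# Relies on "sort -V" (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (commented out because it needs ruby on PATH):
# version_ge "$(ruby -e 'print RUBY_VERSION')" 2.7 && echo "Ruby is new enough"
```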

If that fixes the error, you can close the issue. If you want to, you could also change the title to "ruby 2.5.0 - cannot load such file -- robots (LoadError)", so that other members of the community with the same problem can find this fix when they search for the error.

If you still get errors after upgrading to Ruby 2.7, run this in the console and post the output:

ruby --version
git clone https://github.com/chlorophyll-zz/wayback_archiver --branch patch-2
cd wayback_archiver
gem build wayback_archiver.gemspec
gem install wayback_archiver-1.4.0.gem

I have tested with RVM and ruby-2.5.8 on Ubuntu. The install is successful and the program runs as intended. The "LoadError" in your first post does indeed come from an outdated Ruby 2.5.0 installation.

For reference, this is what I did (derived from https://rvm.io/rvm/install):

\curl -sSL https://get.rvm.io | bash
login $USER
rvm list known
rvm install ruby-2.5
git clone https://github.com/chlorophyll-zz/wayback_archiver --branch patch-2
cd wayback_archiver
gem build wayback_archiver.gemspec
gem install wayback_archiver-1.4.0.gem
wayback_archiver --crawl www.example.com

I did so. Here, thanks a lot!

Great, I am glad that I could help.

Best wishes.

xplosionmind commented 2 years ago

Thanks again. I have no idea why, but Debian does not update Ruby to 2.7; I am stuck with 2.5.

So I followed your suggestion and cloned the patch branch of wayback_archiver you pointed to. However, after running the preceding commands, when I execute gem install 'wayback_archiver-1.4.0.gem' (with or without the quotes), I get:

gem install 'wayback_archiver-1.4.0.gem'
ERROR:  While executing gem ... (ArgumentError)
    wrong number of arguments (given 4, expected 1)

I am confused…

bartman081523 commented 2 years ago

Thanks again. So, I have no idea why, but Debian does not update Ruby to 2.7, I am stuck with 2.5.

It looks like you are on Debian oldstable (buster), where Ruby is only available in version 2.5.1. See here: https://packages.debian.org/search?keywords=ruby

buster (oldstable) (ruby): 1:2.5.1: amd64

wrong number of arguments (given 4, expected 1)

I searched for the error, and in another thread it was caused by "leftover" (incompatible?) installed gems.

As a first test, you can delete your Ruby gems folder (but only if you do not need Ruby gems for anything other than wayback_archiver on your system).

In a shell, run gem environment and delete the Ruby gems folders listed under "GEM PATHS:".
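Before deleting anything, it may help to print the gem directories one per line and review them. The sketch below only lists paths and removes nothing; list_gem_paths is a hypothetical helper name, and the usage line assumes your RubyGems version accepts the gempath argument to gem environment, which prints the paths separated by colons.

```shell
#!/bin/sh
# Print each directory of a colon-separated gem path list on its own line.
# This only lists the directories; it does not delete anything.
list_gem_paths() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# Usage (commented out because it needs RubyGems installed):
# list_gem_paths "$(gem environment gempath)"
```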

Then you could try a second time (my patch is already merged, so you can clone the default repo):

git clone https://github.com/buren/wayback_archiver
cd wayback_archiver
gem build wayback_archiver.gemspec
gem install wayback_archiver-1.4.0.gem

As a last resort, I found this guide for getting Ruby 2.7 on Debian buster (as you already mentioned):

https://linuxconfig.org/how-to-set-up-rvm-on-debian-10-buster

I would first clean out the "GEM PATHS:" directories reported by gem environment, and I would also uninstall the Debian Ruby packages before installing RVM, so that there are not two (possibly colliding) Ruby versions on the system, e.g.: sudo apt-get purge ruby ruby-full ruby-rubygems ruby-dev ruby2.5

bartman081523 commented 2 years ago

Try this (installs RVM, Ruby 3, and Wayback Archiver):

sudo rm -rf /var/lib/gems
sudo rm /usr/local/bin/wayback_archiver
sudo apt-get purge ruby
sudo apt-get autoremove
sudo apt-get install curl
sudo apt-get install git
command curl -sSL https://rvm.io/mpapis.asc | gpg --import -
command curl -sSL https://rvm.io/pkuczynski.asc | gpg --import -
\curl -sSL https://get.rvm.io | bash -s stable
source /home/$USER/.rvm/scripts/rvm
rvm install ruby-3
git clone https://github.com/buren/wayback_archiver
cd wayback_archiver
gem build wayback_archiver.gemspec
gem install wayback_archiver-1.4.0.gem 
wayback_archiver --url www.example.com
echo "source /home/$USER/.rvm/scripts/rvm" >> /home/$USER/.bashrc

xplosionmind commented 2 years ago

Try this

IT WORKS!
Thank you very much. Now, the problem is that I had this in my crontab:

# m h  dom mon dow   command
0 1 * * 1 /usr/local/bin/wayback_archiver https://tommi.space/pages-to-archive --crawl --limit=100 --verbose --log=$HOME/wayback_archiver.log && echo "\n$(date) wayback_archiver success!" >> $HOME/wayback_archiver.log

Specifically, running /usr/local/bin/wayback_archiver does not work. Which command should I call from crontab to schedule wayback archiver to run?

bartman081523 commented 2 years ago

IT WORKS!

Great! Finally :-D Then you can close the Issue.

Specifically, running /usr/local/bin/wayback_archiver does not work.

Try this; it works for me: source /home/$USER/.rvm/scripts/rvm && wayback_archiver (+ your parameters)
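One caveat when moving this into crontab: cron runs commands with /bin/sh by default, which on Debian is dash and has no source builtin, and $USER is not guaranteed to be set there. A small wrapper script is one way around both, sketched below. The file location $HOME/bin/wayback_cron.sh, the function name run_archiver, and the RVM_SCRIPT variable are assumptions for illustration, not something RVM or wayback_archiver provides.

```shell
#!/bin/bash
# Hypothetical cron wrapper, e.g. saved as $HOME/bin/wayback_cron.sh.
# RVM_SCRIPT defaults to the per-user RVM path used earlier in this thread.
RVM_SCRIPT="${RVM_SCRIPT:-$HOME/.rvm/scripts/rvm}"

run_archiver() {
    # Load RVM first so that the wayback_archiver executable is found.
    if [ ! -r "$RVM_SCRIPT" ]; then
        echo "RVM script not found: $RVM_SCRIPT" >&2
        return 1
    fi
    . "$RVM_SCRIPT"
    wayback_archiver "$@"
}

# When called with arguments (as cron would), pass them straight through.
if [ "$#" -gt 0 ]; then
    run_archiver "$@"
fi

# Example crontab entry (assumed path; the flags are the ones used in this thread):
# 0 1 * * 1 /bin/bash "$HOME/bin/wayback_cron.sh" https://tommi.space/pages-to-archive --crawl --limit=100 --verbose --log="$HOME/wayback_archiver.log"
```

Calling the wrapper through /bin/bash explicitly means the crontab line does not depend on cron's default shell.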