ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Allow resuming a crawl #58

Open ivan opened 8 years ago

ivan commented 8 years ago

Some details in https://github.com/ludios/grab-site/issues/57#issuecomment-164185837

ivan commented 8 years ago

But note https://github.com/chfoo/wpull/issues/131

ivan commented 8 years ago

CRIU might be another option for suspending/resuming crawls on Ubuntu 15.10/16.04+. It works by dumping a snapshot of the process to disk and later restoring it.

As root, I managed to dump and restore a grab-site process on Ubuntu 15.10 with:

apt-get install criu
mkdir -p /root/criu-dump-gs-1
cd /root/criu-dump-gs-1
# dump writes its image files into the current directory
criu dump --shell-job --tcp-established -t GRAB-SITE-PID
# restore reads the image files back from the current directory
criu restore --shell-job --tcp-established

Note that CRIU is very finicky, and the restore sometimes needs to be done in a new pid namespace: https://criu.org/When_C/R_fails
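
If the restore fails because the original PID is already taken, restoring inside fresh PID and mount namespaces might help (untested sketch; unshare flags per util-linux):

cd /root/criu-dump-gs-1
unshare --pid --fork --mount-proc criu restore --shell-job --tcp-established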

This might also fail to work if grab-site or its dependencies were updated after grab-site was launched; I have not tested this yet.

ivan commented 6 years ago

I again had some success with criu dump --tcp-established --shell-job --ghost-limit 20000000 -t PID and criu restore --tcp-established --shell-job (in a tmux), but unfortunately grab-site processes crash about 50% of the time on restore with:

Traceback (most recent call last):
  File "/home/grab/gs-venv/bin/grab-site", line 4, in <module>
    main.main()
  File "/home/grab/gs-venv/lib/python3.4/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/libgrabsite/main.py", line 399, in main
    WebProcessor.NO_DOCUMENT_STATUS_CODES = \
  File "/home/grab/gs-venv/lib/python3.4/site-packages/wpull/__main__.py", line 40, in main
    exit_code = application.run_sync()
  File "/home/grab/gs-venv/lib/python3.4/site-packages/wpull/app.py", line 118, in run_sync
    return self._event_loop.run_until_complete(self.run())
  File "/home/grab/gs-venv/lib/python3.4/site-packages/trollius/base_events.py", line 338, in run_until_complete
    self.run_forever()
  File "/home/grab/gs-venv/lib/python3.4/site-packages/trollius/base_events.py", line 309, in run_forever
    self._run_once()
  File "/home/grab/gs-venv/lib/python3.4/site-packages/trollius/base_events.py", line 1181, in _run_once
    event_list = self._selector.select(timeout)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/trollius/selectors.py", line 437, in select
    fd_event_list = wrap_error(self._epoll.poll, timeout, max_ev)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/trollius/py33_exceptions.py", line 144, in wrap_error
    return func(*args, **kw)
OverflowError: timeout is too large
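
Given the roughly 50% failure rate, a retry loop around the restore might be worth trying (untested sketch; assumes the images are in the current directory and that re-restoring after a failed attempt is safe, which may not hold for --tcp-established sockets):

for attempt in 1 2 3; do
    criu restore --tcp-established --shell-job && break
    echo "restore attempt $attempt failed; retrying" >&2
done
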
frankyifei commented 4 years ago

Since wpull can resume a crawl when used with the --database option, would it be difficult to implement this in grab-site?

JustAnotherArchivist commented 3 years ago

wpull cannot be fully resumed with --database; for example, cookies aren't kept in the database (ArchiveTeam/wpull#448). Not sure if any other state is missing.
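
Until that is fixed, persisting the cookie jar to a file alongside the database might be a workaround (untested sketch, assuming wpull honors these wget-style cookie flags; example.com is a placeholder):

wpull3 --database wpull.db --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies http://example.com/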

TheTechRobo commented 3 years ago

Is there a way to resume the process nonetheless? I doubt cookies would matter for the things I'm crawling. They're just large websites.

TheTechRobo commented 2 years ago

I would like to pick this up (even with the cookie problem; we could just show a warning).

How difficult would this be to implement @ivan?

FraMecca commented 1 year ago

At the moment I am able to resume crawls using: wpull3 --warc-file=<warc file> --mirror -r -np --page-requisites --no-check-certificate --no-robots --database wpull.db -o wpull.log http://example.com/
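
For anyone trying this: with --database, re-running the same command against the existing wpull.db should continue from the saved URL table. Adding --warc-append, if your wpull version supports it, should avoid overwriting the WARC written by the earlier run (sketch under the same assumptions as above):

wpull3 --warc-file=<warc file> --warc-append --mirror -r -np --page-requisites --no-check-certificate --no-robots --database wpull.db -o wpull.log http://example.com/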

retro-mouse commented 1 year ago

I don't know how difficult it would be to implement this in grab-site, but for large site crawls it's kind of a must. I would greatly appreciate it.