Open ivan opened 8 years ago
CRIU might be another option for suspending/resuming crawls on Ubuntu 15.10/16.04+. It works by dumping a snapshot of the process to disk and restoring it later.
As root, I managed to dump and restore a grab-site process on Ubuntu 15.10 with:

```
apt-get install criu
mkdir -p /root/criu-dump-gs-1
cd /root/criu-dump-gs-1
criu dump --shell-job --tcp-established -t GRAB-SITE-PID
criu restore --shell-job --tcp-established
```
Note that CRIU is very finicky, and the restore sometimes needs to be done in a new pid namespace: https://criu.org/When_C/R_fails
This might also fail to work if grab-site or its dependencies were updated after grab-site was launched; I have not tested this yet.
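If you script the steps above, a small helper can assemble the criu invocations and look up the grab-site PID. This is only a sketch: it assumes criu is installed, the caller is root, and it uses criu's `-D`/`--images-dir` option instead of `cd`'ing into the dump directory.

```python
import subprocess

def criu_dump_cmd(pid, images_dir):
    # Build the `criu dump` invocation from the steps above, using
    # -D/--images-dir rather than changing into the dump directory.
    return [
        "criu", "dump",
        "--shell-job", "--tcp-established",
        "-D", images_dir,
        "-t", str(pid),
    ]

def criu_restore_cmd(images_dir):
    # Build the matching `criu restore` invocation.
    return [
        "criu", "restore",
        "--shell-job", "--tcp-established",
        "-D", images_dir,
    ]

def grab_site_pid():
    # Find a running grab-site process with pgrep; raises
    # CalledProcessError if none is running.
    out = subprocess.check_output(["pgrep", "-f", "grab-site"])
    return int(out.split()[0])
```

Both builders just return argument lists, so they can be passed straight to `subprocess.check_call` once you are running as root.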
I had some success again with

```
criu dump --tcp-established --shell-job --ghost-limit 20000000 -t PID
criu restore --tcp-established --shell-job
```

(in a tmux), but unfortunately grab-site processes crash about 50% of the time on restore with:
```
Traceback (most recent call last):
  File "/home/grab/gs-venv/bin/grab-site", line 4, in <module>
    main.main()
  File "/home/grab/gs-venv/lib/python3.4/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/libgrabsite/main.py", line 399, in main
    WebProcessor.NO_DOCUMENT_STATUS_CODES = \
  File "/home/grab/gs-venv/lib/python3.4/site-packages/wpull/__main__.py", line 40, in main
    exit_code = application.run_sync()
  File "/home/grab/gs-venv/lib/python3.4/site-packages/wpull/app.py", line 118, in run_sync
    return self._event_loop.run_until_complete(self.run())
  File "/home/grab/gs-venv/lib/python3.4/site-packages/trollius/base_events.py", line 338, in run_until_complete
    self.run_forever()
  File "/home/grab/gs-venv/lib/python3.4/site-packages/trollius/base_events.py", line 309, in run_forever
    self._run_once()
  File "/home/grab/gs-venv/lib/python3.4/site-packages/trollius/base_events.py", line 1181, in _run_once
    event_list = self._selector.select(timeout)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/trollius/selectors.py", line 437, in select
    fd_event_list = wrap_error(self._epoll.poll, timeout, max_ev)
  File "/home/grab/gs-venv/lib/python3.4/site-packages/trollius/py33_exceptions.py", line 144, in wrap_error
    return func(*args, **kw)
OverflowError: timeout is too large
```
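The OverflowError comes from epoll rejecting a poll timeout that no longer fits in a C int of milliseconds; after a restore, the event loop can compute an enormous timeout because the restored process's view of the clock has jumped. Clamping the timeout before it reaches the selector should avoid the crash. A minimal sketch of such a clamp follows; the one-day cap is an assumption for illustration, not trollius's actual code.

```python
# Cap selector timeouts so they always fit in epoll's int-milliseconds
# argument. The 24-hour limit is arbitrary; any value that keeps
# timeout * 1000 under INT_MAX works, at the cost of one spurious
# wakeup per interval.
MAX_SELECT_TIMEOUT = 24 * 3600  # seconds

def clamp_timeout(timeout):
    # None means "block until an event arrives" and passes through.
    if timeout is None:
        return None
    return min(timeout, MAX_SELECT_TIMEOUT)
```

Applying this to the `timeout` that trollius's `_run_once` computes before calling `self._selector.select(timeout)` would trade the crash for at most one extra wakeup per day.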
Since wpull has a resume feature when used with the --database option, would it be difficult to implement this in grab-site?
wpull cannot be fully resumed with --database; for example, cookies aren't kept in the database (ArchiveTeam/wpull#448). Not sure if any other state is missing.
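One way to check which state --database actually persists is to open the wpull.db SQLite file and list its tables. The sketch below makes no assumptions about wpull's schema; it just reads SQLite's own catalog.

```python
import sqlite3

def list_tables(db_path):
    # Enumerate table names from SQLite's sqlite_master catalog.
    # Works on any SQLite file, including wpull's --database output.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Running this against a wpull.db left behind by a crawl shows at a glance whether anything cookie-related is stored at all.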
Is there a way to resume the process nonetheless? I doubt cookies would matter for the things I'm crawling. They're just large websites.
I would like to pick this up (despite the cookie problem; we can just show a warning).
How difficult would this be to implement, @ivan?
At the moment I am able to resume crawls using:

```
wpull3 --warc-file=<warc file> --mirror -r -np --page-requisites --no-check-certificate --no-robots --database wpull.db -o wpull.log http://example.com/
```
I don't know how difficult this would be to implement in grab-site, but for large site crawls it's kind of a must. I would greatly appreciate it.
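On the grab-site side, a resume path could be as simple as reusing an existing wpull.db and printing the cookie warning suggested above. This is a hypothetical sketch; `check_resume` and its behavior are not part of grab-site.

```python
import os
import sys

def check_resume(db_path="wpull.db"):
    # Hypothetical: if a previous crawl left its database behind,
    # offer to resume from it, but warn that cookies were not
    # persisted there (ArchiveTeam/wpull#448).
    if os.path.exists(db_path):
        print("Resuming from {}; warning: cookies from the previous "
              "session are not restored.".format(db_path),
              file=sys.stderr)
        return True
    return False
```

The actual wiring (passing the existing database path through to wpull) would still need to be done in libgrabsite/main.py.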
Some details in https://github.com/ludios/grab-site/issues/57#issuecomment-164185837