ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Grab site is not actually compatible with python 3.8 #229

Open cenodis opened 1 year ago

cenodis commented 1 year ago

Update: These problems only occur with Python 3.8. Using grab-site with Python 3.7.16 on the same system fixes them.

I recently upgraded my system to Ubuntu LTS 22.04.2. grab-site now prints a few messages on the console relating to manhole, as well as a warning about an HTTP session. I do not remember either of those appearing before the update. After these messages it outputs nothing. Similarly, no active scrape is shown on the gs-serv dashboard.

Looking at the filesystem, it seems that wpull is still running and writing to the WARC file. But no progress is visible on the console or in gs-serv.
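
One way to confirm that the WARC file is still being written is to sample its size a few times. The following is a minimal sketch, not part of grab-site or wpull; the function name `file_is_growing`, its parameters, and the example path are hypothetical and only illustrate the check described above.

```python
import os
import time


def file_is_growing(path, interval=5.0, checks=3, sleep=time.sleep):
    """Return True if the file at `path` gets larger between two consecutive samples.

    `interval` is the pause between samples in seconds, `checks` is the number
    of samples taken after the initial one. `sleep` is injectable so the
    function can be exercised without actually waiting.
    """
    last_size = os.path.getsize(path)
    for _ in range(checks):
        sleep(interval)
        current_size = os.path.getsize(path)
        if current_size > last_size:
            return True
        last_size = current_size
    return False


# Example (hypothetical path, adjust to the crawl directory in question):
# file_is_growing("/path/to/crawl-dir/output.warc.gz")
```

A result of False only means the file did not grow during the sampling window; a slow or stalled crawl would look the same, so a longer interval may be needed.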

I have already tried resetting the python venv and reinstalling grab-site and its dependencies, following the exact instructions in the README. This did not fix the problem.

grab-site output

Manhole[202440:1685302622.5931]: Patched <built-in function fork> and <built-in function forkpty>.
Manhole[202440:1685302622.5941]: Manhole UDS path: /tmp/manhole-202440
Manhole[202440:1685302622.5941]: Waiting for new connection (in pid:202440) ...
/home/ubuntu/gs-venv/lib/python3.8/site-packages/wpull/protocol/http/client.py:185: UserWarning: HTTP session did not complete.
  warnings.warn(_('HTTP session did not complete.'))

gs-serv output

grab-site server listening on 0.0.0.0:29000
dropping connection to peer tcp4:127.0.0.1:32986 with abort=False: None
tcp4:127.0.0.1:32986 disconnected
tcp4:127.0.0.1:33026 connected
tcp4:127.0.0.1:33026 is dashboarding with Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36
dropping connection to peer tcp4:127.0.0.1:32996 with abort=False: None
tcp4:127.0.0.1:32996 disconnected
dropping connection to peer tcp4:127.0.0.1:33012 with abort=True: WebSocket opening handshake timeout (peer did not finish the opening handshake in time)
tcp4:127.0.0.1:33012 disconnected

TheTechRobo commented 1 year ago

I think I saw this issue after I upgraded to Debian Bullseye. I fixed it by using a Docker container: https://github.com/Nold360/docker-grab-site

cenodis commented 1 year ago

After a bit of experimentation I found out that the problem is Python 3.8. grab-site works perfectly fine with Python 3.7.16. This contradicts the README, which claims compatibility with 3.7 and 3.8. A simple "fix" would be to update the installation instructions to fall back a minor version, to 3.7. They already use pyenv anyway.
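
A quick way to check which interpreter a venv is actually running is to compare `sys.version_info` against the version that worked here. This is a minimal sketch based only on the observations in this thread (3.7.16 works, 3.8 does not); the `KNOWN_GOOD` set and the function name `is_known_good_python` are hypothetical and not part of grab-site.

```python
import sys

# Assumption: taken from this report only. 3.7 worked, 3.8 did not.
KNOWN_GOOD = {(3, 7)}


def is_known_good_python(version_info=None):
    """Return True if the (major, minor) version is one reported as working."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in KNOWN_GOOD


if __name__ == "__main__":
    if not is_known_good_python():
        # Warn on stderr so the message does not mix with normal output.
        print(
            "Warning: running Python %d.%d; this thread only reports 3.7 as working."
            % tuple(sys.version_info[:2]),
            file=sys.stderr,
        )
```

Run it with the Python binary inside the grab-site venv, so that it reports the interpreter grab-site itself would use.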