ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Add Dockerfile to simplify installation #93

Open notslang opened 7 years ago

notslang commented 7 years ago

I'm deploying a couple of instances of grab-site to a CoreOS cluster, so I made a Dockerfile... Hopefully this is a bit easier to use than pip/virtualenv. This uses the larger python:3.4-slim image (rather than python:3.4-alpine) because Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.

This PR still needs docs, so it's a work-in-progress right now.

After starting the container you can use the regular grab-site command via docker exec <container-name> grab-site <args and site url>
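As a convenience, the docker exec invocation above could be wrapped in a small shell function. This is only a sketch: the function name gs and the GS_CONTAINER variable are hypothetical names chosen for this example, not part of grab-site or the PR.

```shell
# Hypothetical wrapper: run grab-site inside an already-running container.
# GS_CONTAINER (an assumed variable name) must hold the container's name.
gs() {
    # "$@" forwards every argument (options and site URL) unchanged.
    docker exec "${GS_CONTAINER:?set GS_CONTAINER to the container name}" grab-site "$@"
}
```

With GS_CONTAINER set to the container's name, `gs <args and site url>` would then behave like the longer docker exec command.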

ivan commented 7 years ago

I haven't used Docker, so bear with me...

1) Why COPY to /app/ if you still subsequently do a pip3 install .? If you pip3 install ., then grab-site, gs-server, etc should be installed somewhere, right?

2) Can you make the script in .travis.yml test that this Dockerfile works? (Probably after all the existing stuff.)

Thanks for working on this!

notslang commented 7 years ago

No prob - you can think of a Docker container as a lightweight VM (like VirtualBox, but with better tooling and less overhead). The Dockerfile automates building/configuring the container and the COPY directive handles copying the code from your working directory into the container's file-system. Once the code is in the container (at /app) then we do pip3 install to get all the deps and set everything up.

This creates a fully isolated, reproducible installation of grab-site in a 200-300MB image. This image can be run on any host OS, including CoreOS where Python isn't even installed. Using Alpine as a base we could get this image down to 20-50MB, but that requires some modifications to py-lmdb.

As for testing, we can have https://hub.docker.com automatically rebuild the image whenever new code is pushed (see: https://docs.docker.com/docker-hub/builds/) and run Docker-based tests in Travis if you want: https://docs.travis-ci.com/user/docker/
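A Docker-based check in CI could be as small as building the image and running one command in it. The sketch below assumes the image's default command can be overridden on the docker run command line (that is, the Dockerfile uses CMD rather than a fixed ENTRYPOINT); the function name and the IMAGE_TAG variable are made up for this example.

```shell
# Hypothetical CI smoke test; IMAGE_TAG is an assumed name for the test image.
IMAGE_TAG="${IMAGE_TAG:-grab-site-ci}"

smoke_test_image() {
    # Build from the Dockerfile in the current directory.
    docker build -t "$IMAGE_TAG" . || return 1
    # --rm deletes the throwaway container once the command exits.
    docker run --rm "$IMAGE_TAG" grab-site --help >/dev/null || return 1
    echo "image $IMAGE_TAG built and grab-site --help succeeded"
}
```

Calling `smoke_test_image` at the end of the existing CI script would fail the build if either the image build or the installed grab-site entry point is broken.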

ivan commented 7 years ago

pip3 install . should install grab-site in addition to the dependencies, though. pip3 install puts things in /usr/local/bin while pip3 install --user puts things in ~/.local/bin, unless there's some extra configuration doing something else. Would it make sense to use the installed grab-site scripts in one of those paths rather than duplicate some pip functionality with the COPY lines?

ivan commented 7 years ago

Is there an issue filed somewhere for py-lmdb's failure to compile on Alpine Linux's gcc?

notslang commented 7 years ago

pip3 install . is being run within the context of the Docker container (not the host OS) so you need to COPY the files into the container for pip to work.
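To make the ordering concrete, here is a minimal sketch of a Dockerfile with that shape, printed from a shell function. It is not the PR's actual Dockerfile: the CMD line is an assumption, and a real build may need extra packages for compiling dependencies.

```shell
# Print a minimal Dockerfile sketch (illustrative only, not the PR's file).
dockerfile_sketch() {
    cat <<'EOF'
FROM python:3.4-slim
# The image's filesystem does not contain the source tree, so it has
# to be copied in from the build context first...
COPY . /app
WORKDIR /app
# ...before pip, which runs inside the image, can install from it.
RUN pip3 install .
# Assumed default command: start the dashboard.
CMD ["gs-server"]
EOF
}
```

Something like `dockerfile_sketch > Dockerfile` followed by `docker build -t grab-site .` would then build it from a checkout of the repository.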

ivan commented 7 years ago

Oh, that explains it :-)

notslang commented 7 years ago

There isn't an issue filed on https://github.com/dw/py-lmdb/issues yet.

igorbrigadir commented 7 years ago

> Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.

I haven't tried running grab-site, but it seems like installing py-lmdb works on python:3.4-alpine with this:

FROM python:3.4-alpine
RUN apk add --update build-base libffi-dev
RUN pip install lmdb

notslang commented 7 years ago

You're right about it working on Alpine - I was just missing libffi-dev. Now it's down to 112.4 MB (37 MB when compressed). Also, I added instructions to the README, so I'm going to remove the "[wip]" from this.

ivan commented 7 years ago

Thanks for the fixes.

I am currently somewhat busy and under-dockered. Can a grab-site user please give the Docker instructions a try and see if they work? (And let me know if you had to perform any other steps to make this a useful setup?)

igorbrigadir commented 7 years ago

I had Docker already, but it's worth linking to the installation instructions at https://docs.docker.com/engine/installation/

I tried it with:

Ubuntu: 12.04.5 LTS, x86_64, 3.8.0-44-generic
Docker: Docker version 1.7.1, build 786b29d

Ran (sudo for docker commands because I skipped this step https://docs.docker.com/engine/installation/linux/ubuntulinux/#/create-a-docker-group):

sudo docker pull slang800/grab-site
sudo docker run --detach -p 29000:29000 -v ~/grab-site-data:/data --name warcfactory slang800/grab-site

Web UI worked on http://localhost:29000/

sudo docker exec warcfactory grab-site --no-offsite-links http://xkcd.com/

Crawl finished successfully!

ivan commented 7 years ago

I tried this out, but couldn't find a way to attach a terminal to a docker exec -d process (or a docker exec process that has been ctrl-c'ed - note the ctrl-c is not passed to the child). The reason that you sometimes need a terminal attached to a grab-site process is to 1) see which URL is currently being grabbed (this information is not reported to the dashboard, only finished responses) and 2) look at segfaults and websocket connection problems that don't get reported to the dashboard either.

Would adding tmux to the container and using tmux work? (Note, tmux 2.1 is broken; 1.8 is a known-good version.) I just hope that docker exec tmux attach works. If this does work, the documentation should also be updated.
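If tmux were added to the image, the workflow might look like the sketch below. It is untested and assumes tmux is installed inside the container; the function names and the GS_CONTAINER variable are hypothetical.

```shell
# Hypothetical helpers; assume tmux is installed inside the container.
# GS_CONTAINER (an assumed variable name) must hold the container's name.

# Start a crawl in a detached tmux session. The command is passed to tmux
# as one string, since older tmux versions expect a single shell-command.
gs_start() {
    session="$1"; shift
    docker exec "${GS_CONTAINER:?set GS_CONTAINER}" \
        tmux new-session -d -s "$session" "grab-site $*"
}

# Attach to that session later; -i -t gives tmux an interactive terminal.
gs_attach() {
    docker exec -i -t "${GS_CONTAINER:?set GS_CONTAINER}" \
        tmux attach-session -t "$1"
}
```

For example, `gs_start xkcd --no-offsite-links http://xkcd.com/` followed by `gs_attach xkcd`. Because tmux hands the string to a shell, URLs containing shell metacharacters would need extra quoting.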

ivan commented 7 years ago

Also, running gs-server as PID 1 seems undesirable because if it were killed, it would kill all the grab-site processes as well. grab-site processes are designed to stay running even if gs-server crashes or is taken down for an upgrade. Maybe gs-server (and each grab-site) should run in its own container instead.

notslang commented 7 years ago

> Maybe gs-server (and each grab-site) should run in its own container instead.

Splitting up the server and client would make sense, especially since you could then run them on different machines, but I should probably do that as a separate PR, since I'll need to look into how they communicate.

> Also, running gs-server as PID 1 seems undesirable because if it were killed, it would kill all the grab-site processes as well.

Would using dumb-init as PID 1 allow the orphaned grab-site processes to keep running in the case where gs-server dies? If so, that would be a decent temporary fix.
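For anyone wanting to try this, a sketch of starting the container with dumb-init in front of gs-server follows. It assumes dumb-init is installed and on PATH inside the image, which the image discussed here may not provide; the function name is hypothetical.

```shell
# Hypothetical: start the dashboard so that dumb-init, not gs-server, is PID 1.
# Assumes dumb-init is installed inside the image.
run_with_init() {
    image="$1"; name="$2"
    docker run --detach -p 29000:29000 -v "$HOME/grab-site-data:/data" \
        --name "$name" "$image" dumb-init gs-server
}
```

One thing to verify before relying on it: dumb-init's README describes it exiting once its direct child exits, and if that holds here the container (and the grab-site processes in it) would still stop when gs-server dies.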

> I tried this out, but couldn't find a way to attach a terminal to a docker exec -d process (or a docker exec process that has been ctrl-c'ed - note the ctrl-c is not passed to the child).

You could run docker exec without detaching, but this whole setup could be simplified by splitting up the processes into their own containers... Then you'd be able to use docker logs and pass signals in a sane manner.
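A rough sketch of what a one-container-per-process layout could look like is below. Several things in it are assumptions rather than anything confirmed in this thread: the network name gs-net, the function names, that the image's default command can be overridden, and that grab-site locates the dashboard through GRAB_SITE_HOST and GRAB_SITE_PORT environment variables (check the grab-site README before relying on that).

```shell
# Hypothetical layout: one container for gs-server, one per crawl.
IMAGE="${IMAGE:-slang800/grab-site}"

# Dashboard in its own container on a user-defined network (assumed name gs-net).
start_dashboard() {
    docker network create gs-net
    docker run --detach --name gs-server --network gs-net \
        -p 29000:29000 "$IMAGE" gs-server
}

# One container per crawl. GRAB_SITE_HOST/GRAB_SITE_PORT are assumed to be
# how grab-site is told where the dashboard is.
start_crawl() {
    name="$1"; shift
    docker run --detach --name "$name" --network gs-net \
        -e GRAB_SITE_HOST=gs-server -e GRAB_SITE_PORT=29000 \
        -v "$HOME/grab-site-data:/data" "$IMAGE" grab-site "$@"
}

# Follow a crawl's terminal output.
crawl_logs() { docker logs --follow "$1"; }

# Send SIGINT to the crawl's main process (the equivalent of ctrl-c).
crawl_stop() { docker kill --signal=INT "$1"; }
```

With this shape, `start_crawl xkcd --no-offsite-links http://xkcd.com/` gives each crawl its own name for `crawl_logs xkcd` and `crawl_stop xkcd`, and stopping the gs-server container does not stop the crawl containers.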

semente commented 5 years ago

hey people! what is the status of this PR? I could give a hand.

ivan commented 5 years ago

For now, I would like someone else to be the Dockerized grab-site upstream. I don't use Docker and I don't have the resources to 1) figure out if a PR is taking the right approach with Dockerization (which base? which init? one container per grab-site? how to integrate tmux, if needed?) 2) double my manual testing matrix.

So, please, have at it and promote your fork/Dockerfile here. If you (or someone else) stay interested in maintaining and testing it, I might take a PR in the future.

notslang commented 5 years ago

> what is the status of this PR?

@semente I've been using it pretty often for my own projects, and it works fine, but I haven't rebased it since 2016. I'll try rebasing and pushing a new image to Docker Hub.

> For now, I would like someone else to be the Dockerized grab-site upstream

Ok, I'll keep an image updated over here: https://cloud.docker.com/u/slang800/repository/docker/slang800/grab-site

gabefair commented 4 years ago

@notslang Thank you for all this work. Can you confirm that your fork still works fine? I am curious if you ran into any issues or discovered anything of note.

818S commented 3 years ago

> https://cloud.docker.com/u/slang800/repository/docker/slang800/grab-site

It says it was updated 3 years ago. Any plans to update it?

Or any plans to officially ship a Dockerfile for this?

brandongalbraith commented 2 years ago

FYI this third party grab-site Dockerfile currently works as of this comment being posted: https://github.com/Nold360/docker-grab-site.

https://registry.hub.docker.com/r/nold360/grab-site/