ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Other
1.31k stars 129 forks source link

Full Docker support #195

Open acrois opened 2 years ago

acrois commented 2 years ago

Based on the initial work done by Fusl (credits preserved), the Dockerfile has been updated and improved as well as all its usage documented in the README. It strives to address all of the usability concerns mentioned amongst several other issues on this topic.

Please review the usage to begin running Docker grab-site on your machine

This new image is based on Alpine 3.13 and Python 3.7. The advantage with this new addition is that no previous support exists, and this breaks no existing installations only broadens them. Maintaining this will be easy and can be integrated into Travis CI. If you wish to run multiple grab-site instances, you may want to mount a different volume to your host machine for each container.

One limitation, which is unavoidable unless we make some slight changes to the existing installation process, is that external dependencies (Python libs defined in setup.py) are not able to be cached so any changes to the code will cause the dependencies to be downloaded fresh on a subsequent build.

Ideally, we would download these dependencies as a separate build step and then copy the code over and run the setup. It may be as simple as building the image and then running pip freeze > requirements.txt and then pip install but further testing would be required to determine what paths setup.py standalone vs pip install would use and didn't seem important enough to warrant the complexity of figuring it out.

There is no Docker Registry usage. If this is desired by the community in the future, it can be added simply by additional steps using Travis CI if we can integrate it and have it automatically deploy to GitHub Packages and/or Docker Hub from successful Travis CI build.

Finally, I will provide support to maintain and improve this image and its usage/deployment in the future.

brandongalbraith commented 2 years ago

@acrois THANK YOU for your work on this πŸ‘πŸ‘πŸ‘

ivan commented 2 years ago

Since I am somewhat occupied at the moment: would a grab-site user kindly test this and all of the new README instructions? Feel free to post feedback or code review.

acrois commented 2 years ago

In order to simplify the data storage and improve portability, I have made several improvements to the design of the Docker environment and taking advantage of some default behaviors, can completely remove the need to specify the "--dir" flag at runtime by default.

I was working on a side project, related to grab-site and I discovered a pain when I was trying to automate running it in a cluster.

I ended up isolating all grab-site related processes to their own network, for increased security.

grab-site stores data in whatever current working directory you are in, so I simply added grab-site's install dir to the PATH so that references to scripts need "./" to refer to them. This allows us to simplify specifying one argument for persistence -v ./data:/data:rw.

Quick start usage as well as changes regarding this change can be found in the updated README.

Let me know if this is reproducible for you. I've run it on both Windows and Debian Docker hosts.

I still plan on updating more information on what each parameter in the docker run does but it won't be functionally changing at all.

ivan commented 2 years ago

Sorry for the delay, I had a lot of things going on.

I followed the updated instructions (though I substituted grab-network with gs-network) on NixOS and got:

at@dev:/home/at/code/archiving/grab-site:docker-pr-2# docker run --net=gs-network --rm -d -e GRAB_SITE_HOST=gs-server -v ./data:/data:rw grab-site:latest grab-site https://example.com/
docker: Error response from daemon: create ./data: "./data" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path.
See 'docker run --help'.

at@dev:/home/at/code/archiving/grab-site:docker-pr-2# docker --version
Docker version 20.10.7, build v20.10.7

Did I do something wrong? Do I need to make some directory exist first or create some volume?

In general, if we are to support Docker, I would really like to have instructions that would work for a first-time Docker user, with everything made explicit. Otherwise people will generally just give up.

ivan commented 2 years ago

Ah, I see # Ensure that ./data exists and container has write privileges on volume, so perhaps it should say

mkdir -p data

and then -v $PWD/data:/data:rw in the docker run args.

ivan commented 2 years ago

When I start a grab-site container as per the Docker instructions, my containers appear to die before they do anything. journalctl -b -u docker shows

Aug 28 01:42:45 dev f4baf8e0cffb[188335]: + exec grab-site https://example.com/
Aug 28 01:42:46 dev f4baf8e0cffb[188335]: Traceback (most recent call last):
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/app/grab-site", line 4, in <module>
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     main.main()
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/home/grab-site/.local/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     return self.main(*args, **kwargs)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/home/grab-site/.local/lib/python3.7/site-packages/click/core.py", line 1062, in main
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     rv = self.invoke(ctx)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/home/grab-site/.local/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     return ctx.invoke(self.callback, **ctx.params)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/home/grab-site/.local/lib/python3.7/site-packages/click/core.py", line 763, in invoke
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     return __callback(*args, **kwargs)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/app/libgrabsite/main.py", line 293, in main
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     os.makedirs(working_dir)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/usr/local/lib/python3.7/os.py", line 223, in makedirs
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     mkdir(name, mode)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]: PermissionError: [Errno 13] Permission denied: '/data/example.com-2021-08-28-d9c5308a'

We probably need instructions for "Ensure that [...] container has write privileges on volume" as well if we are to support this.

chmod 0777 data appears to work but I would like to have the proper solution.

ivan commented 2 years ago

regarding the permissions, I am told

"Fusl's Dockerfile simply ran grab-site as root inside the container, by the way. Not ideal, but since it's isolated to that container, probably fine? I'm not a Docker expert though.

The Dockerfile in the PR creates a user and group inside the container, which is cleaner, but then that uid/gid needs to have write access in the mountpoint as well."

Nothing4You commented 2 years ago

to my knowledge there's basically 2 approaches for this.

  1. launch as root, have an init script that performs initial setup stuff like chowning work directrories and afterwards dropping to a non-root user
  2. set a static uid the container runs as in the Dockerfile

you can also combine these as for example https://www.linuxserver.io/ does, the init script checks if it's running as root and it has env vars for dropping privilege, then it chowns the working directory and drops to the final user.

effectively there's also a third option that can be used regardless of the previous configuration, however, depending on which of the previous configurations are chosen this can be tricky to implement. with the first option the user just passes uids of the user to run as env var, but you can also launch containers directly with specified uid/gid.

if you however specify VOLUMEs in the Dockerfile they are chowned (except for bind mounts) on creation to the user that owned the base folder at that time. while this works just fine for bind mounts like -v ./data:/data:rw as you can just chown them manually if desired this will not work for volume mounts. with volumes you typically don't have an easy way to change permissions/ownership in them.

with bind mounts you also won't get the files and folders initialized inside the volume if there are any, this only works with volumes.

also to note is that some applications may run into issues if they run as a uid that does not map to a user (i.e. no matching entry in /etc/passwd).

acrois commented 2 years ago

@ivan @Nothing4You

Thank you for the review, I have been a little busy so I just wanted to let you know I'm working on addressing the comments soon. I'll tag again when I've got some answers.

JustAnotherArchivist commented 2 years ago

One more comment: the paths in --input-file, --import-ignores, --dir, --finished-warc-dir, and --wpull-args obviously refer to paths in the container. I don't think there is a way to avoid that, but it should be documented how to work around it by using -v to map these to files in the container and then specifying the container paths in the command.

raspher commented 1 year ago

Please provide information about selinux bind mounting to avoid spam of unneeded issues (just use Z instead of rw please)

https://stackoverflow.com/questions/24288616/permission-denied-on-accessing-host-directory-in-docker

Ghost-Terms commented 1 year ago

Is this still being worked on?

acrois commented 1 year ago

@ImportTaste It should be complete, I just pulled the latest changes from here into my fork/branch so your build will be against the latest version of the code.

Please let me know if you follow the usage instructions and it does not work so I can improve it.


TL;DR: I'd call this the best starting point for anyone trying to run grab-site and gs-server on Docker. It might not be 100% complete but I've been running it ever since I submitted the PR on several VMs (AWS, Azure, GCP, VirtualBox, Proxmox), so it works on my machines! πŸ˜„


It would be great if this could be merged and then a separate issue can be created with an accurate report of how to reproduce these different environments. That way the foundational support isn't just hanging out forever on my fork.

That being said, I have recently (last 9mo) documented setting up user/permissions between the host and container on a bone standard Debian install with an additional user. I believe that could address the SELinux user host<->container mismatch and allow it to run on those systems with whatever flag set (rw vs Z) but I don't have the time to try and figure out how to reproduce everyone else's configuration between selinux, nix/os, and other setups from scratch.

I'd love to do it, I just need some help. So, instead of making assumptions, I'd appreciate a script and target OS that I can use to help me set up everyone's environment(s) locally in order to reproduce this last remaining issue to prepare a proper solution that is compatible with all configurations.

xaiki commented 7 months ago

this seems to require pip install --no-binary lxml to work.

acrois commented 5 months ago

pip install --no-binary lxml

https://github.com/acrois/grab-site/blob/feature/dockerfile/Dockerfile#L47-L48

I know that pip is not ideal but this is supposed to cover the app dependency scenario already. If it is necessary it should be included as part of the list of dependencies, perhaps here in setup.py:

https://github.com/acrois/grab-site/blob/feature/dockerfile/setup.py#L12-L20


image

I added a build script that will take care of release needs.

Please send me a reproduction using a command that leverages the new images.

docker pull ghcr.io/acrois/grab-site:sha-3ee0d2f

Perhaps we need a test case added that complements the build.yaml workflow?

Yes or yes?