broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Spark Docker support: avoid need for looking up username in /etc/passwd #4626

Open chapmanb opened 6 years ago

chapmanb commented 6 years ago

Hi all; I'm running into an issue when running GATK Spark-based tools inside of Docker containers. Spark tries to look up the current username as part of initialization:

https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-sparkcontext-creating-instance-internals.adoc#-utilsgetcurrentusername

which fails in Docker containers where the user ID is not present in /etc/passwd. This SO thread has a pretty good summary of the problem along with some hacky workarounds:

https://stackoverflow.com/questions/45198252/apache-spark-standalone-for-anonymous-uid-without-user-name

Is it possible to avoid having Spark need to log in via a username? Do you have any other tips/clues for working around this issue when running GATK Spark inside container environments?
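For reference, a minimal way to reproduce the failure (the image name, UID, and file paths here are only illustrative) is to run any GATK Spark tool as a UID that has no passwd entry inside the container:

```bash
# Run a GATK Spark tool as an arbitrary UID with no matching /etc/passwd entry
# in the container; Spark's getCurrentUserName() lookup then fails during
# initialization.
docker run --rm -u 1234:1234 \
    -v "$(pwd)":/data \
    my-gatk-image \
    gatk MarkDuplicatesSpark -I /data/input.bam -O /data/marked.bam
```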

droazen commented 6 years ago

@chapmanb We haven't encountered this issue before, since we always run as root in our docker containers, which is conveniently always present in /etc/passwd. Is sudoing to root an option for you, or do you need to run as another user within your container? Are you using the official GATK docker image from https://hub.docker.com/r/broadinstitute/gatk/, or your own?

chapmanb commented 6 years ago

Thanks much for this. Unfortunately running as root isn't an option, since external CWL runners (like bunny, Toil, cwltool, and hopefully Cromwell soon) decide which username to use and try to mirror the external user inside the container so that permissions on the output files match. People also trust the tool more when it isn't trying to run as root (hence also not wanting to mess with /etc/passwd in the Docker container to fix Spark).

This was using the bcbio-vc Docker image (https://github.com/bcbio/bcbio_docker#docker-images) with gatk installed via bioconda, but I don't think it's image-specific unless you're specifically doing something in your images to work around the problem, which it doesn't sound like you are.

Is there any chance of tweaking Spark to make it less picky about, or dependent on, the user? I'm not enough of a Spark expert to know whether this can be worked around in a reasonable way.

droazen commented 6 years ago

@chapmanb Do you have any control over the docker run command? If you do, you could mount /etc/passwd and /etc/shadow as external resources into the container (as described here, for example: https://stackoverflow.com/questions/33013444/can-not-add-new-user-in-docker-container-with-mounted-etc-passwd-and-etc-shado). This seems slightly less awful than the workarounds you linked to above.
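Something along these lines, for example (an untested sketch; the tool and paths are just placeholders):

```bash
# Mount the host's passwd/group files read-only and run as the calling user,
# so the UID inside the container resolves to a real username.
docker run --rm \
    -u "$(id -u)":"$(id -g)" \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -v "$(pwd)":/data \
    broadinstitute/gatk \
    gatk MarkDuplicatesSpark -I /data/input.bam -O /data/marked.bam
```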

It seems to me, however, that running as a user not present in /etc/passwd could cause quite a few things to fail, not just Spark. Perhaps you should file a bug report against these CWL runners? They should really be configuring the runtime environment in the container properly for the username they force you to run under! If they are trying to mirror the external user within the container, they should probably mount the external passwd file into the container on your behalf when executing docker run.

@tomwhite Do you happen to know of a workaround on the Spark side to prevent it from doing a username lookup?

tomwhite commented 6 years ago

@chapmanb As far as I know, Spark (and Hadoop) needs the username to submit a job. Can you set the SPARK_USER environment variable to the user you would like to use (even if it is not in /etc/passwd)? You should be able to pass it to Docker with the -e option. (I haven't tried this to see if it works.)
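Something like the following, perhaps (untested; the image, tool, and paths are just examples):

```bash
# Pass SPARK_USER into the container so Spark has a username to use,
# even if the UID has no /etc/passwd entry inside the container.
docker run --rm \
    -u "$(id -u)":"$(id -g)" \
    -e SPARK_USER="$(id -un)" \
    -v "$(pwd)":/data \
    broadinstitute/gatk \
    gatk MarkDuplicatesSpark -I /data/input.bam -O /data/marked.bam
```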

chapmanb commented 6 years ago

David, thanks for the CWL suggestion. As far as I know, most CWL runners unfortunately don't attempt to edit or mount the internal container /etc/passwd. They do try to match the external user outside of Docker to avoid file permission issues. Does Cromwell deal with this problem? We're actively looking to make more use of Cromwell for CWL runs. If we could make that happen, it would resolve a lot of issues and I could leave my workaround in place for other non-conforming callers.

Tom, that is a great suggestion and I thought it would work as well, but we already do this (https://github.com/bcbio/bcbio-nextgen/blob/bd03e259877d410045468046a949f6b9724605c5/bcbio/broad/__init__.py#L152) and Spark/Hadoop still wants to look up the user in /etc/passwd even when it's missing. If there is a way to skip needing that entry to be present, I'm happy to tweak things on our side as well.
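For illustration, the variable is visible inside the container, but a native username lookup for the UID still has nothing to resolve against, which appears to be what Spark/Hadoop trips over (the UID and image here are just examples):

```bash
# SPARK_USER is set, but the UID still has no /etc/passwd entry,
# so a native lookup like whoami fails inside the container.
docker run --rm -u 1234:1234 -e SPARK_USER=spark broadinstitute/gatk \
    bash -c 'echo "SPARK_USER=$SPARK_USER"; whoami || echo "no passwd entry for UID $(id -u)"'
```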

Thanks so much for all this discussion and suggestions.

jjfarrell commented 6 years ago

Would Singularity be a possible solution to this issue? It deals with the root-access issue when running container images on HPC clusters, and there are tools to pull a Docker image into a Singularity image.

https://singularity.lbl.gov/about https://github.com/singularityware/singularity

chapmanb commented 6 years ago

Thanks for the thoughts. Singularity is definitely awesome and I'm hoping to support it as an alternative to Docker for local HPC clusters, where we won't require root-equivalent permissions to run, so it helps avoid some of the potential external permission errors by offering a cleaner path to running. Unfortunately it doesn't deal with the underlying issue of needing to map users inside the containers so that Spark is happy with them. Having something more lightweight than updating users in the internal /etc/passwd would also help with potential issues on other container engines (Singularity, rkt).

jjfarrell commented 6 years ago

@chapmanb Singularity's default configuration has a line "config passwd = yes", which creates a user entry in /etc/passwd automatically. So if I understand the issue correctly, Spark would automatically find the user running the container in the /etc/passwd file.
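For example, something like this should then just work, since the calling user's entry gets added automatically (an untested sketch; the tool and file names are only illustrative):

```bash
# Singularity can pull and run the Docker image directly; with
# "config passwd = yes" (the default in singularity.conf) the calling user
# gets an /etc/passwd entry inside the container, so the Spark username
# lookup should succeed.
singularity exec docker://broadinstitute/gatk \
    gatk MarkDuplicatesSpark -I input.bam -O marked.bam
```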

chapmanb commented 6 years ago

Wow, brilliant, thanks for the heads-up on this feature. That sounds perfect and would definitely sort us out on systems using Singularity. Thanks again for letting me know; this will be much less painful with that approach.

droazen commented 6 years ago

@chapmanb I would file an issue in the cromwell repo and see if they could add a similar feature to auto-populate /etc/passwd within the docker container.