Open chapmanb opened 6 years ago
@chapmanb We haven't encountered this issue before, since we always run as root in our Docker containers, and root is conveniently always present in /etc/passwd. Is sudoing to root an option for you, or do you need to run as another user within your container? Are you using the official GATK Docker image from https://hub.docker.com/r/broadinstitute/gatk/, or your own?
Thanks much for this. Unfortunately running as root isn't an option, since external CWL runners (like bunny, Toil, cwltool and hopefully Cromwell soon) decide which username to use, and they try to match the user inside the container to the external user so that permissions on the output files line up. People also trust a tool more when it's not trying to run as root (hence my reluctance to mess with /etc/passwd in the Docker container to fix Spark as well).
This was using the bcbio-vc Docker image (https://github.com/bcbio/bcbio_docker#docker-images) with GATK installed via bioconda, but I don't think it is image specific, unless you're doing something in your images to work around the problem, which it doesn't sound like you are.
Is there any chance of tweaking Spark to make it less picky about, or dependent on, the username? I'm not enough of a Spark expert to know whether this can be worked around in a reasonable way.
@chapmanb Do you have any control over the docker run command? If you do, you could mount /etc/passwd and /etc/shadow as external resources into the container (as described here, for example: https://stackoverflow.com/questions/33013444/can-not-add-new-user-in-docker-container-with-mounted-etc-passwd-and-etc-shado). This seems slightly less awful than the workarounds you linked to above.
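A minimal sketch of that bind-mount approach (the image name and tool invocation are illustrative assumptions, not taken from this thread):

```shell
# Run the container as the host user, mounting the host account database
# read-only so the in-container uid resolves to a real username.
# Image and command are placeholders for whatever you actually run.
docker run --rm \
  -u "$(id -u):$(id -g)" \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/shadow:/etc/shadow:ro \
  broadinstitute/gatk \
  gatk MarkDuplicatesSpark --help
```

The read-only (:ro) mounts mean the container can resolve the host uid without being able to modify the host's account files.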
It seems to me, however, that running as a user not present in /etc/passwd could cause quite a few things to fail, not just Spark. Perhaps you should file a bug report against these CWL runners? They should really be configuring the runtime environment in the container properly for the username they force you to run under! If they are trying to mirror the external user within the container, they should probably mount the external passwd file into the container on your behalf when executing docker run.
@tomwhite Do you happen to know of a workaround on the Spark side to prevent it from doing a username lookup?
@chapmanb As far as I know, Spark (and Hadoop) needs the user name to submit a job. Can you set the SPARK_USER environment variable to the user you would like to use (even if it is not in /etc/passwd)? You should be able to pass it to Docker with the -e option. (I haven't tried this to see if it works.)
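Passing the variable through docker run would look something like this (the image and username are placeholder assumptions, not from this thread):

```shell
# Hypothetical invocation: forward SPARK_USER into the container with -e
# so Spark can pick up a username without an /etc/passwd lookup.
docker run --rm \
  -u "$(id -u):$(id -g)" \
  -e SPARK_USER=spark_user \
  broadinstitute/gatk \
  gatk MarkDuplicatesSpark --help
```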
David, thanks for the CWL suggestion. As far as I know, most CWL runners unfortunately don't attempt to edit or mount the container's internal /etc/passwd. They do try to match the external user outside of Docker to avoid file permission issues. Does Cromwell deal with this problem? We're actively looking to make more use of Cromwell for CWL runs; if it handles this, that would resolve a lot of issues and I could leave my workaround in place for other non-conforming callers.
Tom, that is a great suggestion and I thought it would work as well, but we already do this (https://github.com/bcbio/bcbio-nextgen/blob/bd03e259877d410045468046a949f6b9724605c5/bcbio/broad/__init__.py#L152) and Spark/Hadoop still tries to look up the user in /etc/passwd even when SPARK_USER is set. If there is a way to skip that lookup, I'm happy to tweak our setup as well.
Thanks so much for all this discussion and suggestions.
Would Singularity be a possible solution to this issue? It deals with the root-permission problem of running container images on HPC clusters, and there are tools to pull a Docker image into a Singularity image.
https://singularity.lbl.gov/about https://github.com/singularityware/singularity
Thanks for the thoughts. Singularity is definitely awesome, and I'm hoping to support it as an alternative to Docker on local HPC clusters, where we won't require equivalent root permissions to run. It helps avoid some of the potential external permission errors by providing a potentially cleaner path to running. Unfortunately it doesn't deal with the underlying issue of needing to map users inside the containers so that Spark is happy with them. Something more lightweight than updating users in the internal /etc/passwd would also help with potential issues on other container engines (Singularity, rkt).
@chapmanb Singularity's default configuration has the line "config passwd = yes", which automatically creates an entry for the calling user in the container's /etc/passwd. So if I understand the issue correctly, Spark would automatically find the user running the container in the /etc/passwd file.
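For reference, the option lives in singularity.conf (the file path and surrounding comment are assumptions; only the quoted option comes from the comment above):

```
# /etc/singularity/singularity.conf (excerpt; path assumed)
# When enabled, Singularity creates a passwd entry for the calling
# user inside the container at runtime:
config passwd = yes
```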
Wow, brilliant, thanks for the heads up on this feature. That sounds perfect and would definitely sort us out on systems using Singularity. Thanks again for letting me know; this will be much less painful with that approach.
@chapmanb I would file an issue in the Cromwell repo and see if they could add a similar feature to auto-populate /etc/passwd within the Docker container.
Hi all; I'm running into an issue when running GATK Spark-based tools inside Docker containers. Spark tries to look up the current username as part of initialization: https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-sparkcontext-creating-instance-internals.adoc#-utilsgetcurrentusername which fails in Docker containers where the user ID is not present in /etc/passwd. This SO thread has a pretty good summary of the problem along with some hacky workarounds: https://stackoverflow.com/questions/45198252/apache-spark-standalone-for-anonymous-uid-without-user-name

Is it possible to avoid needing Spark to log in via username? Do you have any other tips/clues to work around this issue when running GATK Spark inside container environments?
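The failure mode is easy to reproduce outside of Spark: the lookup is essentially a uid-to-name resolution against the account database, which errors out for an unmapped uid. A small sketch (the uid value is an arbitrary assumption chosen to have no passwd entry):

```python
import pwd

def current_username(uid: int) -> str:
    """Resolve a numeric uid to a username via the account database
    (/etc/passwd or NSS) -- the same style of lookup Spark performs
    during SparkContext initialization."""
    return pwd.getpwuid(uid).pw_name

# Inside a container started with `docker run -u <host-uid>` where that
# uid has no /etc/passwd entry, the resolution raises KeyError:
try:
    current_username(2_000_000_000)  # arbitrary uid assumed unmapped
except KeyError:
    print("no passwd entry for uid; Spark startup fails the same way")
```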