apptainer / singularity

Singularity has been renamed to Apptainer as part of the project's move to the Linux Foundation. This repository has been preserved as a snapshot taken right before the changes.
https://github.com/apptainer/apptainer

Long command causes "Failed to read JSON configuration from socket: Bad file descriptor" error #2428

Closed chrarnold closed 5 years ago

chrarnold commented 5 years ago

Version of Singularity:

3.0.1-73.ga50b5b3.el7

Expected behavior

The container runs the command. This particular container and Singularity work fine on our system in general; the failure occurs only for the example described here.

Actual behavior

An error occurs: Failed to read JSON configuration from socket: Bad file descriptor

Steps to reproduce behavior

I believe this issue happens because of the extremely long command. In my example, the command that fails reproducibly with the error message above is over 30,000 characters long (essentially a list of hundreds of file paths that collectively form one command).
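For reference, a hypothetical minimal way to generate a comparably long command line (the image path is a placeholder; this mirrors the stress script cclerget posts later in the thread):

big=$(perl -e 'print "A" x 30000')                # single ~30,000-character argument
singularity exec -B "$big" /path/to/image.sif id  # placeholder image path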

The error happens here (see line 758):

if ( (ret = read(fds[0].fd, json_stdin, config.jsonConfSize)) != config.jsonConfSize ) {
    singularity_message(ERROR, "Failed to read JSON configuration from socket: %s\n", strerror(errno));
    exit(1);
}
cclerget commented 5 years ago

Hi @chrarnold, I'm trying to reproduce it naively with singularity exec -B $(perl -e 'print "A" x 100000') image; above 65,536 characters singularity still blocks on the socket write. Still investigating.

chrarnold commented 5 years ago

Thanks for checking so quickly. I am curious what you find and whether it is really that issue. Nobody ever seems to have reported the error I am getting, and it happens only for this case, so it must be related to the specific command. Since the only "special" characteristic is its length, I don't know what else it could be right now.

cclerget commented 5 years ago

@chrarnold Here is what I found after spending hours on this one. The error means the Go process died and, in doing so, closed all file descriptors shared with the C parent process (CLONE_FILES during fork). This is not supposed to happen, because the file descriptor used by the C process is never explicitly closed or even touched by the Go process. To rule out other causes, I prevented the Go process from exiting and killed it from the parent process only after the data was received (ensuring the file descriptors are not closed prematurely), and I also removed the call that sets the parent-death signal to exclude a signal race condition; I got the same results with the script below.

This issue appears in a seemingly random manner under heavy CPU load. Here is the script I used to stress the CPU while running singularity with very long command-line arguments:

#!/bin/bash

image="/home/ced/bundles/shell/rootfs"
cpus=$(nproc)
workers=$(($cpus*$cpus))
stress --cpu $workers --timeout 300 &
big=$(perl -e 'print "A"x60000')

echo 3|sudo tee /proc/sys/vm/drop_caches

sleep 5

for i in $(seq 1 100); do
    singularity exec -B $big $image id 2>/dev/null
    ret=$?
    echo -en "run $i: "
    if [ $ret -eq 1 ]; then
        echo "failure"
    else
        echo "ok"
    fi
done

killall stress

When "failure" is displayed, it means Failed to read JSON configuration from socket occurred. With the current tree, the failure rate with this script (run multiple times) is between 35 and 50 out of 100; with this fix in my branch, https://github.com/cclerget/singularity/commit/59d1f235e162c1b71248f0cd97cd7c7c955f50f4#diff-43c7705401fa82763da9a77ad2c7b0b2R47, the failure rate drops to about 10-22 (also run multiple times).

The fix forces the socket write to happen from the Go main thread. The failure rate increases when the write is not done from the Go main thread or when GOMAXPROCS is not equal to one (with or without the fix), so it looks like a threading issue. I wonder whether it is related to systemd cgroups (even though no restriction is set by systemd), or whether it is a Go thread/runtime bug or a kernel bug.
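As a side note, the Go runtime also honours a GOMAXPROCS value set in the environment, so one hypothetical way to experiment with the observation above (purely a test, not a fix; $big and $image are the variables from the script above) is:

GOMAXPROCS=1 singularity exec -B "$big" "$image" id   # GOMAXPROCS=1 is the setting reported as less failure-prone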

So, a few questions:

Reproduce with:

chrarnold commented 5 years ago

Hi, wow, that looks like a nasty bug... Great job digging so deep into it; I hope we can identify the real cause. I spoke to the admins; here is the information for the machine where the error occurs reproducibly (I tested this 2-3 times and it crashed in every case - from your description it sounds more like it sometimes works and sometimes does not, just as a side note):


$ cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core) 

$ uname -a
Linux [hidden] 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ go
-bash: go: command not found

Also, it is systemd managed, yes.

Until this is fixed, do you suggest using your branch in the hope that it works, or is there an alternative solution you can think of?

Thank you!

cclerget commented 5 years ago

@chrarnold You can give it a try with my branch, and if that works, I will submit a PR to merge it into the master branch.

mboisson commented 5 years ago

Hi. I am running into this issue as well, but not with a long command - with a large environment in the host's shell session.

$ /opt/software/singularity-3.0/bin/singularity shell --nv pytorch-18.01-py3.simg
WARNING: ldconfig execution error: exit status 1
INFO:    Unable to capture nvidia bind points: exit status 1
Singularity pytorch-18.01-py3.simg:~> exit
exit
$ env | wc
    160     189   14362
$ export VAR=somevalue$VAR$VAR$VAR
$ env | wc
    160     189   14425
$ /opt/software/singularity-3.0/bin/singularity shell --nv pytorch-18.01-py3.simg
WARNING: ldconfig execution error: exit status 1
INFO:    Unable to capture nvidia bind points: exit status 1
ERROR  : Failed to read JSON configuration from socket: Bad file descriptor

The $VAR environment variable obviously does not have anything to do with any logic; it is just there to increase the size of the environment.

titansmc commented 5 years ago

Hi @cclerget, how should I build the RPM package to test it? I am getting some errors when building it as per the singularity instructions. Thanks.

jscook2345 commented 5 years ago

@titansmc

As of version 3.0.1, you can just run ./mconfig && make -C builddir rpm to build an RPM.

After you check out @cclerget's branch, you should then be able to build the RPM with the command above. Let me know if that does not work.
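A possible sequence, for illustration only (the remote name and the checkout location are assumptions; the commit hash is the one referenced earlier in the thread):

cd $GOPATH/src/github.com/sylabs/singularity                          # existing checkout, keeping the upstream import path
git remote add cclerget https://github.com/cclerget/singularity.git   # hypothetical remote name
git fetch cclerget
git checkout 59d1f235e162c1b71248f0cd97cd7c7c955f50f4                 # fix commit referenced above
./mconfig && make -C builddir rpm                                     # RPM build command from this comment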

jscook2345 commented 5 years ago

Possibly related #2445

titansmc commented 5 years ago

Hi @jscook2345, this is what I get:

[root@haproxy03 singularity]# ./mconfig 
=> running base system checks ...
 checking: host C compiler... cc
 checking: host C++ compiler... c++
 checking: host `ar' path... ar
 checking: host `ld' path... ld
 checking: host `ranlib' path... ranlib
 checking: host `objcopy' path... objcopy
 checking: target C compiler... cc
 checking: target C++ compiler... c++
 checking: target `ar' path... ar
 checking: target `ld' path... ld
 checking: target `ranlib' path... ranlib
 checking: target `objcopy' path... objcopy
 checking: host compiles static binaries... no
 checking: target compiles static binaries... no
 checking: host os type... unix
 checking: host architecture... x86_64
 checking: target architecture... x86_64
 checking: host architecture word size... 64
 checking: target architecture word size... 64
=> running project specific checks ...
 checking: namespace: CLONE_NEWPID... yes
 checking: namespace: CLONE_FS... yes
 checking: namespace: CLONE_NEWNS... yes
 checking: namespace: CLONE_NEWUSER... yes
 checking: namespace: CLONE_NEWIPC... yes
 checking: namespace: CLONE_NEWNET... yes
 checking: namespace: CLONE_NEWUTS... yes
 checking: namespace: CLONE_NEWCGROUP... no
 checking: feature: NO_NEW_PRIVS... yes
 checking: feature: MS_SLAVE... yes
 checking: feature: MS_REC... yes
 checking: feature: MS_PRIVATE... yes
 checking: user capabilities... yes
 checking: header linux/securebits.h... yes
 checking: header linux/capability.h... yes
 checking: function setns... no
 checking: libssl+headers... yes
 checking: libuuid+headers... yes
 checking: libseccomp+headers... yes
INFO: userns and disable_suid are disabled. (mlocal/checks/basechecks.chk)
=> generating fragments ...
 found build modules:
   +-> internal/pkg/runtime/c/lib/util/util.mconf
=> building Makefile ...
=> generating singularity.spec ...
=> project setup with :
    - target arch: x86_64
    - target compiler: cc
    - target wordsize: 64-bit
    - host arch: x86_64
    - host wordsize: 64-bit
    - host system: unix
    - config profile: release
    - verbose: no
=> builddir/Makefile ready, try:
   $ cd builddir
   $ make
[root@haproxy03 singularity]# make -C builddir rpm
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/startergithub.com/sylabs/singularity/internal/app/starter’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/startergithub.com/sylabs/singularity/internal/pkg/build/types’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/startergithub.com/sylabs/singularity/internal/pkg/buildcfg’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/startergithub.com/sylabs/singularity/internal/pkg/cgroups’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/startergithub.com/sylabs/singularity/internal/pkg/image’: No such file or directory
.
.
.
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/go4.org/errorutil’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/cast5’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/armor’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/clearsign’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/elgamal’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/errors’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/packet’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/s2k’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/ssh/terminal’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/net/context’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/net/context/ctxhttp’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/net/internal/socks’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/net/proxy’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/sys/unix’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/gopkg.in/cheggaaa/pb.v1’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/gopkg.in/yaml.v2’: No such file or directory
make: Entering directory `/root/go/src/github.com/cclerget/singularity/builddir'
 DIST setup VERSION: 59d1f23-1
 DIST create tarball: singularity-59d1f23.tar.gz
 BUILD RPM
error: File /root/go/src/github.com/cclerget/singularity/singularity-59d1f235e162c1b71248f0cd97cd7c7c955f50f4.tar.gz: No such file or directory
/usr/bin/cat: /root/go/src/github.com/cclerget/singularity/singularity-59d1f235e162c1b71248f0cd97cd7c7c955f50f4.tar.gz: No such file or directory
error: File /root/go/src/github.com/cclerget/singularity/singularity-59d1f235e162c1b71248f0cd97cd7c7c955f50f4.tar.gz: No such file or directory
/usr/bin/cat: /root/go/src/github.com/cclerget/singularity/singularity-59d1f235e162c1b71248f0cd97cd7c7c955f50f4.tar.gz: No such file or directory
error: Failed to read spec file from /root/go/src/github.com/cclerget/singularity/singularity-59d1f235e162c1b71248f0cd97cd7c7c955f50f4.tar.gz
make: *** [rpm] Error 1
make: Leaving directory `/root/go/src/github.com/cclerget/singularity/builddir'
titansmc commented 5 years ago

Hi, is there any update on how I can generate an RPM for cclerget/singularity? Thanks.

chrarnold commented 5 years ago

Hi, we got this error again today, but for a different example with a rather short command (fewer than 1,000 characters), so command length cannot be the issue. What is really strange is that the error ONLY occurs when the command is executed from within Snakemake (a workflow manager with Singularity support). When I run the very same command on the command line outside of Snakemake, it works just fine. Any idea what the reason could be? The exact command does not matter, but this is how Singularity is called:

singularity exec --home /scratch/bla/input --bind /scratch --bind /g/easybuild/x86_64/CentOS/7/nehalem/software/snakemake/5.4.0-foss-2017b-Python-3.6.4/lib/python3.6/site-packages:/mnt/snakemake /bla3/Snakemake/singularity/6729945a2490413ce273448537b30684.simg bash -c 'some command'

Snakemake is written in Python (3.6, the version that triggers the error), and I assume it just uses a system-call function to run the singularity command - how could that have an effect?

You mentioned that this type of error is related to CPU load. While the CPU load is probably a bit higher when running within Snakemake, due to Snakemake monitoring and building the workflow, it is certainly not high enough for this error to occur - and upstream parts of the pipeline worked just fine with the same Singularity installation (version 3.0.2-1.el7).

chrarnold commented 5 years ago

Update: The error also depends on which Snakemake installation is used. When I use the software installed centrally by IT via "module load", the error comes up; when I use the exact same Snakemake version that I installed personally via conda, it works just fine. Might this somehow be related to Go as well? This looks like a bug to me and not just an unlikely event: the error happened in a predictable and repeatable fashion, for different users and machines, and merely changing the installation path of Snakemake (which calls singularity from within Python) suddenly made it work... Mysterious.

mboisson commented 5 years ago

@chrarnold, see my own comment earlier. I suspect that this has to do with "large environment", not "long command".

https://github.com/sylabs/singularity/issues/2428#issuecomment-445880515

chrarnold commented 5 years ago

Thanks, I see, that could also be a good explanation...

For the "module load" version that fails:

$ env | wc
    188     212   40304

So the environment is also very large there.

For the one that works:

$ env | wc
     63      87    4193

cclerget commented 5 years ago

@chrarnold @mboisson The fix has been pushed to the master branch; please give it a try and re-open this issue if it does not work. Thanks!

chrarnold commented 5 years ago

Thanks! Two quick questions: 1) Will this be part of 3.1.0 only, or what is the next release that will include this fix? 2) From the first diagnosis described above, this sounded like a complicated bug, with the first fix only lowering the probability of it happening - is the newest fix similar, or does it fix the problem altogether? Thanks!

ketan0 commented 5 years ago

@cclerget I am getting the same error when trying to run a command in a Docker-based environment through singularity on a computing cluster - here is the Docker image: https://hub.docker.com/r/amsan7/openpose/. Also, here is the command I execute, not that it would necessarily help:

singularity exec --nv $SINGULARITY_CACHEDIR/openpose-latest.sif bash -c \
    'cd /openpose-master && ./build/examples/openpose/openpose.bin \
    --no_display true \
    --render_pose 0 \
    --video $vid \
    --frame_rotate 180 \
    --face \
    --hand \
    --write_keypoint_json $output_dir'

So the comments above say this is related to environment size - how would one check that? I am on singularity version 3.0.3-1.el7. Is this a version where it should be fixed? Any help would be appreciated.
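For what it's worth, earlier comments measured the host environment with env | wc; a quick way to check the size of your own environment (the third column of wc is the byte count) is:

env | wc       # lines, words, bytes of the current environment
env | wc -c    # total size in bytes only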

jscook2345 commented 5 years ago

@ketan0

You should be able to test this with the 3.1.0 release candidate.

https://github.com/sylabs/singularity/releases/tag/v3.1.0-rc2

ketan0 commented 5 years ago

Thanks for that. Do you have a timeframe for when the official version will be released?

GodloveD commented 5 years ago

If we don't find any more bugs and if we can get all of the existing bugs fixed and the third release candidate into code freeze and fully tested in time... Monday? But that is seeming unlikely at this point. 😸