Closed chrarnold closed 5 years ago
Hi @chrarnold, I'm trying to reproduce it naively with singularity exec -B $(perl -e 'print "A"x100000') image
; above 65,536 characters Singularity still blocks on the socket write. Still investigating.
Thanks for checking so quickly. I am curious what you find and whether it really is that issue. Nobody seems to have reported this error before, and it happens only for this case, so it must be related to the specific command. Since its only "special" characteristic is its length, I don't know what else it could be right now.
@chrarnold Here is what I got after spending hours on this one. The error means the Go process died and closed all file descriptors that were shared with the C parent process (via CLONE_FILES during fork). This is not supposed to happen, because the file descriptor used by the C process is never explicitly closed or even touched by the Go process. I double-checked this by preventing the Go process from exiting and killing it from the parent process only after the data was received (to ensure the file descriptors were not closed), and also by removing the call that sets the parent-death signal, to rule out a signal race condition; I got the same results with the script below.
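As a minimal illustration of the symptom (not the root cause): writing to a file descriptor after it has been closed fails with the same "Bad file descriptor" (EBADF) error that shows up in the JSON-configuration message. A sketch in plain bash:

```shell
#!/bin/bash
# Open fd 3, close it again, then attempt a write: the write fails
# with EBADF, the same errno behind the
# "Failed to read JSON configuration from socket: Bad file descriptor" message.
exec 3>/dev/null   # open descriptor 3
exec 3>&-          # close it
if echo '{}' >&3 2>/dev/null; then
    echo "write succeeded"
else
    echo "write failed: bad file descriptor (EBADF)"
fi
```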
This issue appears randomly under heavy CPU load. Here is the script I used to stress the CPU while running singularity with big command-line arguments:
#!/bin/bash
image="/home/ced/bundles/shell/rootfs"
cpus=$(nproc)
workers=$(($cpus*$cpus))
stress --cpu $workers --timeout 300 &
big=$(perl -e 'print "A"x60000')
echo 3|sudo tee /proc/sys/vm/drop_caches
sleep 5
for i in $(seq 1 100); do
    singularity exec -B $big $image id 2>/dev/null
    ret=$?
    echo -en "run $i: "
    if [ $ret -eq 1 ]; then
        echo "failure"
    else
        echo "ok"
    fi
done
killall stress
When "failure" is displayed, it means Failed to read JSON configuration from socket
occurred. With the current tree, the number of failures per run of this script (run multiple times) is between 35 and 50; with this fix in my branch: https://github.com/cclerget/singularity/commit/59d1f235e162c1b71248f0cd97cd7c7c955f50f4#diff-43c7705401fa82763da9a77ad2c7b0b2R47 the number of failures drops to about 10-22 (also run multiple times).
The fix forces the socket write to happen from the Go main thread. The failure rate increases when the write isn't done in the Go main thread or when GOMAXPROCS is not equal to one (with or without the fix), so it looks like a threading issue. I wonder if it's related to systemd cgroups (even though systemd sets no restriction), or if it's a Go thread/runtime bug or a kernel bug.
Hi, wow, that looks like a nasty bug... Great job digging so deep into it; I hope we identify the real cause. I spoke to the admins, and here is the information for the machine where the error occurs reproducibly (I tested this 2-3 times, and it crashed in all cases; as a side note, from your description it sounds like it only sometimes failed for you):
$ cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)
$ uname -a
Linux [hidden] 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ go
-bash: go: command not found
Also, it is systemd managed, yes.
Until this is fixed, do you suggest using your branch in the hope that it works there, or is there an alternative solution you can think of?
Thank you!
@chrarnold You can give my branch a try, and if that works I will submit a PR to merge it into the master branch.
Hi. I am running into this issue as well, though not with a long command but with a large environment in the host's shell session.
$ /opt/software/singularity-3.0/bin/singularity shell --nv pytorch-18.01-py3.simg
WARNING: ldconfig execution error: exit status 1
INFO: Unable to capture nvidia bind points: exit status 1
Singularity pytorch-18.01-py3.simg:~> exit
exit
$ env | wc
160 189 14362
$ export VAR=somevalue$VAR$VAR$VAR
$ env | wc
160 189 14425
$ /opt/software/singularity-3.0/bin/singularity shell --nv pytorch-18.01-py3.simg
WARNING: ldconfig execution error: exit status 1
INFO: Unable to capture nvidia bind points: exit status 1
ERROR : Failed to read JSON configuration from socket: Bad file descriptor
The $VAR environment variable obviously does not play any functional role; it is just there to increase the size of the environment.
Hi @cclerget , how should I build the RPM package to test it? I am getting some errors when building it as per singularity instructions. Thanks.
@titansmc
As of version 3.0.1, you can just run ./mconfig && make -C builddir rpm
to build an RPM.
After you check out @cclerget 's branch, you should be able to build the RPM with the commands above. Let me know if that does not work.
Possibly related #2445
Hi @jscook2345 this is what I get:
[root@haproxy03 singularity]# ./mconfig
=> running base system checks ...
checking: host C compiler... cc
checking: host C++ compiler... c++
checking: host `ar' path... ar
checking: host `ld' path... ld
checking: host `ranlib' path... ranlib
checking: host `objcopy' path... objcopy
checking: target C compiler... cc
checking: target C++ compiler... c++
checking: target `ar' path... ar
checking: target `ld' path... ld
checking: target `ranlib' path... ranlib
checking: target `objcopy' path... objcopy
checking: host compiles static binaries... no
checking: target compiles static binaries... no
checking: host os type... unix
checking: host architecture... x86_64
checking: target architecture... x86_64
checking: host architecture word size... 64
checking: target architecture word size... 64
=> running project specific checks ...
checking: namespace: CLONE_NEWPID... yes
checking: namespace: CLONE_FS... yes
checking: namespace: CLONE_NEWNS... yes
checking: namespace: CLONE_NEWUSER... yes
checking: namespace: CLONE_NEWIPC... yes
checking: namespace: CLONE_NEWNET... yes
checking: namespace: CLONE_NEWUTS... yes
checking: namespace: CLONE_NEWCGROUP... no
checking: feature: NO_NEW_PRIVS... yes
checking: feature: MS_SLAVE... yes
checking: feature: MS_REC... yes
checking: feature: MS_PRIVATE... yes
checking: user capabilities... yes
checking: header linux/securebits.h... yes
checking: header linux/capability.h... yes
checking: function setns... no
checking: libssl+headers... yes
checking: libuuid+headers... yes
checking: libseccomp+headers... yes
INFO: userns and disable_suid are disabled. (mlocal/checks/basechecks.chk)
=> generating fragments ...
found build modules:
+-> internal/pkg/runtime/c/lib/util/util.mconf
=> building Makefile ...
=> generating singularity.spec ...
=> project setup with :
- target arch: x86_64
- target compiler: cc
- target wordsize: 64-bit
- host arch: x86_64
- host wordsize: 64-bit
- host system: unix
- config profile: release
- verbose: no
=> builddir/Makefile ready, try:
$ cd builddir
$ make
[root@haproxy03 singularity]# make -C builddir rpm
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/startergithub.com/sylabs/singularity/internal/app/starter’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/startergithub.com/sylabs/singularity/internal/pkg/build/types’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/startergithub.com/sylabs/singularity/internal/pkg/buildcfg’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/startergithub.com/sylabs/singularity/internal/pkg/cgroups’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/startergithub.com/sylabs/singularity/internal/pkg/image’: No such file or directory
.
.
.
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/go4.org/errorutil’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/cast5’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/armor’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/clearsign’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/elgamal’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/errors’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/packet’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/openpgp/s2k’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/crypto/ssh/terminal’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/net/context’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/net/context/ctxhttp’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/net/internal/socks’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/net/proxy’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/golang.org/x/sys/unix’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/gopkg.in/cheggaaa/pb.v1’: No such file or directory
find: ‘/root/go/src/github.com/cclerget/singularity/cmd/singularitygithub.com/sylabs/singularity/vendor/gopkg.in/yaml.v2’: No such file or directory
make: Entering directory `/root/go/src/github.com/cclerget/singularity/builddir'
DIST setup VERSION: 59d1f23-1
DIST create tarball: singularity-59d1f23.tar.gz
BUILD RPM
error: File /root/go/src/github.com/cclerget/singularity/singularity-59d1f235e162c1b71248f0cd97cd7c7c955f50f4.tar.gz: No such file or directory
/usr/bin/cat: /root/go/src/github.com/cclerget/singularity/singularity-59d1f235e162c1b71248f0cd97cd7c7c955f50f4.tar.gz: No such file or directory
error: File /root/go/src/github.com/cclerget/singularity/singularity-59d1f235e162c1b71248f0cd97cd7c7c955f50f4.tar.gz: No such file or directory
/usr/bin/cat: /root/go/src/github.com/cclerget/singularity/singularity-59d1f235e162c1b71248f0cd97cd7c7c955f50f4.tar.gz: No such file or directory
error: Failed to read spec file from /root/go/src/github.com/cclerget/singularity/singularity-59d1f235e162c1b71248f0cd97cd7c7c955f50f4.tar.gz
make: *** [rpm] Error 1
make: Leaving directory `/root/go/src/github.com/cclerget/singularity/builddir'
Hi, is there any update on how I can generate the RPM for cclerget/singularity? Thanks.
Hi, we got this error again today, but for a different example with a rather short command (fewer than 1,000 characters), so command length cannot be the issue. What is really weird is the following: the error ONLY occurs when executed from within Snakemake (a workflow manager with Singularity support). When I run the very same command on the command line outside of Snakemake, it works just fine. Any idea what the reason could be? The exact command does not matter, but this is how Singularity is called:
singularity exec --home /scratch/bla/input --bind /scratch --bind /g/easybuild/x86_64/CentOS/7/nehalem/software/snakemake/5.4.0-foss-2017b-Python-3.6.4/lib/python3.6/site-packages:/mnt/snakemake /bla3/Snakemake/singularity/6729945a2490413ce273448537b30684.simg bash -c 'some command'
Snakemake is written in Python (3.6, the version that causes the error), and I guess it just uses a system-call function to run the singularity command; how could that have an effect?
You mentioned that this type of error is related to CPU load. It is certainly true that the CPU load is probably a bit higher when run within Snakemake, due to Snakemake monitoring and building the workflow, but certainly not so much higher that this error should occur; and it worked just fine for upstream parts of the pipeline, using the same Singularity installation (singularity version 3.0.2-1.el7).
Update: The error also depends on which Snakemake installation is used... When I use the software that is centrally installed by IT via "module load", the error comes up. However, when using the exact same Snakemake version that I installed personally via conda, it works just fine. Might this somehow be related to Go as well? This looks like a bug to me, not like an event that is just unlikely to occur: the error happened in a predictable and repeatable fashion, using different users and machines. Just changing the installation path of Snakemake (which calls singularity from within Python) suddenly made it work... Mysterious.
@chrarnold, see my own comment earlier. I suspect that this has to do with "large environment", not "long command".
https://github.com/sylabs/singularity/issues/2428#issuecomment-445880515
Thanks, I see, that could also be a good explanation yes...
$ env | wc
188 212 40304
So the environment is also very large for the "module load" version that failed to work.
For the one that works:
$ env | wc
63 87 4193
@chrarnold @mboisson The fix has been pushed to the master branch; please give it a try and re-open this issue if it doesn't work. Thanks!
Thanks! Two quick questions: 1) Will this be part of 3.1.0 only, or what is the next release that will include this fix? 2) From the first diagnosis described above, this sounded like a complicated bug, with the first fix just lowering the probability that it happens; does the newest fix do something similar, or does it fix the problem altogether? Thanks!
@cclerget I am getting the same error when trying to run a command in a Docker environment through Singularity on a computing cluster. Here is the Docker image: https://hub.docker.com/r/amsan7/openpose/. Also, here is the command I execute, not that this would necessarily help:
singularity exec --nv $SINGULARITY_CACHEDIR/openpose-latest.sif bash -c \
'cd /openpose-master && ./build/examples/openpose/openpose.bin \
--no_display true \
--render_pose 0 \
--video $vid \
--frame_rotate 180 \
--face \
--hand \
--write_keypoint_json $output_dir'
So the above comments say that this is related to environment size? How would one check that? I am on singularity version 3.0.3-1.el7; is this a version where it should be fixed? Any help would be appreciated.
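To inspect the environment size that the shell passes to singularity, a rough sketch (multi-line variables can skew the per-variable listing, so treat it as an approximation):

```shell
#!/bin/bash
# Total size of the exported environment, in bytes:
env | wc -c

# The five largest variables by line length (size printed before the name):
env | awk -F= '{ print length($0), $1 }' | sort -rn | head -5
```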
@ketan0
You should be able to test this with the 3.1.0 release candidate.
https://github.com/sylabs/singularity/releases/tag/v3.1.0-rc2
Thanks for that. Do you have a timeframe for when the official version will be released?
If we don't find any more bugs and if we can get all of the existing bugs fixed and the third release candidate into code freeze and fully tested in time... Monday? But that is seeming unlikely at this point. 😸
Version of Singularity:
3.0.1-73.ga50b5b3.el7
Expected behavior
The container runs the command. This particular container and Singularity work fine on our system and fail only for the example described here.
Actual behavior
An error occurs: Failed to read JSON configuration from socket: Bad file descriptor
Steps to reproduce behavior
I believe this issue happens because of the extremely long command. In my example, the command that fails reproducibly with the error message above has over 30,000 characters (essentially a list of hundreds of file paths that collectively form one command).
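A hypothetical way to build a comparably long command line for testing (the file names are made up, and image.simg stands in for a real image):

```shell
#!/bin/bash
# Build an argument list of 2000 fake paths, 15 characters each,
# i.e. a ~30,000-character command line like the failing one.
args=$(printf '/tmp/file%05d ' $(seq 1 2000))
echo "argument length: ${#args}"
# singularity exec image.simg ls $args   # would trigger the error on affected versions
```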
The error happens here (see here, line 758):