Closed GodloveD closed 7 years ago
Hi,
Could you send me the core dump off-list? Maybe as a PM on slack if it's too large for email?
Brian
Thanks very much Brian. I just PM-ed you.
Update from private discussion. The kernel traceback (note - RHEL6) indicates that the kernel is panicking when the following sequence happens: the lockd thread has a null pointer dereference when unlocking.

@GodloveD - if I gave you a patch that would disable the locking singularity performs, would you be able to test it? Would you want the patch against master or a specific revision?
Sure. I can have a look. master should be fine.
do you have any update from Red Hat on the issue? I
The lib-refactor branch doesn't do locking of the loop device cache file anymore (but it does flock() the session directory still). If possible, can you test the lib-refactor branch too?
Note: be sure to install to different prefixes; if any shared libraries or objects from the other build are left around, it could cause headaches.
Just installed and tested the lib-refactor branch (https://github.com/singularityware/singularity/commit/3979bba1c8c595d6c04798c1a7a313dd5bdfd63c). I can verify that it does NOT cause a kernel panic with the copy command :smile_cat: And I verified that the latest version of master (https://github.com/singularityware/singularity/commit/6007d3a1a46790bf1342eb99e2ac0aedd2e9cdb1) DOES still cause a kernel panic with the copy command.
I also tried the create, bootstrap, exec, export and import commands. None of these commands caused a kernel panic, but I did run into a problem with the export/import commands. An edited session transcript follows:
$ singularity create test.img
$ singularity bootstrap test.img singularity/examples/centos.def
$ echo wutini! > jawa.sez
$ singularity copy test.img jawa.sez /
$ singularity exec test.img cat /jawa.sez
wutini!
$ singularity create -s 500 test2.img
$ singularity export test.img | singularity import test2.img
ERROR : Failed to exec program /usr/bin/tar: No such file or directory
ABORT : Retval = 255
ERROR : Tar did not return successful
ERROR : Failed to exec program /usr/bin/tar: No such file or directory
ABORT : Retval = 255
ERROR : Tar did not return successful
$ which tar
/bin/tar
This error was encountered on a CentOS 6 compute node in the Biowulf cluster. When I installed the same lib-refactor branch (https://github.com/singularityware/singularity/commit/6007d3a1a46790bf1342eb99e2ac0aedd2e9cdb1) on a Google cloud VM running Ubuntu 16.04, I was unable to replicate the bug:
$ sudo singularity export ubuntu.img | sudo singularity import test1.img
Assuming import from incoming pipe
Bootstrap initialization
No bootstrap definition passed, updating container
Executing Prebootstrap module
Executing Postbootstrap module
Done.
So it seems the kernel panic bug is fixed, but I may have exposed a new bug. Should I raise a new issue, or is this not a problem?
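
For reference, the transcript shows the tool being exec'd as /usr/bin/tar while `which tar` reports /bin/tar on the CentOS 6 node. A minimal sketch of looking the tool up instead of assuming one fixed location might look like the following. The `find_tool` helper is hypothetical and is not part of singularity; how singularity actually locates tar is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not from the singularity source): print the path of
# an executable, preferring whatever PATH resolves and then falling back
# to a few common system directories.
find_tool() {
    tool=$1
    # command -v is the POSIX way to resolve a command name through PATH.
    if resolved=$(command -v "$tool" 2>/dev/null) && [ -x "$resolved" ]; then
        printf '%s\n' "$resolved"
        return 0
    fi
    # Fallback: check common locations explicitly.
    for dir in /bin /usr/bin /usr/local/bin; do
        if [ -x "$dir/$tool" ]; then
            printf '%s\n' "$dir/$tool"
            return 0
        fi
    done
    # Nothing found: signal failure to the caller.
    return 1
}
```

Calling `find_tool tar` on each node would show which path is actually present there; a non-zero exit status means no executable was found in PATH or in the fallback directories.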
^Tested and confirmed! Feel free to close.
Rockin, thanks @GodloveD!
The following command reliably produces a kernel panic on the NIH HPC Biowulf cluster. (3 tests on 2 different nodes on 2 separate days with 2 different versions of singularity [2.2 and latest master]).
Testing on a Google VM with the exact same kernel does not produce a panic. Here is some basic info from one of the crash dumps. I can provide more if someone can tell me what kind of info would be useful.