apptainer / singularity

Singularity has been renamed to Apptainer as part of us moving the project to the Linux Foundation. This repo has been persisted as a snapshot right before the changes.
https://github.com/apptainer/apptainer

copy command results in kernel panic #452

Closed GodloveD closed 7 years ago

GodloveD commented 7 years ago

The following command reliably produces a kernel panic on the NIH HPC Biowulf cluster (3 tests on 2 different nodes on 2 separate days with 2 different versions of Singularity: 2.2 and latest master).

$ sudo singularity copy some.img /some/file /some/location

Testing on a Google VM with the exact same kernel does not produce a panic. Here is some basic info from one of the crash dumps. I can provide more if someone can tell me what kind of info would be useful.

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.3.1.el6.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2017-01-09-14:27:28/vmcore  [PARTIAL DUMP]
        CPUS: 56
        DATE: Mon Jan  9 14:27:13 2017
      UPTIME: 00:02:59
LOAD AVERAGE: 0.46, 0.47, 0.20
       TASKS: 1407
    NODENAME: cn1129
     RELEASE: 2.6.32-642.3.1.el6.x86_64
     VERSION: #1 SMP Tue Jul 12 18:30:56 UTC 2016
     MACHINE: x86_64  (2294 Mhz)
      MEMORY: 255.9 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000008"
         PID: 26201
     COMMAND: "image-mount"
        TASK: ffff88204ee36ab0  [THREAD_INFO: ffff8820283b4000]
         CPU: 33
       STATE: TASK_RUNNING (PANIC)

bbockelm commented 7 years ago

Hi,

Could you send me the core dump off-list? Maybe as a PM on slack if it's too large for email?

Brian

GodloveD commented 7 years ago

Thanks very much Brian. I just PM-ed you.

bbockelm commented 7 years ago

Update from private discussion. The kernel traceback (note - RHEL6) indicates that the kernel is panicking when the following sequence happens:

  1. Last process exits from namespace.
  2. Kernel cleans up all unnecessary mounts.
  3. Block device closes, meaning kernel starts to clear up loopback.
  4. Loopback cleanup code removes all existing locks on device.
  5. Lock on device is kept on NFS server, so control passes to NFS code.
  6. NFS's lockd thread has a null pointer dereference when unlocking.
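
Step 5 only applies when the image file is stored on NFS, so checking the filesystem type under the image shows whether a node can hit this code path at all. The following is a minimal sketch, not something posted in this thread: `fs_type` and `image_on_nfs` are hypothetical helper names, and it assumes GNU coreutils `stat`, where `-f -c %T` prints the filesystem type.

```shell
#!/bin/sh
# Hypothetical helper: print the type of the filesystem that holds a path.
# Assumes GNU coreutils stat; %T is the human-readable type, e.g. "nfs".
fs_type() {
    stat -f -c %T -- "$1"
}

# Hypothetical helper: succeed if the path lives on an NFS mount, the case
# in which the lock is held on the NFS server rather than locally.
image_on_nfs() {
    case "$(fs_type "$1")" in
        nfs*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Example: check the current directory (substitute the real image path).
if image_on_nfs .; then
    echo "image is on NFS: the lock cleanup goes through the NFS code"
else
    echo "image is on $(fs_type .): the NFS lock path is not involved"
fi
```

Running the same check on a node that panics and on one that does not is a cheap way to confirm whether storage location is the difference between them.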

bbockelm commented 7 years ago

@GodloveD - if I gave you a patch that would disable the locking singularity performs, would you be able to test it? Would you want the patch against the master or a specific revision?

GodloveD commented 7 years ago

Sure. I can have a look. master should be fine.

truatpasteurdotfr commented 7 years ago

Do you have any update from Red Hat on the issue?

gmkurtzer commented 7 years ago

The lib-refactor branch doesn't do locking of the loop device cache file anymore (but it does flock() the session directory still). If possible, can you test the lib-refactor branch too?

note: be sure to install to different prefixes, as if there are any shared libraries or objects around, it could cause headaches.
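
For readers unfamiliar with the kind of lock that remains on the session directory, here is a rough illustration using the `flock(1)` utility from util-linux, which wraps the same `flock()` call. This is not Singularity's code: the directory is a temporary stand-in, since the real session path is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical stand-in for the session directory.
session_dir=$(mktemp -d)

# Hold an exclusive lock on the directory while the inner script runs.
# -n makes a competing locker fail immediately instead of blocking.
result=$(flock -n "$session_dir" sh -c '
    # A second lock attempt through a fresh open of the same directory.
    if flock -n "$1" true; then
        echo "second lock: acquired"
    else
        echo "second lock: refused"
    fi
' sh "$session_dir")

echo "$result"
rmdir "$session_dir"
```

The lock is advisory and is dropped when the holding process exits, so nothing needs to be cleaned up explicitly. That matters here because the panic described earlier was triggered during lock cleanup on an NFS-backed file.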

GodloveD commented 7 years ago

Just installed and tested the lib-refactor branch (https://github.com/singularityware/singularity/commit/3979bba1c8c595d6c04798c1a7a313dd5bdfd63c). I can verify that it does NOT cause a kernel panic with the copy command :smile_cat: And I verified that the latest version of master (https://github.com/singularityware/singularity/commit/6007d3a1a46790bf1342eb99e2ac0aedd2e9cdb1) DOES still cause a kernel panic with the copy command.

I also tried the create, bootstrap, exec, export and import commands. None of these commands caused a kernel panic but I did run into a problem with the export/import commands. An edited session transcript follows:

$ singularity create test.img
$ singularity bootstrap test.img singularity/examples/centos.def
$ echo wutini! > jawa.sez
$ singularity copy test.img jawa.sez /
$ singularity exec test.img cat /jawa.sez
wutini!
$ singularity create -s 500 test2.img
$ singularity export test.img | singularity import test2.img
ERROR  : Failed to exec program /usr/bin/tar: No such file or directory
ABORT  : Retval = 255
ERROR  : Tar did not return successful
ERROR  : Failed to exec program /usr/bin/tar: No such file or directory
ABORT  : Retval = 255
ERROR  : Tar did not return successful
$ which tar
/bin/tar

This error was encountered on a CentOS 6 compute node in the Biowulf cluster. When I installed the same lib-refactor branch (https://github.com/singularityware/singularity/commit/6007d3a1a46790bf1342eb99e2ac0aedd2e9cdb1) on a Google cloud VM running Ubuntu 16.04 I was unable to replicate the bug:

$ sudo singularity export ubuntu.img | sudo singularity import test1.img
Assuming import from incoming pipe
Bootstrap initialization
No bootstrap definition passed, updating container
Executing Prebootstrap module
Executing Postbootstrap module
Done.

So it seems the kernel panic bug is fixed, but I may have exposed a new bug. Should I raise a new issue, or is this not a problem?
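
The transcript above shows the mismatch behind the export/import failure: the error names `/usr/bin/tar`, while `which tar` on the CentOS 6 node reports `/bin/tar`. One way to avoid depending on a fixed location is to resolve `tar` through `PATH`. The sketch below only illustrates that idea and is not necessarily how the problem was fixed upstream. The variable names and temporary directories are made up for the example.

```shell
#!/bin/sh
# Resolve tar through PATH instead of assuming /usr/bin/tar.
# command -v is POSIX and prints the location of the first match.
tar_bin=$(command -v tar) || {
    echo "ERROR: tar not found in PATH" >&2
    exit 1
}
echo "using tar at: $tar_bin"

# Round-trip a small tree through the resolved binary, loosely mirroring
# the export | import pipe (temporary directories are illustrative only).
src=$(mktemp -d)
dst=$(mktemp -d)
echo 'wutini!' > "$src/jawa.sez"
"$tar_bin" -cf - -C "$src" . | "$tar_bin" -xf - -C "$dst"
roundtrip=$(cat "$dst/jawa.sez")
echo "$roundtrip"
rm -rf "$src" "$dst"
```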

gmkurtzer commented 7 years ago

Fixed here:

https://github.com/singularityware/singularity/commit/bc6101233c1396820a79386e7b0092b95e5d83a4

GodloveD commented 7 years ago

^Tested and confirmed! Feel free to close.

gmkurtzer commented 7 years ago

Rockin, thanks @GodloveD!