@willfurnass @mondus Looks like it just stopped working suddenly; maybe it's due to driver updates? Drivers should have backwards compatibility with CUDA code, though.
@twinkarma How can one reproduce this issue?
@willfurnass looks like it'll fail on any model. I've just updated my local machine to driver version 384.90 and now getting the same errors.
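For anyone reproducing this, the host driver version can be confirmed with a standard nvidia-smi query:

# Report the installed NVIDIA driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader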
@twinkarma Will enough people encounter the issue to warrant notifying the HPC and/or RSE community mailing lists?
@willfurnass I've installed Singularity 2.4.2 on my machine and the --nv flag is now working properly; it also seems to fix this problem. I'm recommending we update the Singularity version, remove the old GPU bind paths from the config file, and advise everyone to use the --nv flag instead.
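For reference, usage with the new flag is just the following (the image and command names here are illustrative, not our actual setup):

# --nv binds the host's NVIDIA libraries and devices into the container
singularity exec --nv tensorflow.img python train.py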
... After testing that it works
@willfurnass I just mean updating the documentation, definitely no need for a group wide email about it.
@AnthonyBrookfield: FYI
Great. It'll be good to be using the same mechanism as other Singularity users to make GPUs usable in containers.
For reference, here's Greg Kurtzer's explanation of the --nv flag:

Singularity (now in development branch) supports the --nv flag, which will find the relevant Nvidia/Cuda libraries on your host via the ld.so.cache and bind those into a library location within the container automatically. It will also make sure those libraries are linked, as necessary, by any Cuda applications that require it. Additionally, with the device tree bound into the container, all of the components for appropriate application runtime support are present, and as has been tested, it just works. :)
I've got rpms for 2.4.2 built and ready to go... Do you want to test on a GPU node before I roll them out? (I'll be removing the bind path options for nvlib and nvbin from the singularity config file)
@anthonybrookfield Yes, I'd like to test it first before we roll it out.
@anthonybrookfield Any updates on this?
I've put the new version on the flybrain nodes (still using the old config file though)
I'm getting the following error:
WARNING: Skipping user bind, non existent bind point (file) in container: '/bin/nvidia-smi'
Looks like we have to enable overlay (singularityware/singularity#624) in the config file:

enable overlay = yes

or, to silently ignore bind points that can't be bound:

enable overlay = try
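For the record, this is a one-line edit to singularity.conf; something like the following would do it (the path assumes a default /etc install, adjust for your prefix):

# Enable overlay, falling back silently on kernels that lack support
sudo sed -i 's/^enable overlay = .*/enable overlay = try/' /etc/singularity/singularity.conf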
@anthonybrookfield Hi Anthony, any updates on this?
@anthonybrookfield Hi Anthony, any way we can move forwards with this?
Needed to track down the reason for the scary warning in the config file when enabling overlay:
note: currently disabled because RHEL7 kernel crashes with it... :(
Looks like it should be fine for RHEL >7.2 (https://github.com/singularityware/singularity/issues/228) so I've updated the config file to enable it. The flybrain nodes should now have "enable overlay = try", the rest of the cluster will pick up the change in the next couple of hours.
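To sanity-check the rollout on a node, something like this should do (image name illustrative):

# Confirm the config change has landed
grep '^enable overlay' /etc/singularity/singularity.conf
# Confirm --nv now binds nvidia-smi into the container without warnings
singularity exec --nv tensorflow.img nvidia-smi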
Sorry for the delay; I've just had a chance to test this. Looks like it works! We just need to remove the /nvbin and /nvlib bind points from the images, as they're no longer necessary and they interfere with the libraries that Singularity imports automatically.
I will get to work on updating the docs.
From the configs rather, I mean: remove the /nvbin and /nvlib bind points from the configs, not the images.
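For clarity, the entries to drop from singularity.conf would look roughly like this; the exact host paths are my guesses rather than a copy of the live config:

# Legacy GPU bind points to delete; --nv makes them redundant and they
# interfere with the libraries it imports (paths illustrative)
bind path = /usr/local/nvbin:/nvbin
bind path = /usr/local/nvlib:/nvlib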
OK - let me know when the docs are ready and then I'll remove those bind points from the config. Should be a quick change to make. Do you know if there are many users on the old method who will need informing of the change?
I've updated the singularity instructions now. See PR rcgsheffield/sheffield_hpc#776
singularity.conf file now updated. I'd suggest maybe posting to the HPC discussion group just to warn people that the old method won't work any more and pointing at the docs...
OK, I'll do that today.
This is the output from the test MNIST model, which failed on the Conv layer; other users have reported failures on the ReLU layer as well for different models.