Merging #24 (5371030) into master (48ef2b3) will decrease coverage by 0.22%. The diff coverage is n/a.
@@            Coverage Diff             @@
##           master      #24      +/-   ##
==========================================
- Coverage   98.78%   98.56%   -0.23%
==========================================
  Files           5        5
  Lines         165      139      -26
==========================================
- Hits          163      137      -26
  Misses          2        2
| Impacted Files | Coverage Δ | |
|---|---|---|
| planckton/sim.py | 96.77% <0.00%> (-0.37%) | :arrow_down: |
| planckton/init.py | 100.00% <0.00%> (ø) | |
| planckton/utils/units.py | | |
| planckton/utils/solvate.py | | |
| planckton/utils/unit_conversions.py | 100.00% <0.00%> (ø) | |
| planckton/utils/base_units.py | 100.00% <0.00%> (ø) | |
I wonder if using bash in login mode (like this commit 4b7c21b) would help?
That might help. I've updated this PR and I think it will work now; let me know what questions you have about the Dockerfile.
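For reference, the difference login mode makes is that bash sources the profile files where conda's shell hook usually lives; a minimal sketch of the behavior, assuming `conda init bash` wrote its hook into `~/.bashrc`/`~/.bash_profile`:

```bash
# Non-login, non-interactive shells (the Docker/CI default) skip the profile
# files, so the conda hook is never loaded and `conda activate` fails:
bash -c "conda activate planckton"

# A login shell sources /etc/profile and ~/.bash_profile first, so the
# hook is loaded and activation works:
bash --login -c "conda activate planckton && python --version"
```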
I've pushed the image; see if you can check whether the GPU is working: `singularity pull docker://cmelab/planckton-gpu:dev`
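If it saves anyone a step, the same check can be scripted; an untested one-liner version of the interactive session below:

```bash
# Pull the image, then try to initialize HOOMD in GPU mode inside it
singularity pull docker://cmelab/planckton-gpu:dev
singularity exec --nv planckton-gpu_dev.sif \
    /opt/conda/envs/planckton/bin/python \
    -c "import hoomd; hoomd.context.initialize('--mode=gpu')"
```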
I tried it on Bridges but ran into some weirdness that I'm not sure is from the image.
I've got a ticket open with the XSEDE people, but if anyone wants to test, here are some steps on Bridges:
[mhenry@login005 mhenry]$ interact -p GPU-small --gres=gpu:p100:1
[mhenry@gpu048 mhenry]$ nvidia-smi
Tue Nov 17 13:30:28 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:87:00.0 Off | 0 |
| N/A 26C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[mhenry@gpu048 mhenry]$ singularity shell --nv planckton-gpu_dev.sif
Singularity> /opt/conda/envs/planckton/bin/python
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hoomd
>>> hoomd.context.initialize("--mode=gpu")
HOOMD-blue v2.9.3 CUDA (11.1) SINGLE SSE SSE2
Compiled: 11/17/20
Copyright (c) 2009-2019 The Regents of the University of Michigan.
-----
You are using HOOMD-blue. Please cite the following:
* J A Anderson, J Glaser, and S C Glotzer. "HOOMD-blue: A Python package for
high-performance molecular dynamics and hard particle Monte Carlo
simulations", Computational Materials Science 173 (2020) 109363
-----
initialization error
**ERROR**: No capable GPUs were found!
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/envs/planckton/lib/python3.7/site-packages/hoomd/context.py", line 249, in initialize
exec_conf = _create_exec_conf(mpi_conf, msg, options);
File "/opt/conda/envs/planckton/lib/python3.7/site-packages/hoomd/context.py", line 375, in _create_exec_conf
exec_conf = _hoomd.ExecutionConfiguration(exec_mode, gpu_vec, options.min_cpu, options.ignore_display, mpi_conf, msg);
RuntimeError: Error initializing execution configuration
>>>
Singularity> exit
Not sure if the issue getting a GPU is on me or on Bridges; I will keep troubleshooting this.
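One way to narrow it down is to check whether `--nv` is actually binding the host driver stack into the container; a sketch, assuming `/.singularity.d/libs` is where this Singularity version binds the NVIDIA libraries:

```bash
# If nvidia-smi works inside the container, the driver stack is bound correctly
singularity exec --nv planckton-gpu_dev.sif nvidia-smi

# The host's libcuda/libnvidia-* libraries should be listed here when --nv works
singularity exec --nv planckton-gpu_dev.sif ls /.singularity.d/libs
```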
I'm having trouble even pulling the image on Bridges:
[jfoth@login018 ~]$ singularity pull docker://cmelab/planckton-gpu:dev
INFO: Using cached SIF image
FATAL: While making image from oci registry: error copying image out of cache: could not copy file: write tmp-copy-617059245: disk quota exceeded
Is there something weird with my disk-space allowance? I hardly have anything on Bridges, and I am in my home dir.
Testing on Fry:
$ module load singularity
$ singularity pull docker://cmelab/planckton-gpu:dev
$ srun -p volta --pty bash
(base) [jennyfothergill@node16 ~]$ nvidia-smi
Tue Nov 17 14:26:01 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:81:00.0 Off | 0 |
| N/A 34C P0 35W / 250W | 0MiB / 16160MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ singularity shell --nv planckton-gpu_dev.sif
Singularity planckton-gpu_dev.sif:~> which python
/opt/conda/bin/python
$ /opt/conda/envs/planckton/bin/python
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hoomd
>>> hoomd.context.initialize("--mode=gpu")
HOOMD-blue v2.9.3 CUDA (11.1) SINGLE SSE SSE2
Compiled: 11/17/20
Copyright (c) 2009-2019 The Regents of the University of Michigan.
-----
You are using HOOMD-blue. Please cite the following:
* J A Anderson, J Glaser, and S C Glotzer. "HOOMD-blue: A Python package for
high-performance molecular dynamics and hard particle Monte Carlo
simulations", Computational Materials Science 173 (2020) 109363
-----
unknown error
**ERROR**: No capable GPUs were found!
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/envs/planckton/lib/python3.7/site-packages/hoomd/context.py", line 249, in initialize
exec_conf = _create_exec_conf(mpi_conf, msg, options);
File "/opt/conda/envs/planckton/lib/python3.7/site-packages/hoomd/context.py", line 375, in _create_exec_conf
exec_conf = _hoomd.ExecutionConfiguration(exec_mode, gpu_vec, options.min_cpu, options.ignore_display, mpi_conf, msg);
RuntimeError: Error initializing execution configuration
Same issue on Fry. Could it be because the GPU reports `CUDA Version: 10.2`, but HOOMD is compiled against CUDA 11.1 (`HOOMD-blue v2.9.3 CUDA (11.1) SINGLE SSE SSE2`)?
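If that's the problem, the driver is what matters: if I remember the compatibility tables right, CUDA 11.1 needs a 450+ driver branch, while both machines report 440.x. A quick host-side check (a sketch):

```bash
# Driver version on the host (440.xx in both transcripts above).
# A 440.x driver supports CUDA runtimes only up through 10.2, while HOOMD's
# startup banner shows it was built against CUDA 11.1 -- so
# "No capable GPUs were found!" is what I'd expect here.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```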
I will try rolling back to a lower CUDA version, but what really matters is whether the driver is compatible with the CUDA version. On Bridges you should `cd $SCRATCH` and do the Singularity image work there, since you only have a 10 GB quota in your home folder.
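Concretely, something like this should keep the pull out of the home quota (the cache/tmp variables are standard Singularity 3.x settings; the exact paths are just illustrative):

```bash
cd $SCRATCH
# Redirect the layer cache and temp space that `singularity pull` uses
export SINGULARITY_CACHEDIR=$SCRATCH/.singularity/cache
export SINGULARITY_TMPDIR=$SCRATCH/.singularity/tmp
mkdir -p "$SINGULARITY_CACHEDIR" "$SINGULARITY_TMPDIR"
singularity pull docker://cmelab/planckton-gpu:dev
```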
@mikemhenry is there anything I can do to help with this PR?
I'm a lot better at writing Dockerfiles now, so I've got some ideas to make this MUCH better. But before we work on that, I want to review the requirements and what the need is.
Is this container going to be used on HPC resources? What version of HOOMD do we want?
The immediate goal is to get this repo to where everyone can easily spin up simulations on a cluster. So, yes to HPC resources (Fry and XSEDE for sure). Eventually I want to update to HOOMD v3, but v2.9 is working now, so GPU support for v2.9 (v2.9.3, I think) first would be great.
A container with CUDA and conda: https://hub.docker.com/r/kundajelab/cuda-anaconda-base/
This is close to working; a few issues:

- With podman, when building the image it used the planckton env's python to build hoomd :heavy_check_mark:, but when I used docker, it linked hoomd to the base env :no_good:, so I will need to investigate that (see the check sketched below).

I'm going to first try getting things to work locally with a GPU using a hello-world image before I really start troubleshooting these issues.
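To see which environment hoomd actually landed in after a build, something like this should work with either engine (a sketch; swap in podman for docker as needed):

```bash
# If the build linked hoomd into the planckton env, this prints a path under
# /opt/conda/envs/planckton/.../site-packages; if it only imports with
# /opt/conda/bin/python instead, the build picked up the base env's python.
docker run --rm cmelab/planckton-gpu:dev \
    /opt/conda/envs/planckton/bin/python -c "import hoomd; print(hoomd.__file__)"
```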