Merging #24 (5371030) into master (48ef2b3) will decrease coverage by 0.22%. The diff coverage is n/a.
@@            Coverage Diff             @@
##           master      #24      +/-   ##
==========================================
- Coverage   98.78%   98.56%   -0.23%
==========================================
  Files           5        5
  Lines         165      139      -26
==========================================
- Hits          163      137      -26
  Misses          2        2
| Impacted Files | Coverage Δ | |
|---|---|---|
| planckton/sim.py | 96.77% <0.00%> (-0.37%) | :arrow_down: |
| planckton/init.py | 100.00% <0.00%> (ø) | |
| planckton/utils/units.py | | |
| planckton/utils/solvate.py | | |
| planckton/utils/unit_conversions.py | 100.00% <0.00%> (ø) | |
| planckton/utils/base_units.py | 100.00% <0.00%> (ø) | |
I wonder if using bash in login mode (like this commit 4b7c21b) would help?
That might help. I've updated this PR and I think it will work now; let me know what questions you have about the Dockerfile.
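For reference, the difference login mode makes is that bash sources the profile files where conda's shell hook usually lives; a minimal sketch of the behavior, assuming `conda init bash` wrote its hook into `~/.bashrc`/`~/.bash_profile`:

```bash
# Non-login, non-interactive shells (the Docker/CI default) skip the profile
# files, so the conda hook is never loaded and `conda activate` fails:
bash -c "conda activate planckton"

# A login shell sources /etc/profile and ~/.bash_profile first, so the
# hook is loaded and activation works:
bash --login -c "conda activate planckton && python --version"
```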
I've pushed the image; see if you can check whether the GPU is working: `singularity pull docker://cmelab/planckton-gpu:dev`
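If it saves anyone a step, the same check can be scripted; an untested one-liner version of the interactive session below:

```bash
# Pull the image, then try to initialize HOOMD in GPU mode inside it
singularity pull docker://cmelab/planckton-gpu:dev
singularity exec --nv planckton-gpu_dev.sif \
    /opt/conda/envs/planckton/bin/python \
    -c "import hoomd; hoomd.context.initialize('--mode=gpu')"
```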
I tried it on Bridges but ran into some weirdness that I'm not sure is from the image.
I've got a ticket open with the XSEDE people, but if anyone wants to test, here are some steps on Bridges:
[mhenry@login005 mhenry]$ interact -p GPU-small --gres=gpu:p100:1
[mhenry@gpu048 mhenry]$ nvidia-smi
Tue Nov 17 13:30:28 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:87:00.0 Off | 0 |
| N/A 26C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[mhenry@gpu048 mhenry]$ singularity shell --nv planckton-gpu_dev.sif
Singularity> /opt/conda/envs/planckton/bin/python
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hoomd
>>> hoomd.context.initialize("--mode=gpu")
HOOMD-blue v2.9.3 CUDA (11.1) SINGLE SSE SSE2
Compiled: 11/17/20
Copyright (c) 2009-2019 The Regents of the University of Michigan.
-----
You are using HOOMD-blue. Please cite the following:
* J A Anderson, J Glaser, and S C Glotzer. "HOOMD-blue: A Python package for
high-performance molecular dynamics and hard particle Monte Carlo
simulations", Computational Materials Science 173 (2020) 109363
-----
initialization error
**ERROR**: No capable GPUs were found!
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/envs/planckton/lib/python3.7/site-packages/hoomd/context.py", line 249, in initialize
exec_conf = _create_exec_conf(mpi_conf, msg, options);
File "/opt/conda/envs/planckton/lib/python3.7/site-packages/hoomd/context.py", line 375, in _create_exec_conf
exec_conf = _hoomd.ExecutionConfiguration(exec_mode, gpu_vec, options.min_cpu, options.ignore_display, mpi_conf, msg);
RuntimeError: Error initializing execution configuration
>>>
Singularity> exit
Not sure if the issue getting a GPU is on me or on Bridges; I will keep troubleshooting this.
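One way to narrow it down is to check whether `--nv` is actually binding the host driver stack into the container; a sketch, assuming `/.singularity.d/libs` is where this Singularity version binds the NVIDIA libraries:

```bash
# If nvidia-smi works inside the container, the driver stack is bound correctly
singularity exec --nv planckton-gpu_dev.sif nvidia-smi

# The host's libcuda/libnvidia-* libraries should be listed here when --nv works
singularity exec --nv planckton-gpu_dev.sif ls /.singularity.d/libs
```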
I'm having trouble even pulling the image on Bridges:
[jfoth@login018 ~]$ singularity pull docker://cmelab/planckton-gpu:dev
INFO: Using cached SIF image
FATAL: While making image from oci registry: error copying image out of cache: could not copy file: write tmp-copy-617059245: disk quota exceeded
Is there something weird with my disk-space allowance? I hardly have anything on Bridges, and I am in my home dir.
Testing on Fry:
$ module load singularity
$ singularity pull docker://cmelab/planckton-gpu:dev
$ srun -p volta --pty bash
(base) [jennyfothergill@node16 ~]$ nvidia-smi
Tue Nov 17 14:26:01 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:81:00.0 Off | 0 |
| N/A 34C P0 35W / 250W | 0MiB / 16160MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ singularity shell --nv planckton-gpu_dev.sif
Singularity planckton-gpu_dev.sif:~> which python
/opt/conda/bin/python
$ /opt/conda/envs/planckton/bin/python
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hoomd
>>> hoomd.context.initialize("--mode=gpu")
HOOMD-blue v2.9.3 CUDA (11.1) SINGLE SSE SSE2
Compiled: 11/17/20
Copyright (c) 2009-2019 The Regents of the University of Michigan.
-----
You are using HOOMD-blue. Please cite the following:
* J A Anderson, J Glaser, and S C Glotzer. "HOOMD-blue: A Python package for
high-performance molecular dynamics and hard particle Monte Carlo
simulations", Computational Materials Science 173 (2020) 109363
-----
unknown error
**ERROR**: No capable GPUs were found!
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/envs/planckton/lib/python3.7/site-packages/hoomd/context.py", line 249, in initialize
exec_conf = _create_exec_conf(mpi_conf, msg, options);
File "/opt/conda/envs/planckton/lib/python3.7/site-packages/hoomd/context.py", line 375, in _create_exec_conf
exec_conf = _hoomd.ExecutionConfiguration(exec_mode, gpu_vec, options.min_cpu, options.ignore_display, mpi_conf, msg);
RuntimeError: Error initializing execution configuration
Same issue on Fry. Could it be because the GPU reports `CUDA Version: 10.2`, but HOOMD is compiled against CUDA 11.1 (`HOOMD-blue v2.9.3 CUDA (11.1) SINGLE SSE SSE2`)?
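If that's the problem, the driver is what matters: if I remember the compatibility tables right, CUDA 11.1 needs a 450+ driver branch, while both machines report 440.x. A quick host-side check (a sketch):

```bash
# Driver version on the host (440.xx in both transcripts above).
# A 440.x driver supports CUDA runtimes only up through 10.2, while HOOMD's
# startup banner shows it was built against CUDA 11.1 -- so
# "No capable GPUs were found!" is what I'd expect here.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```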
I will try rolling back to a lower CUDA version, but what really matters is whether the driver is compatible with the CUDA version. On Bridges you should `cd $SCRATCH` and do the Singularity image work there, since you only have a 10 GB quota in your home folder.
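Concretely, something like this should keep the pull out of the home quota (the cache/tmp variables are standard Singularity 3.x settings; the exact paths are just illustrative):

```bash
cd $SCRATCH
# Redirect the layer cache and temp space that `singularity pull` uses
export SINGULARITY_CACHEDIR=$SCRATCH/.singularity/cache
export SINGULARITY_TMPDIR=$SCRATCH/.singularity/tmp
mkdir -p "$SINGULARITY_CACHEDIR" "$SINGULARITY_TMPDIR"
singularity pull docker://cmelab/planckton-gpu:dev
```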
@mikemhenry is there anything I can do to help with this PR?
I'm a lot better at writing Dockerfiles now, so I've got some ideas to make this MUCH better. But before we work on that, I want to review the requirements and what the need is.
Is this container going to be used on HPC resources? What version of HOOMD do we want?
The immediate goal is to get this repo to where everyone can easily spin up simulations on a cluster. So, yes to HPC resources (Fry and XSEDE for sure). Eventually I want to update to HOOMD v3, but v2.9 is working now, so GPU support for v2.9 (v2.9.3, I think) first would be great.
A container with CUDA and conda: https://hub.docker.com/r/kundajelab/cuda-anaconda-base/
This is close to working; a few issues:

- With podman, when building the image it used the planckton env's python to build hoomd :heavy_check_mark:, but when I used docker, it linked hoomd to the base env :no_good:, so I will need to investigate that (see the check sketched below).

I'm going to first try getting things to work locally with a GPU using a hello-world image before I really start troubleshooting these issues.
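To see which environment hoomd actually landed in after a build, something like this should work with either engine (a sketch; swap in podman for docker as needed):

```bash
# If the build linked hoomd into the planckton env, this prints a path under
# /opt/conda/envs/planckton/.../site-packages; if it only imports with
# /opt/conda/bin/python instead, the build picked up the base env's python.
docker run --rm cmelab/planckton-gpu:dev \
    /opt/conda/envs/planckton/bin/python -c "import hoomd; print(hoomd.__file__)"
```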