NeuroDesk / neurocontainers

The containers can be used in combination with our transparent singularity or neurocommand tools, which wrap the executables inside a container to make them easily available for pipelines.
https://www.neurodesk.org
MIT License

New container relion 4.0.1.sm61 #620

Closed by vnm-neurodesk 2 months ago

vnm-neurodesk commented 7 months ago

There is a new container by @stebo85; use this command to test it:

bash /neurocommand/local/fetch_and_run.sh relion 4.0.1.sm61 20240318

If the test was successful, then add it to apps.json to release it: https://github.com/NeuroDesk/neurocommand/edit/main/neurodesk/apps.json

Please close this issue when completed :)

stebo85 commented 7 months ago

@vennand - could you test this container and see if it all works as expected?

vennand commented 7 months ago

How do I transfer data to the neurodesktop to test? It opens as expected, but I need to launch a job to know if it'll work.

stebo85 commented 7 months ago

Are you running neurodesktop locally in docker? If yes, you have a shared directory between the desktop and the host.

Alternatively, you can drag and drop files onto the desktop and Guacamole will upload them (it has to be a single file, not a directory).
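
For the shared-directory route, something like this should do it (a sketch - test_data.mrc is just a placeholder name, and the paths assume the default ~/neurodesktop-storage bind mount):

# on the host: copy the test data into the shared folder
cp /path/to/test_data.mrc ~/neurodesktop-storage/

# inside the neurodesktop session the same file then shows up under
ls /neurodesktop-storage/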

vennand commented 7 months ago

I'm trying locally in docker, and I just noticed the directory, thanks!

Is it possible to do a GPU passthrough with the local docker? I'm pretty sure I won't be able to test if the GPU settings work otherwise. Though so far, there was no error message saying it was CPU only.

Though I'm not convinced it compiled with GPU support if the machine that built the container didn't have a GPU. With the new version of RELION (ver5.0), they explicitly state that the compiler tries to detect a GPU and, if none is found, compiles for CPU only, even if a GPU architecture is provided.
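
For reference, this is roughly how the architecture gets pinned at build time (a sketch based on my understanding of RELION's CMake options, not the exact recipe used for this container; 61 corresponds to the sm61 tag, i.e. Pascal cards like the P40):

git clone https://github.com/3dem/relion.git
cd relion && git checkout ver4.0
mkdir build && cd build
# -DCUDA_ARCH pins the compute capability to build for
# (as noted above, RELION 5's build reportedly falls back to CPU-only if no GPU is detected)
cmake -DCUDA=ON -DCUDA_ARCH=61 ..
make -j"$(nproc)"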

stebo85 commented 7 months ago

Dear @vennand

yes, you can pass your GPU into the docker container:

sudo docker run \
  --shm-size=1gb -it --privileged --user=root --name neurodesktop \
  -v ~/neurodesktop-storage:/neurodesktop-storage \
  -e NB_UID="$(id -u)" -e NB_GID="$(id -g)" \
  --gpus all \
  -p 8888:8888 -e NEURODESKTOP_VERSION=2024-01-12 \
  vnmd/neurodesktop:2024-01-12

to check if it worked, run nvidia-smi in the desktop container afterwards
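
e.g. from the host, once the container is up (the container name comes from the run command above):

sudo docker exec -it neurodesktop nvidia-smi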

that would be annoying if it needs a GPU to compile. We do not have the ability to run a GPU node for building containers.

vennand commented 7 months ago

We might just be limited to version 4 for now then. As far as I can tell, version 5 is still in beta, so it might not be advisable to use it for research anyway.

I tried running your command, but I get the following error message:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled

Didn't find anything relevant with a very quick Google. Any idea what could cause this?

Also, I won't be able to touch this until the 16th of April unfortunately, but I plan on getting back to it.

stebo85 commented 7 months ago

Dear @vennand

Did you install the nvidia-container-toolkit beforehand?

#RHEL/CentOS (yum-based)
sudo yum install nvidia-container-toolkit -y
#Ubuntu/Debian (apt-based)
sudo apt install nvidia-container-toolkit -y

vennand commented 7 months ago

I had not, but I get the same error after installing it

stebo85 commented 7 months ago

what are you getting when you run nvidia-smi on your host system?

vennand commented 7 months ago

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:01:00.0 Off |                  Off |
| N/A   18C    P8               9W / 250W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1536      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

stebo85 commented 7 months ago

can you try this? https://www.howtogeek.com/devops/how-to-use-an-nvidia-gpu-with-docker-containers/

It needs a restart of the docker daemon and potentially apt-get install -y nvidia-docker2
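
On an apt-based system the sequence would be roughly this (untested sketch):

sudo apt-get install -y nvidia-docker2
# register the NVIDIA runtime with docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker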

vennand commented 7 months ago

I've installed nvidia-docker2, but I also ran this: sudo nvidia-ctk runtime configure --runtime=docker

I don't know which one worked, but it worked. I'll try to test it now, but I don't know if I'll have time

vennand commented 5 months ago

Hi @stebo85,

I've finished testing. Relion works as intended, but none of the jobs showed up when running "nvidia-smi", even though we could see the GPU being used. Not sure if that's an issue with the GPU passthrough, but it is using the GPU.

Another important issue is that one of the third-party programs I install along with relion doesn't work. Basically, CTFFIND 4.1.14 fails if it's compiled with GCC 8 or above. The fix I've found is to modify the code, which doesn't seem practical or elegant to do in the neurodesk script. What would be the best approach around this? Should I host a "fixed" copy of the code on my own GitHub? (though I'm not sure the license agreement allows this)

stebo85 commented 5 months ago

Dear @vennand, which command did you use for testing the GPUs? I have seen a similar behaviour once using the old flag. Can you try with --gpus all? Another check: what comes up when you run which nvidia-smi?

Fixing software for a container is a tricky one. I have done various things in the past depending on the project:

1) apply a sed command that fixes a few single lines in the neurocontainer buildscript - would that work for you?
2) provide a fixed source-code file in the neurocontainers repository along with the build script and copy it into the container during build to overwrite the upstream file
3) fork the software, fix it there, use the fix inside the container, and provide the fix upstream in the hope they merge it
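
For option 1, the buildscript line could look something like this (purely illustrative - the file name and the pattern are placeholders, not the actual CTFFIND fix):

# hypothetical example: patch the offending line in the CTFFIND source before compiling
sed -i 's/problematic_code/fixed_code/' ctffind-4.1.14/src/some_file.cpp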

vennand commented 5 months ago

@stebo85

To test the GPU, I simply watched nvidia-smi (watch -n 1 nvidia-smi) while running relion. Relion launches Python scripts that normally show up there. They didn't in the VM, but they were listed on the main machine (the one I'm running neurodesk from).

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:01:00.0 Off |                  Off |
| N/A   31C    P0              74W / 250W |  24256MiB / 24576MiB |     67%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1632      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A      7835      C   ...relion-4.0.1.sm61/bin/relion_refine    24250MiB |
+---------------------------------------------------------------------------------------+

I don't know exactly how the code accesses the GPU, but I can probably find out if that's relevant.

When I run which nvidia-smi I get /usr/bin/nvidia-smi

Regarding fixing the software, I think I'll go with option 2, since the source code is only 11MB. Do you want me to push the fix now, or should we investigate the GPU "issue" before?

stebo85 commented 5 months ago

Interesting. I don't know what causes this behaviour, but I guess if it works it works no matter where the GPU tasks show up.

Happy for you to push the fix now :) Let's see if we can get this to work!

vennand commented 5 months ago

@stebo85 Would you know what this error means?

$ bash build.sh -ds
Entering Debug mode
WARNING: Skipping neurodocker as it is not installed.
Defaulting to user installation because normal site-packages is not writeable
Collecting https://github.com/ReproNim/neurodocker/tarball/master
  Downloading https://github.com/ReproNim/neurodocker/tarball/master
     - 77.3 kB 10.0 MB/s 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [36 lines of output]
      /tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/setuptools_scm/_integration/setuptools.py:31: RuntimeWarning:
      ERROR: setuptools==59.6.0 is used in combination with setuptools_scm>=8.x

      Your build configuration is incomplete and previously worked by accident!
      setuptools_scm requires setuptools>=61

      Suggested workaround if applicable:
       - migrating from the deprecated setup_requires mechanism to pep517/518
         and using a pyproject.toml to declare build dependencies
         which are reliably pre-installed before running the build tools

        warnings.warn(
      Traceback (most recent call last):
        File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
          main()
        File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 164, in prepare_metadata_for_build_wheel
          return hook(metadata_directory, config_settings)
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/build.py", line 112, in prepare_metadata_for_build_wheel
          directory = os.path.join(metadata_directory, f'{builder.artifact_project_id}.dist-info')
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/builders/wheel.py", line 825, in artifact_project_id
          self.project_id
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/builders/plugin/interface.py", line 374, in project_id
          self.__project_id = f'{self.normalize_file_name_component(self.metadata.core.name)}-{self.metadata.version}'
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/metadata/core.py", line 149, in version
          self._version = self._get_version()
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/metadata/core.py", line 248, in _get_version
          version = self.hatch.version.cached
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/metadata/core.py", line 1466, in cached
          raise type(e)(message) from None
      LookupError: Error getting the version from source `vcs`: setuptools-scm was unable to detect version for /tmp/pip-req-build-wu94yd8o.

      Make sure you're either building from a fully intact git repository or PyPI tarballs. Most other sources (such as GitHub's tarballs, a git checkout without the .git folder) don't contain the necessary metadata and will not work.

      For example, if you're using pip, instead of https://github.com/user/proj/archive/master.zip use git+https://github.com/user/proj.git#egg=proj
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

stebo85 commented 5 months ago

Yes, you need to update the GitHub URL of neurodocker:

For example, if you're using pip, instead of https://github.com/user/proj/archive/master.zip use git+https://github.com/user/proj.git#egg=proj
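
i.e. the pip call in the build script should install neurodocker from the git URL, something like:

pip install git+https://github.com/ReproNim/neurodocker.git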

Thank you

Steffen

vennand commented 4 months ago

@stebo85

Hey, I'm back working on this. I'll start implementing the other software soon.

But first, I tested this version of relion on our other GPUs, and it runs without issues. Perhaps the default setting (sm35) is too old, but this one works. I'm thinking it would be simpler for users to only package this one. If you think this could be a good idea, how do we go about this? Only put this one in the JSON, with Exec: relion?

stebo85 commented 4 months ago

Great to hear that Relion is working :)

OK, makes sense that the newer version works better. CUDA is usually quite backwards compatible, so with fairly current driver versions that's expected.

Yes, put the version you found working best in the apps.json and this will trigger the release process.
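
For example, an entry along these lines (just a sketch - copy an existing entry in apps.json to get the exact field names and values right):

"relion": {
    "apps": {
        "relion 4.0.1.sm61": {
            "version": "4.0.1.sm61",
            "exec": "relion"
        }
    },
    "categories": ["cryo-EM"]
}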

Thank you for getting this to work!!!