Cam-Wheeler closed this issue 5 months ago.
Hi @Cam-Wheeler, probably my favorite issue so far. Many thanks for the elaborate steps - they really help me understand the issue.
At first glance, it seems that the nnUNet paths are not set, but then you mentioned that the output_dir inside the MOOSE-xxx folder contains the JSON files that are only created when prediction is actually initiated. So either the process is being killed, or the way we set up the nnUNet paths via os.environ is not the right approach for Docker-based runs - but that doesn't explain the JSONs being created.
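(A quick way to probe this hypothesis - a sketch, not something from moosez itself - is to open a shell inside the running container and check whether the nnUNet path variables exist in the shell environment at all; they would not show up here if moosez only sets them inside its own Python process via os.environ:)
# Run inside the pod/container; prints the fallback message if no nnUNet_*
# variables are exported into this environment.
env | grep -i nnunet || echo "no nnUNet_* variables set in this environment"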
So I am not sure what exactly is happening. We also haven't tested moosev2 with Docker, which should have been done - maybe I'll take this opportunity to do that.
I have a follow-up question: do you see the GPU being utilized when you run the Docker image? This would help me get an idea. Also, can you share the Dockerfile you used to create the image, so that I can reproduce the error on my server, along with the docker run command you used to run it?
Don't worry - we will get this going asap so that you have no hassles with your dissertation! :)
Cheers, Lalith
@mprires @Keyn34 adding you both since this is interesting 🤨
Hi @Cam-Wheeler, can you navigate to the environment where you installed moosez, search for a models folder there, and post a screenshot of its contents, please?
It says Could not find a dataset with the ID 333, so it might be that Docker does not allow the connection to our AWS repository for the models.
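(One quick way to test the connectivity hypothesis from inside the container - a sketch that assumes the models are served from the same enhance-pet S3 bucket as the prebuilt Docker image linked later in this thread:)
# An HTTP status line (even 403) means the S3 endpoint is reachable from the
# container; a timeout or resolution error points to blocked outbound traffic.
curl -sI https://enhance-pet.s3.eu-central-1.amazonaws.com/ | head -n 1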
Hello @LalithShiyam!! Thanks for the response, it's much appreciated!
Okay, here is the Dockerfile:
# Install base image and set shell.
FROM --platform=linux/amd64 nvcr.io/nvidia/cuda:12.0.0-cudnn8-devel-ubuntu22.04
SHELL ["/bin/bash", "-c"]
#### Conda Env ####
# Download and install Miniconda3.
RUN apt update && apt upgrade -y
RUN apt install -y tree wget curl git
RUN mkdir -p /opt/conda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /opt/conda/miniconda.sh
RUN chmod +x /opt/conda/miniconda.sh
RUN bash /opt/conda/miniconda.sh -b -u -p /opt/miniconda3
RUN rm -rf /opt/conda/miniconda.sh
# Update our path so that we can access conda.
ENV PATH="/opt/miniconda3/bin:$PATH"
RUN conda init bash
RUN conda update -n base -c defaults conda -y
RUN pip install --upgrade pip
#### Dependencies ####
# Conda packages
RUN conda install python=3.10 -y
RUN conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
RUN conda install numpy pandas matplotlib -y
# Pip only packages
RUN pip install moosez pydicom
#### Source Code ####
# Set the working directory.
WORKDIR /app
COPY runners /app/runners
COPY src /app/src
COPY main.py /app/main.py
# Running the script from K8s not Docker so no command here.
Just a note regarding the source code commands: I converted one of my DICOM folders into a NIfTI file (making sure to add the CT_ prefix to the start), changed the input_dir target to that specific directory, and attempted to run MOOSE in lib mode to see if I get the same issue - and I do. When using batch mode, these are not used at all :)
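(For context, a sketch of that conversion step using dcm2niix as one common converter - the thread does not say which tool was actually used, and the input path and final filename below are hypothetical; only the CT_ prefix convention comes from the description above:)
# Convert one DICOM series to a compressed NIfTI file, then rename it so the
# filename starts with the CT_ prefix mentioned above.
dcm2niix -z y -o /data/Pilot/S1 /path/to/dicom/S1        # hypothetical paths
mv /data/Pilot/S1/*.nii.gz /data/Pilot/S1/CT_S1.nii.gz   # hypothetical final name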
On the K8s side, the command I am using when the Docker container starts up is:
K8s stuff here
- command: ["/bin/bash"]
args: ["-c moosez -d /data/Pilot -m clin_ct_lungs"]
More K8s stuff here
So if not running on K8s, this would instead need to be added at the bottom of the Dockerfile: CMD ["/bin/bash", "-c", "moosez -d /data/Pilot -m clin_ct_lungs"]
With regards to GPU usage - apologies, I do not know the "proper" way to monitor it, so I simply connected a second terminal to the pod and kept re-running nvidia-smi as the process ran (we do get some usage, although very small):
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:06:00.0 Off | 0 |
| N/A 35C P0 83W / 400W | 7MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
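(For future runs, a slightly more hands-off sketch than re-running nvidia-smi by hand, using standard nvidia-smi query flags:)
# Print GPU utilisation and memory once per second while MOOSE runs.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1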
@Keyn34 thanks for hopping on and helping as well, here is the screenshot of the env in the pod running on the cluster!
Command to search for a models directory: find / -type d -name "models" 2>/dev/null
the results are:
/opt/miniconda3/pkgs/pip-24.0-py312h06a4308_0/lib/python3.12/site-packages/pip/_internal/models
/opt/miniconda3/pkgs/conda-24.4.0-py312h06a4308_0/lib/python3.12/site-packages/conda/models
/opt/miniconda3/pkgs/qt-main-5.15.2-h53bd1ea_10/share/qt/3rd_party_licenses/qtquick3d/src/3rdparty/assimp/src/test/models
/opt/miniconda3/pkgs/torchvision-0.18.0-py310_cu118/lib/python3.10/site-packages/torchvision/models
/opt/miniconda3/pkgs/torchaudio-2.3.0-py310_cu118/info/test/test/torchaudio_unittest/models
/opt/miniconda3/pkgs/torchaudio-2.3.0-py310_cu118/lib/python3.10/site-packages/torchaudio/models
/opt/miniconda3/pkgs/torchaudio-2.3.0-py310_cu118/lib/python3.10/site-packages/torchaudio/prototype/models
/opt/miniconda3/pkgs/pip-24.0-py310h06a4308_0/lib/python3.10/site-packages/pip/_internal/models
/opt/miniconda3/pkgs/conda-24.5.0-py310h06a4308_0/lib/python3.10/site-packages/conda/models
/opt/miniconda3/pkgs/conda-24.5.0-py312h06a4308_0/lib/python3.12/site-packages/conda/models
/opt/miniconda3/lib/python3.10/site-packages/conda/models
/opt/miniconda3/lib/python3.10/site-packages/pip/_internal/models
/opt/miniconda3/lib/python3.10/site-packages/torchvision/models
/opt/miniconda3/lib/python3.10/site-packages/torchaudio/models
/opt/miniconda3/lib/python3.10/site-packages/torchaudio/prototype/models
/opt/miniconda3/share/qt/3rd_party_licenses/qtquick3d/src/3rdparty/assimp/src/test/models
/opt/miniconda3/models
I hope that's what you were after?
With regards to testing, if you need any assistance I would be more than happy to help out!
Almost @Cam-Wheeler, I would love to see the contents of the models folder. :D I see that there is /opt/miniconda3/models. Can you let me know what is inside of it?
Ah, sorry about that! The content of that directory is nnunet_trained_models, which in turn contains Dataset123_Organs/nnUNetTrainer_2000epochs_NoMirroring__nnUNetPlans__3d_fullres, which has:
dataset.json dataset_fingerprint.json fold_all plans.json
Apologies if that is a pain to read, here is the photo!
Inside fold_all we have:
checkpoint_final.pth debug.json progress.png validation
@Cam-Wheeler good stuff! Seems like the model is there. So there are two possibilities: either the GPU is not being triggered, or something else is going on.
Please let me know the following: whether you see any GPU usage while MOOSE runs - you can use watch nvidia-smi to automatically refresh the GPU usage.
In the meantime, I created a moosev2 Docker image which works on our server; I will ask @mprires to upload it to AWS. I will drop the installation instructions here - let me know if you can run a prebuilt image without issues. I think this shouldn't be a problem: we used to run moosev0.1 like that, and we only ditched Docker once we made moosez a package. But I can see that Docker might still be needed, and you will hopefully be the first one to test it :)
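(Before testing the prebuilt image, one quick sanity check - a sketch that simply reuses the base image from the Dockerfile above - is to confirm that docker run --gpus all can see the GPU at all:)
# If this prints the A100, the Docker/NVIDIA runtime side is fine and the
# problem is more likely in how Kueue/K8s hands the GPU to the moosez container.
docker run --gpus all --rm nvcr.io/nvidia/cuda:12.0.0-cudnn8-devel-ubuntu22.04 nvidia-smi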
@LalithShiyam I have a feeling you're onto something with the GPU not getting activated. When using the watch nvidia-smi command, it holds still at 7 MiB and does not shift while MOOSE is doing its thing. With regards to the docker run command, I actually do not use one: we operate a Kueue system on the computing cluster, so it's a sort of set-it-and-forget-it situation and I believe K8s handles that for me. Apologies if that's not very helpful!
Thank you for making the image - I look forward to being the first person to play around with it once it's up and working :) you guys are doing some awesome stuff with MOOSE!!!
Many thanks for the kind words - the Docker image is in AWS now.
Please do the following and let us know if it works. It's our first attempt at shipping this outside, so please bear with us.
mkdir moose_dckr
cd moose_dckr
wget "https://enhance-pet.s3.eu-central-1.amazonaws.com/moose/moosezv2_290524_dckr_img.tar"
docker load < moosezv2_290524_dckr_img.tar
docker run --gpus all --rm --ipc=host -v '/home/mz/Documents/Projects/Lalith/MACOSX/Data/Test/MOOSE':'/data' moosez:latest /data/Aarthi clin_ct_lungs
In our case the data to be MOOSE'd is in /home/mz/Documents/Projects/Lalith/MACOSX/Data/Test/MOOSE/Aarthi, and Aarthi is the directory that holds multiple subjects, so you mount /home/mz/Documents/Projects/Lalith/MACOSX/Data/Test/MOOSE/ to the /data folder inside the Docker image and then pass the actual folder to be MOOSE'd, /data/Aarthi, during the docker run.
In the meantime, I will figure out how to run this with Kueue. Usually these are the flags that need to be in the docker run command for the GPUs to be utilised: --gpus all --rm --ipc=host.
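(Translated to the directory names used earlier in this thread - a sketch only, since the host-side location of the Pilot folder on your cluster is an assumption:)
# Mount the parent of the Pilot folder at /data and point the image at /data/Pilot,
# mirroring the Aarthi example above; replace /path/on/host with wherever the PVC
# or local copy of the data actually lives.
docker run --gpus all --rm --ipc=host -v '/path/on/host':'/data' moosez:latest /data/Pilot clin_ct_lungs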
Cheers, Lalith
OK, I just GPT'd it. So for K8s/Kueue, it seems we need to modify the config file for the current image. Is that under your jurisdiction - can you even do that?
Hello, sorry for the late reply - I was just heading home from uni. Is this the config file on the cluster? If so, this is something I can bring up with the administrators (although they may take a day or two to respond). What details should I be asking about / what details need to change? :)
I also have access to a SLURM cluster that I can use in the meantime (although not as GPU-chunky) - would that make life easier while the admin team respond?
Since I haven't used K8s myself and am not smart enough, I asked GPT-4o and it said this:
If the user is using a Kubernetes cluster with a queueing system like Kueue for job scheduling and resource management, the docker run command would be replaced with a Kubernetes job specification. Here's how you can set it up:
Create a Kubernetes job YAML file to define the job. Below is an example of how the job specification might look:
apiVersion: batch/v1
kind: Job
metadata:
  name: moosez-job
spec:
  template:
    spec:
      containers:
      - name: moosez
        image: moosez:latest
        command: ["moosez"]
        args: ["-d", "/data/path_to_data", "-m", "model_name"]
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: data-volume
          mountPath: /data
      restartPolicy: Never
      volumes:
      - name: data-volume
        hostPath:
          path: /path/to/your/local/data
  backoffLimit: 4
Once you have created the job YAML file (e.g., moosez-job.yaml), you can apply it to your Kubernetes cluster:
kubectl apply -f moosez-job.yaml
You can monitor the job using the following command:
kubectl get jobs
To see the logs of the job, use:
kubectl logs job/moosez-job
If you are using a more advanced queueing system like Kueue, you might need to integrate with its specific APIs or CLI commands to submit and manage jobs. The Kubernetes job specification might need to be adjusted according to the requirements of Kueue.
Here's a general approach: define the job as above, then submit and manage it with standard kubectl commands. For a Kubernetes setup with a queueing system, you would replace the docker run command with a Kubernetes job specification, apply the job using kubectl, and monitor it using standard Kubernetes tools or the tools provided by the queueing system, like Kueue. This approach allows you to leverage Kubernetes' scheduling and resource management capabilities.
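(A hedged addendum on the Kueue side, based on Kueue's documented workflow rather than anything tested in this thread: Kueue admits Jobs that carry a queue-name label in their metadata and tracks them through its own Workload objects. The checks below assume Kueue is installed and that a LocalQueue you may submit to exists; the queue name itself is a placeholder:)
# List the LocalQueues you can submit to; the Job YAML above would then need a
# metadata label such as kueue.x-k8s.io/queue-name: <one-of-these-queues>.
kubectl get localqueues
# Check whether Kueue has admitted the job's workload, then follow the moosez logs.
kubectl get workloads
kubectl logs job/moosez-job -f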
@Cam-Wheeler sure, give it a shot - honestly we don't need more than 8-12 GB of GPU memory, so it should work fine. I tested the Docker image on two different systems, a server and a workstation, and it works well. Keep me posted whether it works on your system - I can then create an addendum to the README.
lol, that's exactly what GPT cooked up for me when I asked it to K8s-ify the process! Based on my current setup in K8s it shouldn't need much of a change. I shall get logged in now, do some editing, and submit the job. Once I get the details, I will come back and let y'all know!
Thank you so much again 🫎.
Wonderful - just be careful with the paths! Happy to get this sorted - so that you can finish your dissertation without trauma :D
Hello @LalithShiyam, me again!
Okay, so I reckon it is an issue with my cluster (or perhaps with how env variables work within Docker). After transferring to a SLURM cluster, the same conda env that is not working on the K8s + Docker cluster works perfectly! With regards to the MOOSE Docker image, I sadly get the same result as before: nnUNet doesn't seem to be able to find the models. I also ran the command directly to see what happened.
root@s2562095-ip-pilot-2nvzq:/app# nnUNetv2_predict -i /data/Pilot/S1/moosez-clin_ct_lungs-2024-05-30-06-26-16/CT/temp -o /data/Pilot/S1/moosez-clin_ct_lungs-2024-05-30-06-26-16/segmentations -d 333 -c 3d_fullres -f all -tr nnUNetTrainer_2000epochs_NoMirroring --disable_tta -device cuda
nnUNet_raw is not defined and nnU-Net can only be used on data for which preprocessed files are already present on your system. nnU-Net cannot be used for experiment planning and preprocessing like this. If this is not intended, please read documentation/setting_up_paths.md for information on how to set this up properly.
nnUNet_preprocessed is not defined and nnU-Net can not be used for preprocessing or training. If this is not intended, please read documentation/setting_up_paths.md for information on how to set this up.
nnUNet_results is not defined and nnU-Net cannot be used for training or inference. If this is not intended behavior, please read documentation/setting_up_paths.md for information on how to set this up.
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
Traceback (most recent call last):
File "/usr/local/bin/nnUNetv2_predict", line 8, in <module>
sys.exit(predict_entry_point())
File "/usr/local/lib/python3.10/site-packages/nnunetv2/inference/predict_from_raw_data.py", line 828, in predict_entry_point
model_folder = get_output_folder(args.d, args.tr, args.p, args.c)
File "/usr/local/lib/python3.10/site-packages/nnunetv2/utilities/file_path_utilities.py", line 22, in get_output_folder
tmp = join(nnUNet_results, maybe_convert_to_dataset_name(dataset_name_or_id),
File "/usr/local/lib/python3.10/site-packages/nnunetv2/utilities/dataset_name_id_conversion.py", line 74, in maybe_convert_to_dataset_name
return convert_id_to_dataset_name(dataset_name_or_id)
File "/usr/local/lib/python3.10/site-packages/nnunetv2/utilities/dataset_name_id_conversion.py", line 48, in convert_id_to_dataset_name
raise RuntimeError(f"Could not find a dataset with the ID {dataset_id}. Make sure the requested dataset ID "
RuntimeError: Could not find a dataset with the ID 333. Make sure the requested dataset ID exists and that nnU-Net knows where raw and preprocessed data are located (see Documentation - Installation). Here are your currently defined folders:
nnUNet_preprocessed=None
nnUNet_results=None
nnUNet_raw=None
If something is not right, adapt your environment variables.
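(For reference, a sketch of what "adapt your environment variables" could look like here; where nnUNet_results should point inside this particular image is an assumption - the earlier find command located a nnunet_trained_models folder under /opt/miniconda3/models in the self-built image, but the path may differ in the prebuilt one:)
# Only nnUNet_results matters for inference; the other two are exported purely
# to silence the warnings. Export these in the same shell (or bake them into the
# image / pod spec), then re-run the nnUNetv2_predict command shown above.
export nnUNet_results="/path/to/nnunet_trained_models"   # placeholder path
export nnUNet_raw="$nnUNet_results"
export nnUNet_preprocessed="$nnUNet_results"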
I am keen to sort this out just in case it comes in handy at some point, so I shall get everything running on the SLURM cluster; then I will take a dive into the Docker + K8s issues and see what I can find while I wait for MOOSE to do its thing :) If I get it working, I'll be sure to pass the message back to y'all - it might not even be an issue on your side and could be entirely down to administration issues on mine. Either way, I'll come back once it's solved so you guys know for the future!
Hi @Cam-Wheeler, fantastic - at least SLURM works, phew! I agree with your thoughts on K8s + Docker. If you ever crack it, let me know, or feel free to make a PR with the instructions for the README. Would love to have you as a collaborator! Keep up the good work and let me know if you need anything - happy to help.
Can I close this issue since it's working on SLURM?
@LalithShiyam Thank you very much - if I get it sorted, I'll get back to you. Yes, you may close it now! Thanks again to you and @Keyn34 for all the help and effort on your end, it's really appreciated! 🫎
@Cam-Wheeler it makes things easier when the issue is so clear, elaborate, and helpful - cheers and thanks for taking the time!
Hello :)
I am trying to use MOOSE for my dissertation and I have run into a problem when using batch mode that I was hoping I could get some help with. My understanding from the docs is that batch mode should be okay with DICOM images (which is what I have), but if I wanted to use MOOSE in my own scripts I would need to convert to NIfTI files first (I haven't done this yet; I wanted to explore batch mode first). However, when it comes to post-processing, I get an IndexError (stating that the list index is out of range). It appears as though, when running the nnUNetv2 command, it does not save to my output directory, so when we go to collect the saved output, it's not there! I was wondering if you had come across this issue before and might be able to point me in the right direction on where to look!
After some digging, it appears as though nnUNetv2 has some issues:
To Reproduce: I have tried to replicate the directory structure as closely as possible:
I am running MOOSE with moosez -d /data/Pilot/ -m clin_ct_organs, where /data/Pilot/ is the root of all the patient data, so S1 is in there for my pilot testing, but I will have S1, S2, ... S27 for my full dataset. I am working with total-body PET/CT images.
Additional context: I am running the script within a Docker container on my university cluster using K8s. I have a pod running the container with a PVC mounted onto the pod. I have checked the write permissions - all is okay there! I am using a conda env with Python 3.10.14 and running MOOSE version 2.4.10. I am using an NVIDIA A100 with 40 GB of memory, so I should be okay memory-wise.
When running, all of the directories that MOOSE creates for us are created and they are filled with the .nii.gz images (just one image in my pilot case, but there will be more later!). I have also listed the output_dir after the command is run:
Output directory after the command is run: ['plans.json', 'dataset.json', 'predict_from_raw_data_args.json']
After reading #78, have I messed up the formats?
Edit: After reading some of the other issues, I have added more info :)