ENHANCE-PET / MOOSE

MOOSE (Multi-organ objective segmentation) is a data-centric AI solution that generates multilabel organ segmentations to facilitate systemic TB whole-person research. The pipeline is based on nnU-Net and can segment 120 unique tissue classes from a whole-body 18F-FDG PET/CT image.
https://enhance.pet
GNU General Public License v3.0

BUG: nnUnetv2 Command Not Saving to `output_dir` #125

Closed: Cam-Wheeler closed this issue 5 months ago

Cam-Wheeler commented 5 months ago

Hello :)

I am trying to use MOOSE for my dissertation and I have run into a problem when using batch mode that I was hoping I could get some help with. My understanding from the docs is that batch mode should be okay with DICOM images (what I have) but if I wanted to use MOOSE in my own scripts I would need to convert to NIFTI files first (not done this yet, wanted to explore batch mode first). However, when it comes to post-processing, I get an IndexError (stating that the list index is out of range). It appears as though when running the nnUnetv2 command, it does not save to my output directory, so when we go to collect the saved output, it's not there! I was wondering if you had come across this issue before and might be able to point me in the right direction on where to look!

Traceback (most recent call last):
  File "/opt/miniconda3/bin/moosez", line 8, in <module>
    sys.exit(main())
  File "/opt/miniconda3/lib/python3.10/site-packages/moosez/moosez.py", line 199, in main
    predict.predict(model_name, input_dir, output_dir, accelerator)
  File "/opt/miniconda3/lib/python3.10/site-packages/moosez/predict.py", line 82, in predict
    postprocess(original_image_files[0], output_dir, model_name)
  File "/opt/miniconda3/lib/python3.10/site-packages/moosez/predict.py", line 137, in postprocess
    predicted_image = file_utilities.get_files(output_dir, '.nii.gz')[0]
IndexError: list index out of range

After some digging, it appears as though nnUNetv2 has some issues:

Traceback (most recent call last):
  File "/opt/miniconda3/bin/nnUNetv2_predict", line 8, in <module>
    sys.exit(predict_entry_point())
  File "/opt/miniconda3/lib/python3.10/site-packages/nnunetv2/inference/predict_from_raw_data.py", line 828, in predict_entry_point
    model_folder = get_output_folder(args.d, args.tr, args.p, args.c)
  File "/opt/miniconda3/lib/python3.10/site-packages/nnunetv2/utilities/file_path_utilities.py", line 22, in get_output_folder
    tmp = join(nnUNet_results, maybe_convert_to_dataset_name(dataset_name_or_id),
  File "/opt/miniconda3/lib/python3.10/site-packages/nnunetv2/utilities/dataset_name_id_conversion.py", line 74, in maybe_convert_to_dataset_name
    return convert_id_to_dataset_name(dataset_name_or_id)
  File "/opt/miniconda3/lib/python3.10/site-packages/nnunetv2/utilities/dataset_name_id_conversion.py", line 48, in convert_id_to_dataset_name
    raise RuntimeError(f"Could not find a dataset with the ID {dataset_id}. Make sure the requested dataset ID "
RuntimeError: Could not find a dataset with the ID 333. Make sure the requested dataset ID exists and that nnU-Net knows where raw and preprocessed data are located (see Documentation - Installation). Here are your currently defined folders:
nnUNet_preprocessed=None
nnUNet_results=None
nnUNet_raw=None
If something is not right, adapt your environment variables.
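
For reference, nnU-Net resolves those three folders from environment variables with exactly the names shown in the error. A quick diagnostic sketch (plain Python, nothing MOOSE-specific assumed) to see what a Python process inside the container actually reports:

import os

# The three variables named in the error above; for inference it is mainly
# nnUNet_results that matters, as it must point at the trained-model folder.
for var in ("nnUNet_raw", "nnUNet_preprocessed", "nnUNet_results"):
    print(f"{var} = {os.environ.get(var)}")  # prints None when the variable is unset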

To Reproduce: I have tried to replicate the directory structure as closely as possible:

.
`-- S1
    |-- AC-CT
    |   |-- 1-001.dcm
    |   |-- 1-002.dcm
    |   |-- 1-003.dcm
    |   |-- (more DICOM here)
    |-- AC-PT
    |   |-- 1-001.dcm
    |   |-- 1-002.dcm
    |   |-- 1-003.dcm
    |   |-- (more DICOM here) 

I am running MOOSE with moosez -d /data/Pilot/ -m clin_ct_organs, where /data/Pilot/ is the root of all the patient data, so S1 is in there for my pilot testing. But I will have S1, S2, ... S27 for my full dataset. I am working with total-body PET/CT images.

Additional context: I am running the script within a docker container on my university cluster using K8s. I have a Pod running the container with a PVC mounted onto the pod. I have checked the write permissions, and all is okay there! I am using a conda env with Python 3.10.14 and running MOOSE version 2.4.10. I am using an NVIDIA A100 with 40 GB of memory, so I should be okay memory-wise.

When running, all of the directories that MOOSE creates for us are created and filled with the nii.gz images (just one image in my pilot case, but there will be more later!). I have also listed the contents of output_dir after the command is run: ['plans.json', 'dataset.json', 'predict_from_raw_data_args.json'].

After reading #78 have I messed up the formats?

Edit: After reading some of the other issues, I have added more info :)

LalithShiyam commented 5 months ago

Hi @Cam-Wheeler, probably my most favorite issue so far. Many thanks for the elaborate steps - really helps me understand the issue.

At first glance, it seems that the nnUNet paths are not set, but then you mentioned that output_dir inside the MOOSE-xxx folder has the JSON files that are only created once the prediction is initiated. Either the process is killed, or the way we set up paths for nnUNet via os.environ is not the right way for Docker-based runs. But that doesn't explain the JSONs being created.
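
To illustrate the pattern being discussed (only a sketch of the general approach, not moosez's actual internals; the model path and the nnUNetv2_predict flags mirror what appears elsewhere in this thread, and the input/output folders are hypothetical):

import os
import subprocess

# Set the nnU-Net results folder in the *current* process, then spawn
# nnUNetv2_predict so the child inherits it. os.environ edits are only
# visible to this process and its children, never to a separately started shell.
os.environ["nnUNet_results"] = "/opt/miniconda3/models/nnunet_trained_models"  # hypothetical path

subprocess.run(
    [
        "nnUNetv2_predict",
        "-i", "/path/to/temp_ct",        # hypothetical input folder
        "-o", "/path/to/segmentations",  # hypothetical output folder
        "-d", "333",
        "-c", "3d_fullres",
        "-f", "all",
    ],
    check=True,  # raise if nnUNetv2_predict exits non-zero instead of failing silently
)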

So I am not sure what exactly is happening. We haven't tested moosev2 with Docker, which should have been done. Maybe I'll take this opportunity to do that.

I have a follow-up question: do you see the GPU being utilized when you run the docker image? This would help me get an idea. Also, can you share the Dockerfile you used to create the image, so that I can reproduce the error on my server, along with the docker run command you used to run the image?

Don't worry - we will get this going asap so that you have no hassles with your dissertation! :)

Cheers, Lalith

LalithShiyam commented 5 months ago

@mprires @Keyn34 adding you both since this is interesting 🤨

Keyn34 commented 5 months ago

Hi @Cam-Wheeler, can you navigate to the environment where you installed moosez, search for a models folder there, and post a screenshot of its contents, please?

It says, Could not find a dataset with the ID 333, so it might be that docker does not allow the connection to our AWS repository for the models.
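
One quick way to test that hypothesis from inside the container (a sketch only; it uses the public enhance-pet S3 host that appears later in this thread, and any object URL under that bucket would do):

import urllib.request

# Hypothetical connectivity probe: a successful HEAD request suggests the
# container can reach the enhance-pet S3 bucket, i.e. model downloads are
# not being blocked by networking.
url = "https://enhance-pet.s3.eu-central-1.amazonaws.com/moose/moosezv2_290524_dckr_img.tar"
request = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(request, timeout=10) as response:
    print(response.status, response.headers.get("Content-Length"))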

Cam-Wheeler commented 5 months ago

Hello @LalithShiyam!! Thanks for the response, it's much appreciated!

Okay, here is the Dockerfile:

# Install base image and set shell.
FROM --platform=linux/amd64 nvcr.io/nvidia/cuda:12.0.0-cudnn8-devel-ubuntu22.04
SHELL ["/bin/bash", "-c"]

#### Conda Env ####
# Download and install Miniconda3.
RUN apt update && apt upgrade -y
RUN apt install -y tree wget curl git
RUN mkdir -p /opt/conda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /opt/conda/miniconda.sh
RUN chmod +x /opt/conda/miniconda.sh
RUN bash /opt/conda/miniconda.sh -b -u -p /opt/miniconda3
RUN rm -rf /opt/conda/miniconda.sh

# Update our path so that we can access conda.
ENV PATH="/opt/miniconda3/bin:$PATH"
RUN conda init bash
RUN conda update -n base -c defaults conda -y
RUN pip install --upgrade pip

#### Dependencies ####
# Conda packages
RUN conda install python=3.10 -y
RUN conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
RUN conda install numpy pandas matplotlib -y

# Pip only packages
RUN pip install moosez pydicom

#### Source Code ####
# Set the working directory.
WORKDIR /app
COPY runners /app/runners 
COPY src /app/src
COPY main.py /app/main.py

# Running the script from K8s not Docker so no command here.

Just a note regarding the source code commands: I converted one of my DICOM folders into a NIfTI file (making sure to add the CT_ prefix), changed the input_dir target to that specific directory, and attempted to run MOOSE in library mode to see if I get the same issue, and I do. When using batch mode, these files are not used at all :)

On the K8s side, the command I am using when the docker container starts up is:

K8s stuff here

- command: ["/bin/bash"]
   args: ["-c moosez -d /data/Pilot -m clin_ct_lungs"]

More K8s stuff here

So if not running on K8s, this would need to be added at the bottom of the Dockerfile instead: CMD ["/bin/bash", "-c", "moosez -d /data/Pilot -m clin_ct_lungs"].

With regards to GPU usage, apologies, I do not know the "proper" way to monitor it, so I simply connected a second terminal to the pod and kept running nvidia-smi while the process ran (we do get some usage, although very small):

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:06:00.0 Off |                    0 |
| N/A   35C    P0              83W / 400W |      7MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
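
A simpler programmatic cross-check than polling nvidia-smi, assuming the same PyTorch install that moosez runs on (just a sketch, not an official MOOSE diagnostic):

import torch

# True and >= 1 mean the container can actually see the GPU that K8s allocated.
print(torch.cuda.is_available())
print(torch.cuda.device_count())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA A100-SXM4-40GB"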

@Keyn34 thanks for hopping on and helping as well! Here is what the environment in the pod running on the cluster looks like. The command to search for a models directory was find / -type d -name "models" 2>/dev/null, and the results are:

/opt/miniconda3/pkgs/pip-24.0-py312h06a4308_0/lib/python3.12/site-packages/pip/_internal/models
/opt/miniconda3/pkgs/conda-24.4.0-py312h06a4308_0/lib/python3.12/site-packages/conda/models
/opt/miniconda3/pkgs/qt-main-5.15.2-h53bd1ea_10/share/qt/3rd_party_licenses/qtquick3d/src/3rdparty/assimp/src/test/models
/opt/miniconda3/pkgs/torchvision-0.18.0-py310_cu118/lib/python3.10/site-packages/torchvision/models
/opt/miniconda3/pkgs/torchaudio-2.3.0-py310_cu118/info/test/test/torchaudio_unittest/models
/opt/miniconda3/pkgs/torchaudio-2.3.0-py310_cu118/lib/python3.10/site-packages/torchaudio/models
/opt/miniconda3/pkgs/torchaudio-2.3.0-py310_cu118/lib/python3.10/site-packages/torchaudio/prototype/models
/opt/miniconda3/pkgs/pip-24.0-py310h06a4308_0/lib/python3.10/site-packages/pip/_internal/models
/opt/miniconda3/pkgs/conda-24.5.0-py310h06a4308_0/lib/python3.10/site-packages/conda/models
/opt/miniconda3/pkgs/conda-24.5.0-py312h06a4308_0/lib/python3.12/site-packages/conda/models
/opt/miniconda3/lib/python3.10/site-packages/conda/models
/opt/miniconda3/lib/python3.10/site-packages/pip/_internal/models
/opt/miniconda3/lib/python3.10/site-packages/torchvision/models
/opt/miniconda3/lib/python3.10/site-packages/torchaudio/models
/opt/miniconda3/lib/python3.10/site-packages/torchaudio/prototype/models
/opt/miniconda3/share/qt/3rd_party_licenses/qtquick3d/src/3rdparty/assimp/src/test/models
/opt/miniconda3/models

I hope that's what you were after?

With regards to testing, if you need any assistance I would be more than happy to help out!

Keyn34 commented 5 months ago

I hope that's what you were after?

Almost @Cam-Wheeler, I would love to see the contents of the models folder. :D I see that there is /opt/miniconda3/models. Can you let me know what is inside of it?

Cam-Wheeler commented 5 months ago

Ah, sorry about that!!! The content of that directory is nnunet_trained_models, which in turn contains Dataset123_Organs/nnUNetTrainer_2000epochs_NoMirroring__nnUNetPlans__3d_fullres, which has:

dataset.json  dataset_fingerprint.json  fold_all  plans.json

Apologies if that is a pain to read, here is the photo!

(screenshot of the directory contents)

Inside of fold_all we have checkpoint_final.pth debug.json progress.png validation

LalithShiyam commented 5 months ago

@Cam-Wheeler good stuff! Seems like the model is there. So there are two things that could be going wrong: either the GPU is not being triggered, or it's something else.

Please let me know the following:

In the meantime, I created a moosev2 docker image, which works on our server; I will ask @mprires to upload it to AWS. I will drop the installation instructions here - let me know if you can run a prebuilt image without issues. I think this shouldn't be a problem. We used to run moosev0.1 like that and ditched Docker once we made moosez a package, but I can see that Docker might still be needed. And you will hopefully be the first one to test it :)

Cam-Wheeler commented 5 months ago

@LalithShiyam I have a feeling you're onto something with the GPU not getting activated. When using the watch nvidia-smi command, it holds still at 7MiB and does not shift while MOOSE is doing its thing. With regards to the docker run command, I actually do not use one. We operate a Kueue system for the computing cluster, so it's a sort of set-it-and-forget-it situation, and I believe K8s handles that for me - apologies if that's not very helpful!

Thank you for making the image, I look forward to being the first person to play around with it once it's up and working :) You guys are doing some awesome stuff with MOOSE!!!

LalithShiyam commented 5 months ago

Many thanks for the kind words, the docker image is on AWS now.

Please do the following and let us know if it works. It's our first attempt outside, so please bear with us.


mkdir moose_dckr
cd moose_dckr
wget "https://enhance-pet.s3.eu-central-1.amazonaws.com/moose/moosezv2_290524_dckr_img.tar"
docker load < moosezv2_290524_dckr_img.tar
docker run --gpus all --rm --ipc=host -v '/home/mz/Documents/Projects/Lalith/MACOSX/Data/Test/MOOSE':'/data' moosez:latest /data/Aarthi clin_ct_lungs

In our case, the data to be MOOSE'd is in /home/mz/Documents/Projects/Lalith/MACOSX/Data/Test/MOOSE/Aarthi, where Aarthi is the directory that contains multiple subjects. So you mount /home/mz/Documents/Projects/Lalith/MACOSX/Data/Test/MOOSE/ to the '/data' folder inside the docker image and then pass the actual folder to be MOOSE'd, /data/Aarthi, during the docker run.

In the meantime, I will figure out how to run this in Kueue. Usually these are the flags that need to be in the docker run command for the GPUs to be utilised: --gpus all --rm --ipc=host.

Cheers, Lalith

LalithShiyam commented 5 months ago

OK, I just GPT'd it. So for K8s with Kueue, it seems we need to modify the config file for the current image. Is that under your jurisdiction - can you even do that?

Cam-Wheeler commented 5 months ago

Hello, sorry for the late reply, I was just heading home from uni. Is this the config file on the cluster? This is something I can bring up with the administrators if that's the case (although they may take a day or two to respond). What details should I be asking about / what details need to change :)?

I also have access to a SLURM cluster that I can use in the meantime (although not as GPU-heavy) - would that make life easier while the admin team responds??

LalithShiyam commented 5 months ago

Since I haven't used K8s myself and am not that well versed in it, I asked GPT-4o and it said this:

If the user is using a Kubernetes cluster with a queueing system like Kueue for job scheduling and managing resources, the docker run command would be replaced with a Kubernetes job specification. Here’s how you can set it up:

1. Kubernetes Job Specification

Create a Kubernetes job YAML file to define the job. Below is an example of how the job specification might look:

apiVersion: batch/v1
kind: Job
metadata:
  name: moosez-job
spec:
  template:
    spec:
      containers:
      - name: moosez
        image: moosez:latest
        command: ["moosez"]
        args: ["-d", "/data/path_to_data", "-m", "model_name"]
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: data-volume
          mountPath: /data
      restartPolicy: Never
      volumes:
      - name: data-volume
        hostPath:
          path: /path/to/your/local/data
  backoffLimit: 4

2. Applying the Job

Once you have created the job YAML file (e.g., moosez-job.yaml), you can apply it to your Kubernetes cluster:

kubectl apply -f moosez-job.yaml

3. Monitoring the Job

You can monitor the job using the following command:

kubectl get jobs

To see the logs of the job, use:

kubectl logs job/moosez-job

Explanation of the YAML File

Using a Queueing System like Kueue

If you are using a more advanced queueing system like Kueue, you might need to integrate with its specific APIs or CLI commands to submit and manage jobs. The Kubernetes job specification might need to be adjusted according to the requirements of Kueue.

Here’s a general approach:

  1. Create a Job Specification: Similar to the above example.
  2. Submit the Job: Using Kueue's submission commands if they differ from standard kubectl commands.
  3. Monitor the Job: Using Kueue’s monitoring tools.

Summary

For a Kubernetes setup with a queueing system, you would replace the docker run command with a Kubernetes job specification, apply the job using kubectl, and monitor it using standard Kubernetes tools or tools provided by the queueing system like Kueue. This approach allows you to leverage Kubernetes' scheduling and resource management capabilities.

LalithShiyam commented 5 months ago

@Cam-Wheeler sure, give it a shot - honestly we don't need more than 8-12 GB of GPU memory, so it should work fine. I tested the docker image on 2 different systems, a server and a workstation - works well. Keep me posted; if it works on your system, I can create an addendum to the readMe.

Cam-Wheeler commented 5 months ago

lol that's exactly what GPT cooked up for me when I asked it to K8s the process! Based on my current setup in K8s, it shouldn't need much of a change. I shall get logged in now, do some editing, and submit the job. Once I get the details I will come back and let y'all know!

Thank you so much again 🫎.

LalithShiyam commented 5 months ago

Wonderful - just be careful with the paths! Happy to get this sorted - so that you can finish your dissertation without trauma :D

Cam-Wheeler commented 5 months ago

Hello @LalithShiyam, me again!

Okay, so I reckon it is an issue with my cluster (or perhaps how env variables work within Docker). After transferring to a SLURM cluster, the same conda env that is not working in the K8s + Docker cluster works perfectly! With regards to the MOOSE docker image, I sadly get the same result as before: nnUNet doesn't seem to be able to find the models. I also ran the command directly to see what happened.

root@s2562095-ip-pilot-2nvzq:/app# nnUNetv2_predict -i /data/Pilot/S1/moosez-clin_ct_lungs-2024-05-30-06-26-16/CT/temp -o /data/Pilot/S1/moosez-clin_ct_lungs-2024-05-30-06-26-16/segmentations -d 333 -c 3d_fullres -f all -tr nnUNetTrainer_2000epochs_NoMirroring --disable_tta -device cuda
nnUNet_raw is not defined and nnU-Net can only be used on data for which preprocessed files are already present on your system. nnU-Net cannot be used for experiment planning and preprocessing like this. If this is not intended, please read documentation/setting_up_paths.md for information on how to set this up properly.
nnUNet_preprocessed is not defined and nnU-Net can not be used for preprocessing or training. If this is not intended, please read documentation/setting_up_paths.md for information on how to set this up.
nnUNet_results is not defined and nnU-Net cannot be used for training or inference. If this is not intended behavior, please read documentation/setting_up_paths.md for information on how to set this up.

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

Traceback (most recent call last):
  File "/usr/local/bin/nnUNetv2_predict", line 8, in <module>
    sys.exit(predict_entry_point())
  File "/usr/local/lib/python3.10/site-packages/nnunetv2/inference/predict_from_raw_data.py", line 828, in predict_entry_point
    model_folder = get_output_folder(args.d, args.tr, args.p, args.c)
  File "/usr/local/lib/python3.10/site-packages/nnunetv2/utilities/file_path_utilities.py", line 22, in get_output_folder
    tmp = join(nnUNet_results, maybe_convert_to_dataset_name(dataset_name_or_id),
  File "/usr/local/lib/python3.10/site-packages/nnunetv2/utilities/dataset_name_id_conversion.py", line 74, in maybe_convert_to_dataset_name
    return convert_id_to_dataset_name(dataset_name_or_id)
  File "/usr/local/lib/python3.10/site-packages/nnunetv2/utilities/dataset_name_id_conversion.py", line 48, in convert_id_to_dataset_name
    raise RuntimeError(f"Could not find a dataset with the ID {dataset_id}. Make sure the requested dataset ID "
RuntimeError: Could not find a dataset with the ID 333. Make sure the requested dataset ID exists and that nnU-Net knows where raw and preprocessed data are located (see Documentation - Installation). Here are your currently defined folders:
nnUNet_preprocessed=None
nnUNet_results=None
nnUNet_raw=None
If something is not right, adapt your environment variables.
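
A small follow-up check worth running inside that container (purely a guess at a contributing factor, not a confirmed explanation): the convert_id_to_dataset_name call in the traceback searches the nnUNet_* folders for a directory named Dataset333_<something>, while the listing earlier in this thread only showed Dataset123_Organs. A sketch, with the path taken from that earlier conda environment (it may differ inside the prebuilt image):

import os

results = "/opt/miniconda3/models/nnunet_trained_models"  # path from the earlier find output
# List whichever Dataset* folders are actually present; clin_ct_lungs (dataset 333)
# would need a matching Dataset333_* folder once nnUNet_results points here.
print([d for d in os.listdir(results) if d.startswith("Dataset")])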

I am keen to sort this out just in case it comes in handy at some point, so I shall get everything running on the SLURM cluster; then I will take a dive into the issues with Docker + K8s and see what I can find while I wait for MOOSE to do its thing :). If I get it working, I'll be sure to pass the message back to y'all - it might not even be an issue on your side and could totally be due to administration issues on mine. Either way, I'll come back once it's solved so you guys know for the future!

LalithShiyam commented 5 months ago

Hi @Cam-Wheeler, fantastic, at least SLURM works - phew! I agree with your thoughts on K8s + Docker. If you ever crack it, let me know, or even feel free to make a PR with the instructions in the readMe. Would love to have you as a collaborator! Keep up the good work and let me know if you need something - happy to help.

Can I close this issue since it's working on slurm?

Cam-Wheeler commented 5 months ago

@LalithShiyam Thank you very much, if I get it sorted I'll get back to you. Yes, you may close it now! Thanks again to you and @Keyn34 for all the help and effort on your end, it's really appreciated! 🫎

LalithShiyam commented 5 months ago

@Cam-Wheeler it makes it easier when the issue created is so clear, elaborate, and helpful - cheers and thanks for taking the time!