MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.
Apache License 2.0

command not found #139

Closed sovanlal closed 1 year ago

sovanlal commented 1 year ago

Hi,

I was trying to run nnDetection. However, while running the command

nndet_prep 500 --full_check

it says, /bin/bash: nndet_prep: command not found.

I suspect it might be an issue with setting up the environment variables, which I set using

os.environ['det_data'] = os.path.join(nnDetDir, 'nnDet_raw_data')
os.environ['det_models'] = os.path.join(nnDetDir, 'nnDet_trained_models')
os.environ['OMP_NUM_THREADS'] = "1"

where "nnDet_raw_data" is the directory where the data (imagesTr, labelsTr, imagesTs and labelsTs) is located and 'nnDet_trained_models' is the directory where the trained models will be saved.

Could you please help me figure out what might be causing an issue?

mibaumgartner commented 1 year ago

Hi, the installation probably failed, please check the FAQ to rule out common mistakes and provide the installation log.
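A quick way to rule out a PATH problem is to check whether the console scripts were registered at all, e.g. (the package name may differ slightly depending on how you installed):

```bash
# Sanity check: the entry point should resolve to a path inside your environment
which nndet_prep
# and the package itself should show up as installed
pip show nndet
```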

Best, Michael

sovanlal commented 1 year ago

Hi Michael,

here is the system information i am getting after running scripts/utils.py

----- PyTorch Information -----
PyTorch Version: 1.11.0+cu113
PyTorch Debug: False
PyTorch CUDA: 11.3
PyTorch Backend cudnn: 8200
PyTorch CUDA Arch List: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
PyTorch Current Device Capability: (7, 5)
PyTorch CUDA available: True

----- System Information -----
System NVCC: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
System Arch List: None
System OMP_NUM_THREADS: 1
System CUDA_HOME is None: True
System CPU Count: 8
Python Version: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]

----- nnDetection Information -----
det_num_threads 6
det_data is set True
det_models is set True

Is there anything that looks suspicious? I am still getting the "command not found" error. Is there a particular directory that I need to run the command from?

mibaumgartner commented 1 year ago

nnDetection requires Python 3.8 or higher, could you try with that?
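If a fresh environment is an option, a rough sketch of the source install would be (the exact steps and any additional CUDA-related build flags are in the repository's installation instructions):

```bash
# Sketch of a source install in a fresh Python 3.8 environment;
# see the repository's installation docs for the authoritative steps
conda create -n nndet python=3.8 -y
conda activate nndet
git clone https://github.com/MIC-DKFZ/nnDetection.git
cd nnDetection
pip install -e .
```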

sovanlal commented 1 year ago

Hi Michael,

Thanks. I am currently using Google Cloud, which is unfortunately pre-built with Python 3.7, and I cannot change it. Does nnDetection not support Python 3.7 at all?

mibaumgartner commented 1 year ago

Some features, such as the walrus operator, are only supported under Python 3.8+, so 3.7 unfortunately won't work.
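Just as an illustrative check (not part of nnDetection), this one-liner parses on 3.8+ but raises a SyntaxError on 3.7:

```bash
# Uses the walrus operator, so it only runs on Python 3.8+
python -c "print(x := 42)"
```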

sovanlal commented 1 year ago

Hi Michael,

I was able to install it through Docker in a WSL-Ubuntu environment. However, while running, it was showing a CUDA out-of-memory error. My workstation has an NVIDIA Quadro GV100 (32 GB). I checked the training with nndet_train XXX -o augment_cfg.multiprocessing=False and it does not show the error, although the training is very slow. Attached is a screenshot of nndet_env. Also, I noticed that the low-resolution plan was triggered. Is there any way I can do the training with the low-resolution plan?

[screenshot: nndet_env output]

mibaumgartner commented 1 year ago

Dear @sovanlal ,

did you run the planning with the GV100 as well? If yes, it might make sense to increase the offset (linked below) significantly, e.g. to 18 GB or similar, so that the resulting plan would roughly correspond to a 2080 Ti plan. Those run pretty nicely, and you could scale up the batch size if you want to take full advantage of the GPU VRAM. (I need to run some additional testing with the offset, which is why it is not the default right now.)

https://github.com/MIC-DKFZ/nnDetection/blob/d637c5e2da16e0fe7cf8a5b860907eb57e60d4fe/nndet/planning/estimator.py#L49

sovanlal commented 1 year ago

I have made the necessary changes. However, the training gets stuck after some time. I am not sure what the exact reason is. I would appreciate any help.

[screenshot: training output where it gets stuck]
sovanlal commented 1 year ago

@mibaumgartner ,

Interestingly, the first two epochs ran successfully, but it got stuck at the 3rd epoch. The training time was really fast for the first two epochs (~15 min). Any clues?

[screenshot: training progress for the first epochs]
mibaumgartner commented 1 year ago

Can you check your VRAM consumption (this might be interesting regarding the quick training)?

Regarding getting stuck: please double-check that the training is not running out of memory (normal RAM); decreasing the number of threads would help with this. Also make sure to set the OMP_NUM_THREADS environment variable.
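A rough way to keep an eye on RAM/CPU during training, and to reduce the worker count (det_num_threads is the variable shown in your nndet_env output; the value below is just an example):

```bash
# Watch CPU and RAM usage in a second terminal while the training runs
watch -n 2 free -h    # or interactively: htop

# Example: reduce the number of worker threads before starting the training
export OMP_NUM_THREADS=1
export det_num_threads=2
```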

sovanlal commented 1 year ago

@mibaumgartner ,

The number of threads I am using is 6, although I tried with 12 and 24 (the maximum number of threads of my CPU) as well. OMP_NUM_THREADS=1, as shown below by nndet_env. I am running it in a Docker container, but the issue still persists.

[screenshot: nndet_env output inside the Docker container]

Not quite sure what is causing the problem.

mibaumgartner commented 1 year ago

Could you try to increase "--shm-size=24gb" to 36 or even higher, depending on how much memory is available? Also, please manually monitor CPU, GPU and RAM usage during the training. I think something exceeds the available resources and thus it crashes.
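For reference, the shared-memory size is set when the container is started; roughly like this (image name, tag and mount paths are placeholders for your actual setup):

```bash
# Illustrative docker invocation; adjust image name, tag and mounts to your setup
docker run --gpus all --shm-size=36gb \
    -v /path/to/det_data:/opt/data \
    -v /path/to/det_models:/opt/models \
    -it nndetection:latest /bin/bash
```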

I'm also wondering why your metrics are negative; this should only happen if not all classes are present during the online validation phase. Is it some kind of toy dataset?

sovanlal commented 1 year ago

@mibaumgartner ,

This might be a CUDA driver related issue that I am now trying to fix.

Also, for assigning classes I am following the instructions for creating the dataset and the label.json file. My dataset consists of 921 tumor segmentation files (in .nii.gz). So, I created the label.json file like below:

[screenshot: label.json contents]

My dataset does not have multiple tumors per patient, so it has only one instance (class 0) per patient. Is there anything wrong with this approach?

mibaumgartner commented 1 year ago

The label.json looks fine, did you enter more than one class in your dataset.json?

Alright, a CUDA driver problem might indeed result in crashes.

sovanlal commented 1 year ago

@mibaumgartner ,

my dataset has only one class, i.e. tumor, and I have changed my dataset.json like below:

{
    "dim": 3,
    "labels": {
        "0": "tumor"
    },
    "modalities": {
        "0": "CT"
    },
    "name": "tumor detection",
    "target_class": 0,
    "task": "Task500_TumorDetection",
    "test_labels": true
}

Does it look good or need any modifications?

mibaumgartner commented 1 year ago

Something seems to be off between the logs and the dataset.json you provided. The logs explicitly mention that no objects of class 1 were found for evaluation, but according to your dataset.json and explanations there shouldn't be a class 1. This results in the negative validation metrics.

sovanlal commented 1 year ago

@mibaumgartner It looks like I managed to start the training without interruption. It's still running and I hope it does so for the rest of the training :-). I think the CUDA driver was the issue; I had CUDA driver 12.0 and eventually had to downgrade it. I also fixed the negative metrics issue.

sovanlal commented 1 year ago

@mibaumgartner ,

the metrics that are calculated during training, are those for the training or the validation data?

[screenshot: metrics logged during training]

mibaumgartner commented 1 year ago

The metrics are computed on the validation data. During training they are only computed on a predefined number of selected patches, not on the entire images. Whole-image inference and evaluation will be performed at the end of the training.

sovanlal commented 1 year ago

Hi @mibaumgartner ,

I wanted to see the bounding box predictions as an image (in .nii).

I saw in one of the threads that nndet_box2nii is the required command. Could you please let me know what exact input is required for the command?

mibaumgartner commented 1 year ago

Is there anything specific you are missing? You can also retrieve additional information via the -h flag. All the info is also in the script here: https://github.com/MIC-DKFZ/nnDetection/blob/d637c5e2da16e0fe7cf8a5b860907eb57e60d4fe/scripts/utils.py#L33-L46
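For example, the help output already lists the expected arguments; the exact positional arguments should be taken from there rather than guessed:

```bash
# Print the expected arguments of the conversion command
nndet_box2nii -h
```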

sovanlal commented 1 year ago

@mibaumgartner ,

Do you have the performance metrics (e.g., FROC@0.1, AP@0.1) for Task007_Pancreas? I can only see FROC for the LUNA dataset in your paper. Please let me know.

mibaumgartner commented 1 year ago

All of the results are in the repository, you can find them here: https://github.com/MIC-DKFZ/nnDetection/blob/main/docs/results/nnDetectionV001.md

sovanlal commented 1 year ago

Hi @mibaumgartner ,

I was running the Task007_Pancreas dataset. The val_results that I got from fold0 are quite low, e.g., FROC@IoU0.1 is 0.30 and AP@IoU0.1 is 0.19. Although I have yet to run all the folds, this does not seem right. Could you please help figure out what went wrong? [screenshot: results_fold0]

mibaumgartner commented 1 year ago

Indeed, the values are significantly too low. What does the training log look like?

sovanlal commented 1 year ago

Attached is the training log file.

train.log

mibaumgartner commented 1 year ago

Dear @sovanlal ,

the performance in the training log looks rather suspicious: the online validation results are quite good, and the drop to the final whole-image evaluation should be much smaller.

Given the extremely small patch size, I think you are using a rather small GPU, which causes the problem: during online validation, objects are extracted from the patches, so if the object is "large", the performance during training looks good (since it basically always fills the entire patch), but during inference the objects are "suddenly" not detected anymore, which results in very bad performance. Training with an RTX 2080 Ti should result in a patch size of [40, 224, 224], which is significantly larger than the [10, 56, 56] from your training log.

sovanlal commented 1 year ago

@mibaumgartner ,

Does that mean I need to change my current GPU to an RTX 2080 Ti, or does it need some modification to the existing code? My current GPU is a GV100 with 32 GB.

mibaumgartner commented 1 year ago

Sorry, I lost track of the previous conversation. I think the tensor block size is too high, which leads to the tiny patch size (in the estimator). Which value did you enter there? Did you perform any other changes? Could you provide the prepare/planning log as well, to double-check that there are no other problems with the GPU? (The configured patch size is not suitable for your GPU VRAM.)

sovanlal commented 1 year ago

@mibaumgartner ,

I used 10 GB as the offset in these two places:

"RTX2080TI": int(mb2b(10000))

offset: int = mb2b(10000),

These were the only changes I made.

mibaumgartner commented 1 year ago

Alright, then I would recommend decreasing the offset until you are in a similar range to the above-mentioned patch size (40, 224, 224).

sovanlal commented 1 year ago

@mibaumgartner ,

Can [32, 192, 192] be a good choice? This is the maximum patch size I got.

mibaumgartner commented 1 year ago

Definitely better than the old one, but I'm still unsure why you get significantly reduced patch sizes with a 32 GB VRAM GPU. I got the patch size of [40, 224, 224] on an RTX 2080 Ti with 11 GB, and the patch size for the same dataset should only depend on the available VRAM of the GPU.

sovanlal commented 1 year ago

I was able to get (40, 224, 224), but now it's showing a CUDA out-of-memory error :-). I am not quite sure what's going on.

Do you know if anyone with a GV100 had similar issues?

Logs are attached.

It also sometimes showed the following error during planning: "Caught error (If out of memory error do not worry): CUDA error: unknown error CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect."

train.log logging.log

[screenshot: error output]

mibaumgartner commented 1 year ago

I rechecked the planning stage and noticed that I already took care of the memory estimation for bigger GPUs, which I had forgotten about (sorry). Even if more memory is available, it will only plan for an 11 GB VRAM GPU, so the configuration should consume around 11 GB, as was also the case on my local GPU.

This makes the training error "CUDA out of memory" even more dubious, though, since there is definitely enough memory available. I'm not aware of any issues reporting out of memory on GV100 GPUs, and I've trained several models on V100 GPUs in our cluster without any problems. Can you manually check the memory consumption when you start the training? It seems rather odd that it would run out of memory.
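For example, watching the GPU in a second terminal while the first epoch starts would already be informative (standard tooling, nothing nnDetection-specific):

```bash
# Refresh GPU memory and utilization every two seconds during training
watch -n 2 nvidia-smi
```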

sovanlal commented 1 year ago

GPU utilization was very low, so I thought it might be a GPU-related issue. However, I am running nnU-Net successfully (500 epochs and still running on a 921-case training dataset) with GPU utilization > 90%. There might be something going on with the nnDetection setup. Do I still need to increase the offset, or should I keep it at the default setting?

mibaumgartner commented 1 year ago

The offset should be kept at the default setting. nnDetection requires more CPU and IO resources than nnU-Net, so they cannot be compared directly. The FAQ includes a section on debugging bottlenecks, maybe there is something interesting in there? Slowly increasing the num_threads could help with a CPU bottleneck (make sure not to run out of RAM).