Closed — Tranquillar closed this issue 2 years ago
Hi Marcel,
This is weird. I tested this script many times and it works well with the preprocessed OASIS dataset in https://github.com/adalca/medical-datasets/blob/master/neurite-oasis.md.
I notice that your Dice score at iteration 0 is different from mine. Where did you download your OASIS dataset? Could you print out the training and validation files you used in Train_sym_neurite_oasis.py? (Lines 87 and 148-152)
Regards, Tony
Hi Tony,
Thank you for the quick reply. I downloaded the dataset from the link https://github.com/adalca/medical-datasets/blob/master/neurite-oasis.md. There seems to be only one version with 3D images ("neurite-oasis.v1.0").
Output of training files (Line 87):

names = sorted(glob.glob(datapath + '/OASIS_OAS1_*_MR1/aligned_norm.nii.gz'))[0:255]
...
Output of validation files:
(Line 149) fixed_img = sorted(glob.glob(datapath + '/OASIS_OAS1_*_MR1/aligned_norm.nii.gz'))[255]
(Line 150) fixed_label = sorted(glob.glob(datapath + '/OASIS_OAS1_*_MR1/aligned_seg35.nii.gz'))[255]
(Line 151) imgs = sorted(glob.glob(datapath + '/OASIS_OAS1_*_MR1/aligned_norm.nii.gz'))[256:261]
(Line 152) labels = sorted(glob.glob(datapath + '/OASIS_OAS1_*_MR1/aligned_seg35.nii.gz'))[256:261]
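For reference, the split these lines implement can be sketched as follows. The directory pattern `OASIS_OAS1_*_MR1` and the empty stand-in files are assumptions made only to keep the snippet self-contained and runnable without the real dataset:

```python
import glob
import os
import tempfile

# Create 261 fake subject directories mimicking the assumed neurite-OASIS
# layout OASIS_OAS1_xxxx_MR1/aligned_norm.nii.gz (empty files as stand-ins).
datapath = tempfile.mkdtemp()
for i in range(261):
    subject = os.path.join(datapath, "OASIS_OAS1_%04d_MR1" % (i + 1))
    os.makedirs(subject)
    open(os.path.join(subject, "aligned_norm.nii.gz"), "w").close()

# Same glob-and-slice logic as in Train_sym_neurite_oasis.py:
names = sorted(glob.glob(datapath + "/OASIS_OAS1_*_MR1/aligned_norm.nii.gz"))
train = names[0:255]          # subjects 1-255: training images
fixed_img = names[255]        # subject 256: fixed validation image
imgs = names[256:261]         # subjects 257-261: moving validation images
print(len(train), len(imgs))  # 255 5
```

Because the subject index is zero-padded, lexicographic `sorted()` order matches numeric order, so the slices always pick the same subjects.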
You also mentioned the difference in Dice score at iteration 0. Isn't that somewhat expected, because the initialization of the network weights is random? (Correct me if I'm wrong here.) I get a different Dice score at iteration 0 every time I start the script; the values are usually between 0.4 and 0.6.
Thank you for your help. Best regards, Marcel
Quick additional information:
This warning comes up every time I run the scripts for training or inference. Does it have any significance for the model?
> You also mentioned the difference in Dice score at iteration 0. Isn't that somewhat expected, because the initialization of the network weights is random? (Correct me if I'm wrong here.)
Yes, you are correct. I just want to make sure we are using the same data source.
> This warning comes up every time I run the scripts for training or inference.
This is a minor user warning and will not make a big difference in the result.
I cannot reproduce your result. Could you send me the "Functions.py", "Models.py" and "Train_sym_neurite_oasis.py" you used to see whether I can reproduce the same error?
Here you go.
Hi @Tranquillar ,
I tried the code you provided, and it seems there is no problem at all.
Here is the log using your code:

Validation Dice log for SYMNet_neurite_oasis:
0: 0.5447214704388855
1000: 0.6589802244014441
2000: 0.6850653705911428
Here is the log using the source code on GitHub:

Validation Dice log for SYMNet_neurite_oasis:
0: 0.5297590205219744
1000: 0.6597511465193313
Yet, I spotted two discrepancies between your code and the original one.
Left: your modified code (Train_sym_neurite_oasis.py)
Right: the original code (Train_sym_neurite_oasis.py)
Since the original script uses tabs for indentation, the extra spaces in line 133 may cause an issue (an IndentationError or TabError) in the Python interpreter.
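A minimal illustration of that failure mode (generic Python, not the repo's code): Python 3 rejects indentation that mixes tabs and spaces inconsistently within one block.

```python
# Tab-indented line followed by a space-indented line at the same depth.
source = "def f():\n\tx = 1\n        y = 2\n"
try:
    compile(source, "<example>", "exec")
except TabError as err:
    print("TabError:", err)  # inconsistent use of tabs and spaces in indentation
```

This is why a file that looks fine in one editor can fail to run after a copy-paste that converts tabs to spaces.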
Try removing the extra spaces mentioned above, or re-download the training script and try again.
If the problem persists, it is likely a non-trivial environment problem, and you may try running the code on another machine, if possible.
Thank you very much for looking into it. I will try it on a different machine.
Do you mind telling me which CUDA and PyTorch versions you are using? I want to make sure there are as few differences as possible.
Thanks in advance 😊
Sure. I am using Ubuntu 16.04 LTS + PyTorch 1.9.0+cu111. The code was tested with an NVIDIA RTX 3080 GPU (driver version 460.84, CUDA version 11.2).
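For anyone comparing environments, the installed versions can be printed with a short snippet (a generic sketch; the `find_spec` guard is only there so it also runs on a machine without PyTorch installed):

```python
import importlib.util

# Record the exact PyTorch/CUDA build for a bug report.
if importlib.util.find_spec("torch") is None:
    print("torch not installed")
else:
    import torch
    print("PyTorch:", torch.__version__)           # e.g. 1.9.0+cu111
    print("built with CUDA:", torch.version.cuda)  # e.g. 11.1
    print("CUDA available:", torch.cuda.is_available())
```

Note that `torch.version.cuda` reports the CUDA toolkit PyTorch was built against, which can differ from the driver's CUDA version shown by `nvidia-smi`.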
Hi Tony, the situation is resolved. I ran the code in a Docker container with the exact PyTorch and CUDA versions you used:

Ubuntu 16.04 LTS + PyTorch 1.9.0+cu111.
Everything looks a lot better now. This is the validation score log after 55k iterations:

0: 0.4097486243230383
5000: 0.7047512729914296
10000: 0.7209373161013326
15000: 0.7476094422734847
20000: 0.7573324792065288
25000: 0.76341926686527
30000: 0.7683853719520911
35000: 0.7717567312637159
40000: 0.7773653384358569
45000: 0.7856832570252698
50000: 0.7811629084686469
55000: 0.783746416889304
Thanks again for your help. 👍 I will now start training the network with my own data.
Quick summary of the issue and solution in case anyone else experiences this:
My Setup
Symptoms:
- The validation Dice score barely changes across iterations instead of steadily improving.
- The overall loss eventually becomes negative infinity and then NaN, after which every validation fails with a division-by-zero error.

Solution:
- Run the code in the exact environment the authors tested: Ubuntu 16.04 LTS + PyTorch 1.9.0+cu111 (e.g. in a Docker container).
Good to hear that. Thanks for your summary. 👍
If you are looking for a state-of-the-art registration method, you may also check out our latest image registration framework for medical images at https://github.com/cwmok/Conditional_LapIRN.
Hi,
I really enjoyed reading your paper and I want to reproduce the results. I am currently trying to train the model with the OASIS data set using the example file you provided ("Train_sym_neurite_oasis.py").
However, I am hindered by two major problems:
Problem 1:
The validation scores are not even close to your results. Also, the value doesn't seem to change at all, which is really confusing to me.
My validation scores:
0: 0.554038160843976
5000: 0.585584165076553
10000: 0.585584165076553
15000: 0.585584165076553
20000: 0.585584165076553
25000: 0.585584165076553
30000: 0.585584165076553
35000: 0.585584165076553
Your validation scores:
0: 0.570166888906167
5000: 0.7349817233331957
10000: 0.7674420857250869
15000: 0.7831680992948633
20000: 0.7861187128159433
25000: 0.7941986866278279
30000: 0.792603563494231
35000: 0.7965674164395773
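As a generic debugging aid (not from the SYMNet code): a score that never moves often means the weights stopped updating, e.g. zero gradients or NaN parameters. A quick sanity check on a stand-in model, guarded so it also runs without PyTorch installed:

```python
import importlib.util

if importlib.util.find_spec("torch") is None:
    print("torch not installed")
else:
    import torch
    model = torch.nn.Linear(4, 2)  # stand-in for the registration network
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    # If this sum is 0.0, the optimizer has nothing to apply and the
    # validation score will stay frozen.
    grad_sum = sum(p.grad.abs().sum().item() for p in model.parameters())
    print("all-zero gradients:", grad_sum == 0.0)
    print("NaN parameters:", any(torch.isnan(p).any().item()
                                 for p in model.parameters()))
```

Running the same two checks on the real network right after a training step quickly narrows down whether the problem is in the data, the loss, or the environment.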
Problem 2:
At some point the overall loss becomes negative infinity and then NaN.
Afterwards, every subsequent validation causes a division-by-zero error.
This seems to happen every time. I am using the default parameters of the file.
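For context, one generic way such a division-by-zero can arise in a Dice-based validation (a sketch, not the repo's implementation): once the weights are NaN, the warped segmentation can come out empty, and a plain Dice formula then divides by zero.

```python
import numpy as np

def dice(a, b):
    # Plain Dice: 2*|A∩B| / (|A| + |B|); the denominator is zero
    # when both masks are empty.
    inter = int(np.logical_and(a, b).sum())
    total = int(a.sum()) + int(b.sum())
    return 2.0 * inter / total

pred = np.zeros((4, 4), dtype=bool)  # degenerate (all-zero) prediction
gt = np.zeros((4, 4), dtype=bool)
try:
    dice(pred, gt)
except ZeroDivisionError:
    print("division by zero for empty masks")
```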
Do you have any idea how to solve these problems? I would really appreciate your help with this.
Best regards Marcel