djimenezsanchez / NaroNet

Trained only with subject-level labels, NaroNet discovers phenotypes, neighborhoods, and areas with the highest influence when classifying subject types.
GNU Affero General Public License v3.0

CUDA out of memory #6

Open CarolRameder opened 1 year ago

CarolRameder commented 1 year ago

I am experimenting with the Endometrial_POLE dataset with an added "Patients_per_Image" file, created according to the instructions in the README file on GitHub, with the aim of using multiple images per patient. I have attached the files in case the issue is caused by their content.

Patient_to_Image.xlsx Image_Labels.xlsx

The first issue concerns the classes used to train NaroNet. In this context, each of the 12 patients selected in the cohort has 4 classes assigned to their knowledge graph, one for each classification task. However, the following two lines in NaroNet.py select only the second label. Is this correct? Why so?

    self.Train_indices = [self.IndexAndClass[i][1] for i in self.Train_indices]
    self.Test_indices = [self.IndexAndClass[i][1] for i in self.Test_indices]

I also observed that the sets of training and test indices are always the same, as shown in the attached image. This causes the following issue: the class assigned to the patients selected in the training set for the second classification task, saved in the y_train variable, is always 1, which raises the following error:

File "/home/carol/NaroNet-main/NaroNet-main/src/NaroNet/NaroNet.py", line 204, in initialize_fold self.Trainindices, = ros.fit_resample(x_trainn, y_trainn)

ValueError: The target 'y' needs to have more than 1 class. Got 1 class instead

The definitions of x_trainn and y_trainn are shown below for context; I added these two variables for clarity, and the functionality is exactly the same.

    x_trainn = np.expand_dims(np.array(self.Train_indices),1)
    y_trainn = [self.labels[i][0] for i in self.Train_indices]
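For reference, here is a standalone snippet (assuming ros is imbalanced-learn's RandomOverSampler, which matches the error message) that reproduces the failure when all training labels belong to one class:

    import numpy as np
    from imblearn.over_sampling import RandomOverSampler

    train_indices = [0, 1, 2, 3, 4, 6]
    labels = {i: [1] for i in train_indices}   # every training patient has class 1

    x_trainn = np.expand_dims(np.array(train_indices), 1)
    y_trainn = [labels[i][0] for i in train_indices]

    ros = RandomOverSampler()
    # Raises: ValueError: The target 'y' needs to have more than 1 class. Got 1 class instead
    x_res, y_res = ros.fit_resample(x_trainn, y_trainn)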

To get past this issue, I set the training and test indices by hand so that both sets contain at least one instance of class 0.

    self.Train_indices = [0, 1, 2, 3, 4, 6]
    self.Test_indices = [5, 7, 8, 9, 10, 11]

This leads me to the second issue, shown below. For this experiment I am using a server with two GPUs. I tried each of them on two separate runs but received the same error. Both GPUs are NVIDIA RTX A6000s with 48 GB of memory, more than the 11 GB of the hardware mentioned in the paper. As far as I can tell from the code, no data-parallelization method is implemented.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 618.00 MiB (GPU 1; 47.54 GiB total capacity; 45.74 GiB already allocated; 189.12 MiB free; 45.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This first appeared on the following line:

File "/home/carol/NaroNet-main/NaroNet-main/src/NaroNet/NaroNet_model/GNN.py", line 440, in MLPintoFeatures x = F.relu(conv0(x))

I added these lines at the beginning of GNN.py:

    import os
    os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'
    import torch
    torch.cuda.empty_cache()
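(One caveat, as I understand PyTorch's caching allocator: PYTORCH_CUDA_ALLOC_CONF is read when the allocator first initializes, so this only takes effect if GNN.py is imported before any CUDA memory is allocated. Setting it at the very start of the entry script, or exporting it in the shell before launching, would be the safer option.)

    # Safer placement: entry script, before any CUDA allocation happens
    import os
    os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'
    import torch   # imported only after the environment variable is set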

Now the error appears at:

File "/home/carol/NaroNet-main/NaroNet-main/src/NaroNet/NaroNet_model/GNN.py", line 443, in MLPintoFeatures x = F.relu(conv0(x))

djimenezsanchez commented 1 year ago

I have checked the .xlsx files and Image_Labels.xlsx has duplicated patient entries.

Your file: [image]

Corrected: [image]

djimenezsanchez commented 1 year ago

Let me answer your questions one by one. Your questions appear in italics:

*However, the following two lines in NaroNet.py select only the second label. Is this correct? Why so?*

The structure of self.IndexAndClass is the following: [image]

This line selects the indices used in the IndexAndClass list; this is done to keep consistency:

    self.Train_indices = [self.IndexAndClass[i][1] for i in self.Train_indices]
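Since the screenshot is not reproduced here, a hypothetical sketch of the structure (names and values assumed, not taken from the code) may help: each entry pairs a subject with its dataset index, and element [1] is the index the line above extracts.

    # Hypothetical illustration: [subject_name, dataset_index, ...]
    IndexAndClass = [
        ['patient_01', 0],
        ['patient_02', 1],
    ]
    train_indices = [IndexAndClass[i][1] for i in [0, 1]]   # -> [0, 1]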

*I also observed that the sets of training and test indices are always the same, as shown in the attached image. This causes the following issue: the class assigned to the patients selected in the training set for the second classification task, saved in the y_train variable, is always 1, which raises the following error:*

This error may be raised depending on the number of folds used. In the paper we used a leave-one-out strategy (i.e., folds=12). This way the training set always contains 11 patients, and therefore both classes are included.
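For illustration, a minimal sketch using scikit-learn's LeaveOneOut (the label distribution below is hypothetical) showing why 11-patient training sets always keep both classes:

    from sklearn.model_selection import LeaveOneOut

    patients = list(range(12))
    labels = [1] * 10 + [0] * 2   # hypothetical: only two class-0 patients

    for train_idx, test_idx in LeaveOneOut().split(patients):
        train_classes = {labels[i] for i in train_idx}
        # With 11 of 12 patients in training, at most one class-0 patient
        # is held out, so both classes are always present.
        assert train_classes == {0, 1}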

djimenezsanchez commented 1 year ago

With respect to the CUDA out of memory error, yes, we ran it on an 11GB GPU.

You can use the Linux command nvidia-smi to check whether there are "dead" processes on the GPU. Make sure the GPU is empty before running any code.

I'm repeating this experiment, joining the images of each patient into a graph. I will let you know if I find any memory-related issues.

djimenezsanchez commented 1 year ago

I ran the code creating a graph of image fields for each patient using the endometrial dataset. It works with the following Image_Labels.xlsx format: [image]

Indeed, this experiment gives a CUDA out-of-memory error. To reduce the memory burden, the best strategy is to increase the patch size.
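To see why patch size matters, a back-of-the-envelope sketch (image and patch sizes below are made up): each image field is divided into patches that become graph nodes, so doubling the patch side length roughly quarters the node count.

    # Hypothetical numbers: larger patches -> fewer nodes -> less GPU memory.
    image_h, image_w = 2048, 2048           # image-field size in pixels (assumed)
    for patch in (10, 20, 40):              # candidate patch sizes (assumed)
        n_nodes = (image_h // patch) * (image_w // patch)
        print(f"patch={patch:>3}px -> {n_nodes:,} nodes per image")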

In the paper we didn't run this experiment exactly like this; instead, we performed patient-wise predictions from the mean prediction value of all images. See the paragraph below:

Patient-wise quantification. Finally, to test NaroNet’s predictive power classifying subjects, i.e. patients and not individual images, based on the POLE mutation, we performed a leave-one-out experiment: iteratively, 11 patients (represented by all their images) were used to train the model and one patient was used for testing. Patient-wise predictions were calculated as the mean prediction value of all images that correspond to the test patient, achieving an overall accuracy of 83.33% with a 95% CI [63.02–100.00%], and an AUC of 0.67 with a 95% CI [0.32-1].
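A minimal sketch of that aggregation step (the probability values are made up):

    import numpy as np

    # Per-image predicted probabilities for one held-out test patient (made up)
    image_preds = np.array([0.70, 0.55, 0.81])
    patient_pred = image_preds.mean()        # patient-wise prediction, per the paper
    patient_label = int(patient_pred > 0.5)  # threshold at 0.5 for the binary task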

Still, NaroNet has been used to join images by patient, as can be seen in the Breast cancer dataset and this publication (https://www.nature.com/articles/s41746-023-00795-x)

CarolRameder commented 1 year ago

> This error may be raised depending on the number of folds used. In the paper we used a leave-one-out strategy (i.e., folds=12). This way the training set always contains 11 patients, and therefore both classes are included.

The default value of folds is 10 for the Endometrial POLE dataset (i.e., args['folds'] = 10 in DatasetParameters.py). Should I change this value to 12 from now on?

djimenezsanchez commented 1 year ago

> The default value of folds is 10 for the Endometrial POLE dataset (i.e., args['folds'] = 10 in DatasetParameters.py). Should I change this value to 12 from now on?

Yes, the default value of folds is for the image-level experiment and is set to 10. If you want to perform leave-one-out on the endometrial dataset (N=12), you can set it to 12.
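Concretely, the change would look like this (a sketch based on the args['folds'] entry mentioned above; the surrounding dictionary structure is assumed):

    # DatasetParameters.py (location per this thread)
    args['folds'] = 12   # leave-one-out over the 12-patient endometrial cohort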