DominikBatic / EndoViT

Large-scale Self-supervised Pre-training for Endoscopy

torch.cuda.OutOfMemoryError: CUDA out of memory. #1

Closed Grizemc closed 5 months ago

Grizemc commented 6 months ago

Redirected stdout to /home/cwz/EndoViT/finetuning/semantic_segmentation/output_dir/low_res/full_dataset/EndoViT/run_0017_21:01-08.04.24__LowRes_FullDataset_EndoViT_Run01_seed_1665/out.txt
Redirected stderr to /home/cwz/EndoViT/finetuning/semantic_segmentation/output_dir/low_res/full_dataset/EndoViT/run_0017_21:01-08.04.24__LowRes_FullDataset_EndoViT_Run01_seed_1665/err.txt
Traceback (most recent call last):
  File "./finetuning/semantic_segmentation/model/main.py", line 127, in <module>
    main()
  File "./finetuning/semantic_segmentation/model/main.py", line 118, in main
    best_result_dict = trainer.run_training()
  File "/home/cwz/EndoViT/finetuning/semantic_segmentation/model/src/trainer.py", line 431, in run_training
    general_stats, train_stats = self._train_one_epoch(
  File "/home/cwz/EndoViT/finetuning/semantic_segmentation/model/src/trainer.py", line 352, in _train_one_epoch
    logits = model(inputs)
  File "/home/anaconda3/envs/endovit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cwz/EndoViT/finetuning/semantic_segmentation/model/DPT/dpt/models.py", line 100, in forward
    out = self.scratch.output_conv(path_1)
  File "/home/anaconda3/envs/endovit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/envs/endovit/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/anaconda3/envs/endovit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/envs/endovit/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
    return F.batch_norm(
  File "/home/anaconda3/envs/endovit/lib/python3.8/site-packages/torch/nn/functional.py", line 2450, in batch_norm
    return torch.batch_norm(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 10.75 GiB total capacity; 9.56 GiB already allocated; 248.75 MiB free; 9.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 🚀 View run run_0017_21:01-08.04.24__LowRes_FullDataset_EndoViT_Run01_seed_1665 at: https://wandb.ai/iezj/EndoViT_Finetuning_Segmentation/runs/9pxd55qn

DominikBatic commented 6 months ago

Hello, it seems you are running out of GPU memory. We used a single NVIDIA A40 (48 GB of memory) for both pretraining and finetuning.

Try lowering the batch size in the config ".json" files at: https://github.com/DominikBatic/EndoViT/blob/main/finetuning/semantic_segmentation/output_dir/low_res/full_dataset/EndoViT

Perhaps set it to 16 (from 64).

Let me know if that helps.
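For reference, here is a minimal sketch of how one might apply both changes before relaunching the run. The config file name and the "batch_size" key are assumptions, so check the actual ".json" files for the real names; the PYTORCH_CUDA_ALLOC_CONF hint comes straight from the error message above.

```python
# Hedged sketch: lower the batch size in a run config and reduce allocator
# fragmentation. The config path and the "batch_size" key name are assumptions;
# adjust them to whatever the actual ".json" files in the repository use.
import json
import os
from pathlib import Path

config_path = Path("finetuning/semantic_segmentation/output_dir/"
                   "low_res/full_dataset/EndoViT/run_config.json")  # hypothetical file name

config = json.loads(config_path.read_text())
config["batch_size"] = 16  # assumed key name; the released configs use 64
config_path.write_text(json.dumps(config, indent=4))

# Optional, as suggested by the CUDA OOM message itself: cap the allocator's
# split size to reduce fragmentation. Set this before launching training.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```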

Grizemc commented 6 months ago


Thanks, I'll try that. Since some images were discarded during the preprocessing of my dataset, I changed a few values. After training with bs=16, the output is as follows:

Traceback (most recent call last):
  File "./finetuning/semantic_segmentation/model/main.py", line 127, in <module>
    main()
  File "./finetuning/semantic_segmentation/model/main.py", line 122, in main
    trainer.run_testing(best_result_dict["ckpt_path"])
  File "/home/cwz/EndoViT/finetuning/semantic_segmentation/model/src/trainer.py", line 504, in run_testing
    test_stats = self.evaluate(
  File "/home/anaconda3/envs/endovit/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/cwz/EndoViT/finetuning/semantic_segmentation/model/src/trainer.py", line 255, in evaluate
    f"Auxiliary{prefix}/Example{str(current_step)}": wandb.Image(util.create_wandb_plots(self.config, inputs[what_to_plot["aux_plots"][str(current_step)]], targets[what_to_plot["aux_plots"][str(current_step)]], preds[what_to_plot["aux_plots"][str(current_step)]]))
IndexError: index 21 is out of bounds for dimension 0 with size 16

wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: Run history: (sparkline plots for train/Acc, train/Dice, train/IoU, train/Observed_classes, train/epoch_1000x, train/max_lr, train/min_lr, train/train_loss, val/Acc, val/Classes, val/Dice, val/IoU, val/loss omitted)
wandb: Run summary:
wandb:              train/Acc 0.9637
wandb:             train/Dice 0.88538
wandb:              train/IoU 0.83814
wandb: train/Observed_classes 8
wandb:      train/epoch_1000x 1981
wandb:           train/max_lr 0.0
wandb:           train/min_lr 0.0
wandb:       train/train_loss 0.10589
wandb:                val/Acc 0.96137
wandb:            val/Classes 8
wandb:               val/Dice 0.88094
wandb:                val/IoU 0.83239
wandb:               val/loss 0.20891
wandb: 🚀 View run run_0001_21:26-08.04.24__HighRes_FullDataset_EndoViT_Run01_seed_1665 at: https://wandb.ai/iezj/EndoViT_Finetuning_Segmentation/runs/d05c4ke4
wandb: Synced 6 W&B file(s), 4 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./finetuning/semantic_segmentation/output_dir/high_res/full_dataset/EndoViT/run_0001_21:26-08.04.24__HighRes_FullDataset_EndoViT_Run01_seed_1665/logs/wandb/run-20240408_212626-d05c4ke4/logs

Grizemc commented 6 months ago


I also want to know what the different seeds mean. The .json files corresponding to the different seeds seem to be identical, and the hyperparameters are the same as well, so why do I need to set different seeds? Thank you again!

DominikBatic commented 6 months ago

Hello again,

the seeds are for measuring the results. Since the intention was to publish a paper, we wanted to run three training runs (with 3 different predefined seeds) and then report the average result (i.e. the mean IoU).

The configs are otherwise the same, only the seed changed.

We do this in all of the subtasks, not just semantic segmentation. If you don't care about reproducing the results and just want to use the code, you can take any of the config files.

If you want a random seed, set the seed to -1.
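(For illustration only, this is a common way to implement the "-1 means pick a random seed" convention; it is a generic sketch, not the repository's exact seeding code.)

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> int:
    """Seed Python, NumPy and PyTorch; a seed of -1 means 'choose one at random'."""
    if seed == -1:
        seed = random.randint(0, 2**31 - 1)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    return seed
```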


The code error happens because we visualize some images from the validation dataset and log them to wandb. We were looking for very specific images so we could compare ourselves to the following paper:

https://ieeexplore.ieee.org/document/9871583

Therefore, we manually select the indices of these images in the code here:

https://github.com/DominikBatic/EndoViT/blob/5cb605abeb4d785fd4008c286fd6d0e704098584/finetuning/semantic_segmentation/model/src/trainer.py#L163-L185

and because the valid in-batch indices were 0-63 (batch size 64) but are now 0-15 (batch size 16), some of the hard-coded indices are out of bounds.

If you still want to visualize some of the results, just delete all indices above 15 from the dictionaries. If you don't want to visualize anything, leave the dictionaries empty.

Basically, the "oblig_plots" dictionary specifies the images we were looking for, and "aux_plots" holds some extra images we wanted to visualize. In the "oblig_plots" dictionary, an entry like "6": 0 means: "at batch 6, visualize the image at index 0". A sketch of the structure is below.
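As an illustration (the real dictionaries live in the linked trainer.py lines), the structure and the fix for a smaller batch size look roughly like this:

```python
# Illustrative only -- the actual entries are hard-coded in trainer.py (L163-185).
# Keys are batch indices (as strings); values are image indices within that batch.
oblig_plots = {
    "6": 0,     # at batch 6, visualize the image at index 0
    "11": 21,   # breaks once batch_size = 16, because 21 > 15
}
aux_plots = {
    "3": 40,    # likewise out of bounds for batch_size = 16
}

batch_size = 16

# Keep only entries whose image index still fits inside the smaller batch ...
oblig_plots = {b: i for b, i in oblig_plots.items() if i < batch_size}
aux_plots = {b: i for b, i in aux_plots.items() if i < batch_size}

# ... or disable visualization entirely by leaving the dictionaries empty:
# oblig_plots, aux_plots = {}, {}
```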

Hope this helps! If any other issues occur please let me know.

Grizemc commented 6 months ago


I understand! Thank you very much for your patient reply, and thank you again! I have one last question. I see that your data uses several segmentation classes. Are there semantic labels for body tissues, such as text labels like liver/gallbladder/stomach? Is this something I can add and map myself? Also, looking at your code and the checkpoint file obtained after training, how should I run a small-scale test? Do readers need to write that test code themselves, or did I simply miss it? Thanks! I hope you'll forgive me if I'm being ignorant in any way.

DominikBatic commented 6 months ago

The dataset for segmentation is CholecSeg8k and can be found under this link:

Here is a screenshot of a table from their kaggle page (https://www.kaggle.com/datasets/newslab/cholecseg8k):

[screenshot: CholecSeg8k class/color table from the Kaggle page]

Each class is defined by 1 color value, and these color values can be found in the "watershed masks" you get when you download the dataset.

It is important to note that the dataset contains a mistake. Instead of the expected 13 colors, the masks contain 15 colors in total: white and black shouldn't appear in the images, but a few pixels have these colors. We describe this issue and some others we found while preprocessing the dataset here:

In summary, to get around this issue we created an additional 14th class (an "error" class), and when calculating metrics we set index 13 as the index to be ignored.
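As a small, hedged illustration of the ignore-index idea (the repository's metric code may wire this differently), PyTorch losses accept the ignored class directly:

```python
import torch
import torch.nn as nn

# Illustration only: exclude the extra "error" class (index 13) from the loss.
criterion = nn.CrossEntropyLoss(ignore_index=13)

logits = torch.randn(2, 14, 64, 64)          # (batch, 13 original + 1 error class, H, W)
targets = torch.randint(0, 14, (2, 64, 64))  # pixels labelled 13 do not contribute
loss = criterion(logits, targets)
```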

However, we compare ourselves to the results of the benchmark produced in this paper:

The benchmark authors made a few changes to the original 13 classes, namely they combined some of them: 5 rare classes were merged into a single class (named "Misc." for miscellaneous) and the 2 instrument classes into an "Instruments" class, as noted here:

[screenshot: class-mapping table from the benchmark paper]

This leaves us with 8 classes in the end (or 9, counting the error class).

In the repository at:

you can find the scripts we used to preprocess the original CholecSeg8k dataset to match the 8-class definition.

The "CholecSeg8k_color_dict_combined_classes.py" file is a collection of dictionaries we used for mapping classes and their colors.

The "watershed_to_class_v3" dictionary maps the 13+1 original classes into the 8+1 new ones. And "class_to_color" dictionary maps the 8+1 new classes to their visualization colors.


TESTING:

The "trainer.py" file contains the code for training and testing. Both of those are run from the "main.py" script.

If you take a look at main.py file you can find this code:

https://github.com/DominikBatic/EndoViT/blob/5cb605abeb4d785fd4008c286fd6d0e704098584/finetuning/semantic_segmentation/model/main.py#L112-L120

As you can see, you first initialize a Trainer with your desired config. After training, we just run the "run_testing" function and provide it the path to the checkpoint we wish to test.
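In other words, the training/testing flow boils down to something like the following sketch (paraphrasing the linked main.py; the import path and the config loading are assumptions):

```python
# Paraphrased flow, based on the description above; not a verbatim copy of main.py.
import json

from src.trainer import Trainer  # assumed import path, relative to the model/ directory

with open("path/to/your_run_config.json") as f:  # any of the run ".json" configs
    config = json.load(f)

trainer = Trainer(config)                            # 1) build the trainer from the config
best_result_dict = trainer.run_training()            # 2) train; returns the best checkpoint path
trainer.run_testing(best_result_dict["ckpt_path"])   # 3) evaluate that checkpoint on the test split
```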


The way you define what images will be used for training/validation/testing is as follows:

We define the CholecSeg8k dataset class here:

CholecSeg8k is a subset of the larger Cholec80 dataset (which itself is not used for segmentation) and contains 17 of the 80 Cholec80 videos.

Here you can see the list of all CholecSeg8k videos and how they are split into train/val/test sets. (This was done exactly as the original CholecSeg8k paper describes.)

https://github.com/DominikBatic/EndoViT/blob/5cb605abeb4d785fd4008c286fd6d0e704098584/finetuning/semantic_segmentation/model/src/dataset.py#L103-L111

In the config file you can manually select which videos go where (look for "Splits" key):

"Splits": { "train_videos": [1, 9, 18, 20, 24, 25, 26, 28, 35, 37, 43, 48, 55], "val_videos": [17, 52], "test_videos": [12, 27] },

Unfortunately, if you want to use custom datasets, then you will need to write your own dataset and dataloader. Otherwise, you can create a Trainer instance with a new config file specifying the CholecSeg8k videos you wish to use for testing.
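If you do go the custom-dataset route, a generic (not EndoViT-specific) PyTorch segmentation Dataset skeleton could look like this; the file layout and preprocessing are assumptions:

```python
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset


class CustomSegDataset(Dataset):
    """Generic sketch: one RGB image plus one single-channel label map per sample."""

    def __init__(self, image_dir: str, mask_dir: str):
        self.image_paths = sorted(Path(image_dir).glob("*.png"))
        self.mask_paths = sorted(Path(mask_dir).glob("*.png"))
        assert len(self.image_paths) == len(self.mask_paths)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = np.array(Image.open(self.image_paths[idx]).convert("RGB"), dtype=np.float32) / 255.0
        mask = np.array(Image.open(self.mask_paths[idx]), dtype=np.int64)
        # Return a (3, H, W) float image and an (H, W) integer label map.
        return torch.from_numpy(image).permute(2, 0, 1), torch.from_numpy(mask)


loader = DataLoader(CustomSegDataset("images/", "masks/"), batch_size=16, shuffle=True)
```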

This was a lot of text, and I'm sure it will be confusing at times. If you need me to explain anything further, just ask.

Have a nice day!

Grizemc commented 6 months ago


Thank you very much for your rigorous and detailed reply! I can't help but marvel at what a remarkable author you are; after these discussions with you, I have gained a much deeper appreciation of the qualities of German scholars and a much more detailed understanding of the work. Thank you again for your team's outstanding contributions. If possible, I will cite your paper in the future. Finally, I wish you a happy life!

DominikBatic commented 6 months ago

Thank you very much!
I wish you all the best in the future!