BiaPyX / BiaPy

Open source Python library for building bioimage analysis pipelines
https://BiaPyX.github.io
MIT License

MitoEM training killed #27

Closed: Mohinta2892 closed this issue 9 months ago

Mohinta2892 commented 10 months ago

Hi Daniel,

I am trying to train on MitoEM with BiaPy, following your guidelines in the docs. However, the training gets killed at the stage shown below:

0) Loading train images . . .
Loading data from /mnt/mito-data/data_organised_biapy/MitoEM-R/train/x_BC_thick
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:52<00:00,  9.59it/s]
*** Loaded data shape is (128000, 256, 256, 1)
1) Loading train GT . . .
Loading data from /mnt/mito-data/data_organised_biapy/MitoEM-R/train/y_BC_thick
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [01:06<00:00,  7.52it/s]
*** Loaded data shape is (128000, 256, 256, 2)
Creating validation data
Not all samples seem to have the same shape. Number of samples: 115200
*** Loaded train data shape is: (115200, 256, 256, 1)
*** Loaded train GT shape is: (115200, 256, 256, 2)
*** Loaded validation data shape is: (12800, 256, 256, 1)
*** Loaded validation GT shape is: (12800, 256, 256, 2)
### END LOAD ###
2) Loading test images . . .
Loading data from /mnt/mito-data/data_organised_biapy/MitoEM-R/test/x_BC_thick
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:44<00:00, 11.30it/s]
*** Loaded data shape is (500, 4096, 4096, 1)
########################
#  PREPARE GENERATORS  #
########################

Initializing train data generator . . .
Killed

This is how I run it:

python main.py --config /installations/BiaPy/templates/instance_segmentation/2d_instance_segmentation.yaml --result_dir /mnt/mito-data --name atten_unet_mito_r --run_id 1 --gpu 0

I am running it through a Docker image that I built myself, and it can definitely access the TensorFlow GPUs. I have a 24 GB RTX 3090 and 128 GB of RAM. I have also reduced the training batch_size from the default of 6 to 2.
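For reference, the batch-size change corresponds to an override in the YAML config roughly like this (the key name TRAIN.BATCH_SIZE is assumed from the standard BiaPy configuration):

TRAIN:
  BATCH_SIZE: 2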

Any clue what might be happening?

Best, Samia

danifranco commented 10 months ago

Hello!

It seems that the OS killed your process, most likely due to memory problems. You can try setting DATA.TRAIN.IN_MEMORY to False, so the training data is not loaded entirely into memory and images are instead loaded on the fly, and DATA.EXTRACT_RANDOM_PATCH to True, so a random patch of the specified shape (DATA.PATCH_SIZE) is extracted from each loaded image.
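As a rough, illustrative estimate (assuming the patches end up as float32 arrays, which is an assumption about the internals): the in-memory training set above is about 115,200 × 256 × 256 × 4 bytes ≈ 30 GB for the images plus ≈ 60 GB for the two-channel labels, so a single extra copy while preparing the generators can already exhaust 128 GB of RAM. The suggested changes would correspond to YAML overrides along these lines (section layout assumed from the standard BiaPy config):

DATA:
  EXTRACT_RANDOM_PATCH: True
  TRAIN:
    IN_MEMORY: False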

Furthermore, we are now working on moving BiaPy from TensorFlow to PyTorch, so we hope to reduce memory usage a bit more, among other advantages.

Thank you for using BiaPy!

Best regards,

Dani

danifranco commented 9 months ago

Hello!

I've just pushed some changes (commit 27120f5a7) that reduce memory usage in the workflows. They can make a huge difference on big datasets such as MitoEM, so you may now be able to train the model.
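If you are running BiaPy from the cloned repository, picking up those changes would look roughly like this (the repository path is taken from the command above; the branch name is an assumption):

cd /installations/BiaPy
git pull origin master
git log --oneline -1   # should show 27120f5a7 or a later commit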

Best,

Dani

Mohinta2892 commented 9 months ago

Hi Dani,

Many thanks, I will try the latest pushed changes and close the issue if all works well!

Best wishes, Samia

danifranco commented 9 months ago

Hello again,

I've made a few more changes to save memory. I've set TEST.REDUCE_MEMORY to True, which means float16 will be used instead of float32 for the model predictions and some other data. This will help save memory, especially when working with large images like those in MitoEM. It might make the predictions slightly less accurate, but I think the impact shouldn't be noticeable.
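To put the float16 change in perspective (a rough estimate; the actual number of output channels and how much is buffered at once depend on the workflow): a single 4096 × 4096 MitoEM slice with two output channels takes about 134 MB in float32 but about 67 MB in float16, so holding predictions for all 500 test slices drops from roughly 67 GB to roughly 34 GB. In the YAML config this corresponds to something like:

TEST:
  REDUCE_MEMORY: True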

Feel free to close the issue.

Best,

Dani

Mohinta2892 commented 9 months ago

Thanks Dani!!

Best, Samia