PatrickTUM / UnCRtainTS

https://patricktum.github.io/cloud_removal/

How much GPU memory is needed? #5

Closed · zahrabsh74 closed this 10 months ago

zahrabsh74 commented 10 months ago

Hello @PatrickTUM, thank you for your great work! I have an issue running your code to train the model on the SEN12MSCR dataset with the following command:

python train_reconstruct.py --experiment_name my_first_experiment --root3 /home/bada_za/data/uncrtaints/SEN12MSCR --model uncrtaints --input_t 3 --region all --epochs 20 --lr 0.001 --batch_size 4 --gamma 1.0 --scale_by 10.0 --trained_checkp "" --loss MGNLL --covmode diag --var_nonLinearity softplus --display_step 10 --use_sar --block_type mbconv --n_head 16 --device cuda --res_dir ./results --rdm_seed 1

I tried to train it on a GPU with 16 GB of memory but received a CUDA error, so I would like to know how much GPU memory this model needs for training on both datasets. The error I received is: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 262.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 55.06 MiB is free. Process 1048438 has 1.42 GiB memory in use. Including non-PyTorch memory, this process has 13.27 GiB memory in use. Of the allocated memory 13.10 GiB is allocated by PyTorch, and 32.70 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
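
As the error message itself suggests, allocator fragmentation can sometimes be reduced by setting max_split_size_mb through the PYTORCH_CUDA_ALLOC_CONF environment variable before launching training; the value below is only an illustrative guess, not a tuned setting, and this does not lower the model's actual memory requirement:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# then re-run the same train_reconstruct.py command as above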

PatrickTUM commented 10 months ago

Hi @zahrabsh74,

great seeing you interested in our work! If I recall correctly, the GPU memory needed for training with the cited command is around 30 GiB. Unfortunately, this may not fit on your current device. However, here are some suggestions for optimization that may help you at a small cost in model performance:
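
For instance, simply lowering --batch_size in the command above is one such option; a reduced-memory variant might look like the following (the batch size of 1 is only an illustrative assumption, not a tuned recommendation):

python train_reconstruct.py --experiment_name my_first_experiment --root3 /home/bada_za/data/uncrtaints/SEN12MSCR --model uncrtaints --input_t 3 --region all --epochs 20 --lr 0.001 --batch_size 1 --gamma 1.0 --scale_by 10.0 --trained_checkp "" --loss MGNLL --covmode diag --var_nonLinearity softplus --display_step 10 --use_sar --block_type mbconv --n_head 16 --device cuda --res_dir ./results --rdm_seed 1

Reducing --input_t (the number of input time points) should also shrink the memory footprint, though it may likewise cost some reconstruction quality.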

Hoping this helps you, Patrick

zahrabsh74 commented 7 months ago

Dear @PatrickTUM, thanks for your help. I changed the batch size and training started, but since I recently got access to a more powerful GPU server with more memory, I began training your code with the original parameters to reproduce the best results. I was surprised that each epoch took around 93 hours to train! I read your official paper but did not find any information about the training time, nor any details about the system and the number of GPUs you used. Could you please share this additional information about your training?

Ly403 commented 6 months ago

Maybe I can help with your problem. I recently trained this model and used the same dataset to train my own cloud removal model, and I found that most of the time was spent computing the cloud masks, namely in the get_cloud_map function. So you can use the pre-computed npy files provided by this repo to speed up training by setting the import_data_path parameter of the SEN12MSCRTS class. Hope this helps.
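
A rough sketch of what that could look like, assuming SEN12MSCRTS is importable from the repository's data loading module (the module path and all keyword names other than import_data_path are assumptions and may differ from the actual signature in the repo):

# hypothetical module path: adjust to wherever SEN12MSCRTS is defined in this repo
from data.dataLoader import SEN12MSCRTS

# keyword names other than import_data_path are illustrative assumptions
train_set = SEN12MSCRTS(
    root='/home/bada_za/data/uncrtaints/SEN12MSCR',   # dataset root, as in the command above
    split='train',
    import_data_path='/path/to/precomputed_masks',    # directory with the pre-computed npy files
)

With import_data_path set, the cloud masks are loaded from disk instead of being recomputed by get_cloud_map for every sample, which removes the main bottleneck described above.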