Closed slouvan closed 6 months ago
Hi,
- Before I perform the experiment using
manner_module_mindsmall_plm_supconloss_bertsent_s42
I need to run training for the CR module first, correct?
Indeed, you have to train the CR-Module first, and also the A-Module if you want to use it.
- When running the training for the CR module (
manner_cr_module_mindsmall_plm_supconloss_bertsent
), one epoch seems to take around 3 hours:
Epoch 0/0 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/15529 0:00:08 • 2:57:10
Is this expected?
Yes, that's the average epoch time for the CR-Module.
- In the middle of the training run from step 2), it throws an error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.58 GiB total capacity; 14.02 GiB already allocated; 1.56 MiB free; 14.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I'm using a g4dn.12xlarge machine, which has 15 GB of memory per GPU. After decreasing the default batch size from 8 to 4, training runs, but the total training time will be significantly longer. What settings (machine type and config values) did you use to train the MaNNeR-CR model? Did you train it on multiple GPUs or not?
I haven't trained it using multiple GPUs, although I also tried that in the past and it speeds up training a bit. I ran my models on an NVIDIA A100 with 40 GB of memory per GPU for the MINDlarge dataset. If I remember correctly, for the MINDsmall dataset I ran them on a machine with an NVIDIA Tesla V100 32 GB GPU.
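As a side note, the OOM message itself points at `PYTORCH_CUDA_ALLOC_CONF` and `max_split_size_mb` as a fragmentation workaround. A minimal sketch of trying that (the `128` value is an illustrative assumption, not a setting from this repo; the variable must be set before PyTorch initializes CUDA):

```python
import os

# Set the allocator hint from the OOM message *before* PyTorch initializes CUDA.
# max_split_size_mb:128 is an illustrative value meant to reduce fragmentation;
# tune it (or drop it) based on your GPU and workload.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Equivalently, you can export the variable in the shell before launching the training script. This only mitigates fragmentation; if the model genuinely needs more than the 15 GB available, reducing the batch size (as you did) is still the reliable fix.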
Thanks Andreea, very helpful.