Question on Training MaNNer model

slouvan commented 6 months ago

Hi, thank you for the effort of creating this terrific library. I have several questions on how to run training for the MaNNeR model.

1) Before I run the experiment using manner_module_mindsmall_plm_supconloss_bertsent_s42, I need to train the CR module first, correct?

2) When running the training for the CR module (manner_cr_module_mindsmall_plm_supconloss_bertsent), it seems that one epoch takes around 3 hours:

Epoch 0/0 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/15529 0:00:08 • 2:57:10

Is this expected?

3) In the middle of the training from step 2), it throws an error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.58 GiB total capacity; 14.02 GiB already allocated; 1.56 MiB free; 14.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I'm using a g4dn.12xlarge machine, which has 15 GB of memory per GPU. After decreasing the default batch size from 8 to 4, it seems to be running now, but the total training time will be significantly longer. I wonder what settings (type of machine and config values) you used to train the MaNNer-CR model? Did you train it using multi-GPU or not?

andreeaiana commented 6 months ago

Hi,

Before I perform the experiment using manner_module_mindsmall_plm_supconloss_bertsent_s42 I need to run training for the CR module first, correct?

Indeed, you have to train the CR-Module first, and also the A-Module if you want to use the A-Modules.

When running the training for the CR module (manner_cr_module_mindsmall_plm_supconloss_bertsent), it seems for 1 epoch it will take around 3 hours:

Epoch 0/0 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/15529 0:00:08 • 2:57:10

Is this expected?

Yes, that's the average epoch time for the CR-Module.
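As a rough sanity check, the progress bar quoted above (12 of 15529 steps after 8 seconds) can be extrapolated to a full epoch. The sketch below is only illustrative arithmetic: the helper names are hypothetical, and it assumes the per-step time stays constant over the epoch, which the first few steps may not reflect.

```python
def estimate_epoch_seconds(steps_done, elapsed_seconds, total_steps):
    """Linearly extrapolate the epoch duration from the steps completed so far."""
    seconds_per_step = elapsed_seconds / steps_done
    return seconds_per_step * total_steps


def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS."""
    seconds = int(round(seconds))
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Values taken from the progress bar in the question.
estimate = estimate_epoch_seconds(steps_done=12, elapsed_seconds=8, total_steps=15529)
print(format_hms(estimate))  # 2:52:33

# Assumption: 15529 steps corresponds to the default batch size of 8,
# so halving the batch size to 4 roughly doubles the number of steps per epoch.
steps_at_batch_size_4 = 15529 * 8 // 4
print(steps_at_batch_size_4)  # 31058
```

The extrapolated value is in the same range as the roughly three hours shown by the progress bar. How the per-step time changes with the smaller batch size is not something this arithmetic can predict.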

In the middle of training from step 2) it will throw an error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.58 GiB total capacity; 14.02 GiB already allocated; 1.56 MiB free; 14.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I'm using g4dn.12xlarge machine which has 15 GB memory per GPU. I'm decreasing the default batch size from 8 to 4 it seems running now but the total training time will be significantly longer. I wonder what settings (type of machine and config values) that you use to train the MaNNer-CR model? Did you train it using multi-GPU or not?

I haven't trained it using multiple GPUs, although I tried that in the past and it speeds up training a bit. I ran my models on an NVIDIA A100 with 40 GB per GPU for the MINDlarge dataset. If I remember correctly, for the MINDsmall dataset I ran them on a machine with an NVIDIA Tesla V100 32 GB GPU.
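The error message quoted above suggests max_split_size_mb only when reserved memory is much larger than allocated memory. A minimal sketch for checking that condition from the numbers in the message follows; the parse_gib helper is hypothetical and assumes the exact message format quoted in this thread, which may differ between PyTorch versions.

```python
import re

# Error text copied from the question above.
OOM_MESSAGE = (
    "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB "
    "(GPU 0; 14.58 GiB total capacity; 14.02 GiB already allocated; 1.56 MiB free; "
    "14.44 GiB reserved in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def parse_gib(message, label):
    """Return the amount (in GiB) that precedes `label` in the error message."""
    match = re.search(r"([\d.]+) (MiB|GiB) " + re.escape(label), message)
    if match is None:
        raise ValueError(f"could not find '{label}' in the message")
    value, unit = match.groups()
    return float(value) * UNIT_TO_GIB[unit]


total = parse_gib(OOM_MESSAGE, "total capacity")
allocated = parse_gib(OOM_MESSAGE, "already allocated")
reserved = parse_gib(OOM_MESSAGE, "reserved in total")

# Memory reserved by PyTorch's caching allocator but not backing any tensor.
slack = reserved - allocated
print(f"reserved - allocated: {slack:.2f} GiB")
print(f"allocated / total:    {allocated / total:.1%}")
```

Here the gap between reserved and allocated memory is only about 0.42 GiB, while allocated tensors already fill roughly 96% of the card. Under those numbers, fragmentation is probably not the main issue, which is consistent with the run only fitting after the batch size was reduced and with the larger 32 GB and 40 GB GPUs mentioned above.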

slouvan commented 6 months ago

Thanks Andreea, very helpful.

andreeaiana / newsreclib

Question on Training MaNNer model #15