RozDavid / LanguageGroundedSemseg

Implementation for ECCV 2022 paper Language-Grounded Indoor 3D Semantic Segmentation in the Wild

Training time for pre-training and fine-tuning #17

Closed hanoonaR closed 1 year ago

hanoonaR commented 1 year ago

Hi Authors,

Thank you for sharing this amazing work! Could you please provide information on how many GPUs are used for the training and the training time for the pretraining and finetuning stage, individually?

Thank you.

RozDavid commented 1 year ago

Hey @hanoonaR,

I did most of my experiments on two A6000s, where both the pretraining and the finetuning stage take roughly 2 days to converge. If you have less GPU capacity, it should be fine to use gradient accumulation over multiple sub-batches, but in that case your training time will increase accordingly. A minimal sketch of what that could look like is below.
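A minimal sketch, assuming a standard PyTorch Lightning setup (the project trains with Lightning); the exact flags in the repo's own config scripts may differ:

```python
# Minimal sketch, assuming a standard PyTorch Lightning Trainer; the repo's own
# config flags may differ. accumulate_grad_batches sums gradients over N
# sub-batches before each optimizer step, emulating an N-times larger batch.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=1,                     # e.g. a single GPU instead of two A6000s
    accumulate_grad_batches=2,  # accumulate 2 sub-batches per optimizer step
)
# trainer.fit(model, datamodule)  # model/datamodule come from the repo's training script
```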

Hope this helps, David

hanoonaR commented 1 year ago

Hi @RozDavid,

Thank you for the prompt response. I would really appreciate it if you could provide information on a few more questions.

1) What BATCH_SIZE should be passed in the script if two GPUs are used for training? Also, is this batch size per GPU or the overall batch size?
2) What would be suitable hyperparameters to distribute the training over more GPUs (e.g. 8 GPUs) instead of the default 2? Should the batch size and learning rate be scaled linearly?
3) Are the numbers reported in the paper on the ScanNet v2 test set, or on the validation split? If they are test-set numbers, were the models trained on the combined train + val set?

Thank you.

RozDavid commented 1 year ago

Hey @hanoonaR,

Thanks for reaching out!

1) The batch size is the per-GPU batch size, following the PyTorch Lightning convention, so your effective batch size is num_gpus x batch_size. For the model weights I uploaded the complete checkpoints, which include the full set of hyperparameters so everyone can replicate the exact same results. Please check them here; after torch.load you can print all the parameters.
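A minimal sketch of inspecting such a checkpoint, assuming a standard Lightning checkpoint layout (the path below is a placeholder for one of the downloaded files):

```python
# Minimal sketch, assuming a standard PyTorch Lightning checkpoint; the path
# is a placeholder for one of the downloaded checkpoint files.
import torch

ckpt = torch.load("path/to/checkpoint.ckpt", map_location="cpu")
print(ckpt.keys())  # typically includes 'state_dict', 'hyper_parameters', ...

# Print every stored hyperparameter name/value pair.
for name, value in ckpt.get("hyper_parameters", {}).items():
    print(f"{name}: {value}")
```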

2) Generally speaking, for contrastive pretraining, the larger the batch size, the better the results you will obtain (within reason, of course). So if you have the compute available, I would strongly advise using all of your GPUs. For this project we didn't have the resources for a proper hyperparameter sweep at larger GPU counts, but I would expect the defaults to be a good starting point.
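On the learning-rate part of your question: the linear scaling rule is a common heuristic for larger effective batches, though we did not validate it for this project, so treat the sketch below as illustrative arithmetic only (both values are placeholders, not the repo's actual settings):

```python
# Illustrative arithmetic only, not a validated recommendation for this repo.
# Linear scaling rule: scale the learning rate with the effective batch size.
per_gpu_batch_size = 4          # placeholder; take the real value from the checkpoints
base_gpus, target_gpus = 2, 8
base_lr = 1e-2                  # placeholder base learning rate for the 2-GPU setup

effective_batch = per_gpu_batch_size * target_gpus   # 32 instead of 8
scaled_lr = base_lr * target_gpus / base_gpus         # 4e-2 under linear scaling
print(effective_batch, scaled_lr)
```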

3) The models behind the numbers in the paper were trained on the standard ScanNet train set and evaluated on the validation split, while the results on the benchmark page are test-set results, also from models trained only on the train set.

Hope this helps, but let me know if there is anything else!

Kind regards, David

hanoonaR commented 1 year ago

Thank you for the detailed answers.