capreolus-ir / capreolus

A toolkit for end-to-end neural ad hoc retrieval
https://capreolus.ai
Apache License 2.0

How much GPU/time/other configs do I need for running the monoBERT MS Marco experiment? #174

Closed d1shs0ap closed 3 years ago

d1shs0ap commented 3 years ago

Received the following error:

```
validation: 26128it [5:55:37, 1.23it/s]
slurmstepd: error: *** JOB 10410402 ON cdr2651 CANCELLED AT 2021-08-15T11:17:10 DUE TO TIME LIMIT ***
```

with the following config:

```bash
#!/bin/bash
#SBATCH --job-name=msmarcopsg
#SBATCH --nodes=1
#SBATCH --gres=gpu:v100l:4
#SBATCH --ntasks-per-node=1
#SBATCH --mem=120GB
#SBATCH --time=48:00:00
#SBATCH --account=your_slurm_account
#SBATCH --cpus-per-task=32
#SBATCH -o ./output.log
```

I'm planning to change the config to request more time or more GPUs, but I'm not sure what the right numbers are. I've read https://github.com/capreolus-ir/capreolus/blob/feature/msmarco_psg/docs/reproduction/MS_MARCO.md and https://docs.computecanada.ca/wiki/Using_GPUs_with_Slurm, but I'm still unsure (I was thinking of changing the number of GPUs from 4 to 6).
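
For example, one adjustment I was considering is to keep the 4 GPUs and only raise the time limit (the 72-hour value below is just a guess on my part):

```bash
#SBATCH --gres=gpu:v100l:4   # unchanged
#SBATCH --time=72:00:00      # raised from 48:00:00
```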

crystina-z commented 3 years ago

4 v100l should be enough. It looks like it has already finished all the data preparation, training, and almost 6 hours of inference, so I would guess it should finish the validation within another few hours. But just in case, how long does the training take on your side?

d1shs0ap commented 3 years ago

```
2021-08-13 13:12:10,336 - INFO - capreolus.trainer.tensorflow.train - starting training from iteration 1/10
  0%|          | 0/3000 [00:00<?, ?it/s]
2021-08-13 13:33:30,680 - INFO - capreolus.trainer.tensorflow.train - iter=1 loss = 0.13787981867790222
```

Based on these two lines, I think one training iteration takes roughly 20 minutes. Does that sound about right?

nimasadri11 commented 3 years ago

@d1shs0ap that sounds about right for one iteration, I believe. However, the whole training takes 10 iterations, is that right @crystina-z?

[EDIT: If that's true, then wouldn't the whole run take 10 * (6 hrs + 20 minutes), because there are 10 iterations and each one takes 20 minutes to train plus 6 hours for inference?]
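
Back-of-envelope with those numbers, validating after every iteration would already blow past the 48-hour limit in the job script above:

```bash
# 10 iterations * (20 min training + ~6 h inference each), numbers taken from this thread
echo "$(( 10 * (20 + 6 * 60) )) minutes"   # 3800 minutes, i.e. roughly 63 hours
```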

crystina-z commented 3 years ago

@nimasadri11 ah, good reminder... to save time on inference, you could actually add the line `reranker.trainer.validatefreq=$validatefreq`, which I just added to the doc. Thanks for pointing it out!
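
For concreteness, a sketch of where that line goes (assuming the `rerank.train` entry point; the other MS MARCO options from the reproduction doc are omitted here):

```bash
# Sketch only: the remaining options from the reproduction doc go on the same command line;
# $validatefreq is assumed to be set earlier in the job script
python -m capreolus.run rerank.train with \
    reranker.trainer.validatefreq=$validatefreq
```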

Then 3.5 hours (20 min * 10 iters) sounds reasonable to me with 4 v100l, and I would guess the inference time in this case won't be more than 12 hours (probably just 7~9 hrs). @nimasadri11 when you say 6 hrs, does that mean the inference takes around 6 hrs on your side?

nimasadri11 commented 3 years ago

@crystina-z Yes, each validation run takes 6.5 hours on my end. I see, so adding `reranker.trainer.validatefreq=$validatefreq` and `validatefreq=$niters` will make it do only one validation run at the end of the whole (10 iterations of) training, instead of one at the end of each iteration.
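
In job-script terms, I take that to mean something like (variable names as used in this thread; the value is just the 10 iterations mentioned above):

```bash
niters=10              # total training iterations
validatefreq=$niters   # validate once after the final iteration instead of after every iteration
```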

crystina-z commented 3 years ago

@nimasadri11 thanks for the info! And yes, that's exactly what `reranker.trainer.validatefreq=$validatefreq validatefreq=$niters` does.