d1shs0ap closed this 3 years ago
4 V100L GPUs should be enough. It looks like it has already finished all the data preparation, the training, and almost 6 hours of inference, so I would guess it should finish the validation within another few hours as well. But just in case, how long does the training take on your side?
2021-08-13 13:12:10,336 - INFO - capreolus.trainer.tensorflow.train - starting training from iteration 1/10
0%| | 0/3000 [00:00<?, ?it/s]
2021-08-13 13:33:30,680 - INFO - capreolus.trainer.tensorflow.train - iter=1 loss = 0.13787981867790222
Based on these two lines I think training takes roughly 20 minutes. Does that sound about right?
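To double-check that estimate, here is a small sketch that subtracts the two timestamps copied from the log lines above. The format string is my own reading of the log layout (the comma separates milliseconds), not something taken from Capreolus.

```python
from datetime import datetime

# Timestamps copied verbatim from the two log lines above.
# The comma separates seconds from milliseconds, which %f can parse.
FMT = "%Y-%m-%d %H:%M:%S,%f"
start = datetime.strptime("2021-08-13 13:12:10,336", FMT)
end = datetime.strptime("2021-08-13 13:33:30,680", FMT)

elapsed = end - start
minutes = elapsed.total_seconds() / 60
print(f"first training iteration took about {minutes:.1f} minutes")
```

That comes out a little over 21 minutes, so "roughly 20 minutes" per iteration holds up.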
@d1shs0ap that sounds about right for one round, I believe. However, the whole training takes 10 iterations. Is that right, @crystina-z?
[EDIT: If that's true, then wouldn't the whole training take 10 * (6 hrs + 20 minutes), because there are 10 iters and each iter takes 20 minutes to train and 6 hours for inference?]
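Spelling out the arithmetic from the EDIT, using the 20 minutes of training and 6 hours of inference per iteration quoted there (both rough figures from this thread, not measurements of the full run):

```python
from datetime import timedelta

n_iters = 10                            # iterations in the whole training
train_per_iter = timedelta(minutes=20)  # rough figure from the log timestamps
validation_run = timedelta(hours=6)     # rough figure quoted in this thread

# One validation run after every training iteration.
total = n_iters * (train_per_iter + validation_run)
hours = total.total_seconds() / 3600
print(f"estimated total: {hours:.1f} hours")
```

So validating after every iteration would put the job at a bit over 63 hours.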
@nimasadri11 ah, good reminder... To save time on inference, you could add the line reranker.trainer.validatefreq=$validatefreq, which I just added to the doc. Thanks for pointing it out!
Then roughly 3.5 hours (20 min * 10 iters) sounds reasonable to me with 4 V100L GPUs, and I would guess the inference time in this case won't be more than 12 hours (probably just 7~9 hrs). @nimasadri11 when you say 6 hrs, do you mean the inference takes about 6 hrs on your side?
@crystina-z Yes. Each validation run takes 6.5 hours on my end. I see, so adding reranker.trainer.validatefreq=$validatefreq and validatefreq=$niters will make it do only one validation run at the end of the whole (10 rounds of) training, instead of one at the end of each training round.
@nimasadri11 thanks for the info! And yes, that's exactly what reranker.trainer.validatefreq=$validatefreq validatefreq=$niters does.
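A rough sketch of how much that setting changes the total, using the 20 minutes per training iteration and 6.5 hours per validation run reported in this thread. It assumes validation fires whenever the iteration number is a multiple of validatefreq; that rule is my assumption about the trainer, not checked against the Capreolus source, and the helper names are hypothetical.

```python
from datetime import timedelta
from typing import List


def validation_iters(n_iters: int, validate_freq: int) -> List[int]:
    """Iterations after which validation runs.

    ASSUMPTION: validation fires when the 1-based iteration number is a
    multiple of validate_freq. This is not taken from the Capreolus code.
    """
    return [i for i in range(1, n_iters + 1) if i % validate_freq == 0]


def total_time(n_iters: int, validate_freq: int,
               train_per_iter: timedelta, validation_run: timedelta) -> timedelta:
    """Training time for all iterations plus time for every validation run."""
    runs = len(validation_iters(n_iters, validate_freq))
    return n_iters * train_per_iter + runs * validation_run


train = timedelta(minutes=20)           # per-iteration figure from this thread
valid = timedelta(hours=6, minutes=30)  # per-validation figure from this thread

every_iter = total_time(10, 1, train, valid)   # validatefreq=1
end_only = total_time(10, 10, train, valid)    # validatefreq=$niters
print("validate every iteration:", every_iter)
print("validate once at the end:", end_only)
```

Under those assumptions the job drops from about 68 hours to just under 10, not counting data preparation.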
Received the following error
validation: 26128it [5:55:37, 1.23it/s]
slurmstepd: error: *** JOB 10410402 ON cdr2651 CANCELLED AT 2021-08-15T11:17:10 DUE TO TIME LIMIT ***
with the following config:
I'm planning to change the config to request more time or more GPUs, but I'm not sure what the right numbers are. I've read https://github.com/capreolus-ir/capreolus/blob/feature/msmarco_psg/docs/reproduction/MS_MARCO.md and https://docs.computecanada.ca/wiki/Using_GPUs_with_Slurm, but I am still unsure (I was thinking of changing the number of GPUs from 4 to 6).
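For picking a time limit, one option is to add up the figures reported in this thread and format the result for Slurm. The sketch below assumes 10 training iterations at about 20 minutes each plus a single 6.5-hour validation run, all on 4 V100L GPUs. The 1.25 safety factor and the slurm_time helper are hypothetical, and data preparation time is not included, so treat the output as a lower bound and not a recommendation.

```python
from datetime import timedelta


def slurm_time(duration: timedelta) -> str:
    """Format a duration as D-HH:MM:SS, one of the forms Slurm accepts for --time."""
    total = int(duration.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


training = 10 * timedelta(minutes=20)        # 10 iterations, figure from this thread
validation = timedelta(hours=6, minutes=30)  # one validation run, figure from this thread
margin = 1.25                                # hypothetical safety factor

budget = (training + validation) * margin
print(slurm_time(budget))
```

The resulting string can be passed to the job script's time limit. Whether going from 4 to 6 GPUs shortens validation enough to matter is not something these numbers can answer.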