Closed — linhaojia13 closed this issue 1 year ago.
We just provide a sample of the code to show how it works.
Specifically, train_it.sh sets `micro_batch_size=4`, `nproc_per_node=8`, `nnodes=1`, and `gradient_accumulation_steps=1`, which results in a `global_batch_size` of 32, rather than the 256 mentioned in the paper. To replicate the results from the paper, should I adjust `gradient_accumulation_steps` and `micro_batch_size` to align with the `global_batch_size` mentioned in the paper, or should I directly use the train_it.sh script that you have released?
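For reference, here is a minimal sketch of the arithmetic, assuming the standard data-parallel formula (global batch = micro batch × processes per node × nodes × gradient accumulation steps). The variable names mirror those in train_it.sh, but setting `gradient_accumulation_steps=8` is just one illustrative way to reach 256 on the same 8 GPUs, not a confirmed setting from the authors:

```bash
# Global batch size under standard data parallelism:
#   global_batch_size = micro_batch_size * nproc_per_node * nnodes * gradient_accumulation_steps
micro_batch_size=4
nproc_per_node=8
nnodes=1
gradient_accumulation_steps=1
echo $(( micro_batch_size * nproc_per_node * nnodes * gradient_accumulation_steps ))  # 32, as shipped

# One hypothetical way to reach the paper's 256 without changing hardware:
gradient_accumulation_steps=8
echo $(( micro_batch_size * nproc_per_node * nnodes * gradient_accumulation_steps ))  # 256
```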
Sorry to bother you, my good friend. Did you figure out how it works? I am now facing the same problem 😢.