Closed: liuquangao closed this issue 7 months ago.
If I've got anything wrong about how I've understood your training setup or methods, I'd really appreciate it if you could set me straight.
I think the confusion might be that: i) we train on 8 GPUs and the batch_size parameter is per GPU, and ii) our aggregate_k_gradients setting accumulates what the repo logic counts as separate batches into a single batch from the optimizer's point of view, and it is that optimizer-level batch that we write about in the paper.
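Concretely, the bookkeeping works out roughly like the sketch below; the helper name and the example call are only illustrative, and the aggregate_k_gradients value in it is a placeholder rather than the actual final setting:

```python
def optimizer_level_numbers(batch_size_per_gpu, num_gpus, aggregate_k_gradients,
                            steps_per_epoch, epochs):
    """Translate repo-level config values into optimizer-level numbers,
    assuming (i) `batch_size` is per GPU and (ii) every
    `aggregate_k_gradients` consecutive repo batches are accumulated into
    a single optimizer update."""
    optimizer_batch_size = batch_size_per_gpu * num_gpus * aggregate_k_gradients
    repo_steps = steps_per_epoch * epochs
    optimizer_steps = repo_steps // aggregate_k_gradients
    # The total number of synthetic datasets seen is unchanged by accumulation.
    total_datasets = batch_size_per_gpu * num_gpus * repo_steps
    return optimizer_batch_size, optimizer_steps, total_datasets


# Example with the values quoted from PriorFittingCustomPrior.ipynb in the
# question; aggregate_k_gradients=1 is only a placeholder here.
print(optimizer_level_numbers(batch_size_per_gpu=64, num_gpus=8,
                              aggregate_k_gradients=1,
                              steps_per_epoch=128, epochs=400))
# -> (512, 51200, 26214400)
```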
Does that help?
Thank you very much. Now I get it.
Hello, I have recently been studying both the code and the paper you have published, focusing on the details provided in section E.3 regarding the training of the final model.
In section E.3, it is mentioned that the model was trained for 18,000 steps with a batch size of 512 datasets, totaling 9,216,000 synthetically generated datasets.
However, while reviewing PriorFittingCustomPrior.ipynb, I noticed a difference in the configuration settings, which suggests a total of 26,214,400 synthetically generated datasets based on the provided batch size and number of steps (51,200 * 512):
num_steps = 128
steps: 128 * 400 (epochs) = 51,200
batch size: 64 * 8 (GPUs) = 512
synthetically generated datasets: 51,200 * 512 = 26,214,400
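Written out as a quick check (this only restates the numbers quoted above):

```python
# Quick check of the two totals, using the values quoted above.
paper_total = 18_000 * 512               # section E.3: steps * batch size
notebook_total = (128 * 400) * (64 * 8)  # notebook: steps * effective batch size
print(paper_total, notebook_total)       # 9216000 26214400
```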
Would it be possible for you to share the precise training settings used for your final model? Understanding the exact parameters would greatly aid in aligning my research for a fair and meaningful comparison.
Best regards, liuquangao