stas00 opened 2 years ago
I want to work on this
From what I see, this has a lot of intricacies: ramp_up_batch_size, curriculum_learning (changing sequence length), so my suggestion is to compute it as `time_per_token * (total_tokens - consumed_tokens) / 86400`.
However, I am not sure how to get the total number of tokens.
Also, I see an argument `--train-tokens`, but I don't think it is used anywhere (I could be wrong).
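A minimal sketch of the token-based estimate suggested above. This is not the actual Megatron-DeepSpeed code; the function and its parameters (`time_per_iter_s`, `tokens_per_iter`, etc.) are hypothetical names. Counting tokens rather than iterations stays meaningful under batch-size ramp-up and curriculum learning, since both only change how many tokens an iteration consumes:

```python
# Hypothetical sketch: ETA in days from token throughput.
# All names here are assumptions, not existing trainer variables.
def eta_days(time_per_iter_s, tokens_per_iter, total_tokens, consumed_tokens):
    """Estimate remaining days: seconds-per-token * tokens left / 86400."""
    time_per_token = time_per_iter_s / tokens_per_iter
    remaining_tokens = max(total_tokens - consumed_tokens, 0)
    return time_per_token * remaining_tokens / 86400

# e.g. 2 s/iter, bs=512 at seqlen=2048, 300B total tokens, 100B consumed
print(round(eta_days(2.0, 2048 * 512, 300e9, 100e9), 1))
```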
I have to calculate the ETA for finishing training often enough that I think it should be a feature.
How about we log the ETA along with `elapsed time per iteration`?

This is just the current `elapsed_time_per_iteration * (total_iterations - current_iteration) / (60*60*24)` (I think days is the most practical unit).

I don't remember if we have `total_iterations`; perhaps then `total_samples - consumed_samples`, divided by `bs`.

But given that we support Curriculum Learning, it should be `total_tokens - consumed_tokens`, divided by `seqlen*bs`.
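The three variants above can be sketched as follows. This is a hedged illustration, not the project's code; names like `total_iterations`, `consumed_samples`, and `bs` are assumptions about what the trainer could expose:

```python
# Sketch of the three ETA formulas discussed above (all names assumed).
SECONDS_PER_DAY = 60 * 60 * 24

def eta_days_iters(elapsed_time_per_iteration, total_iterations, current_iteration):
    # simplest case: we know the total iteration count
    return elapsed_time_per_iteration * (total_iterations - current_iteration) / SECONDS_PER_DAY

def eta_days_samples(elapsed_time_per_iteration, total_samples, consumed_samples, bs):
    # remaining iterations = remaining samples / batch size
    return elapsed_time_per_iteration * (total_samples - consumed_samples) / bs / SECONDS_PER_DAY

def eta_days_tokens(elapsed_time_per_iteration, total_tokens, consumed_tokens, seqlen, bs):
    # with curriculum learning, count tokens instead of samples
    return elapsed_time_per_iteration * (total_tokens - consumed_tokens) / (seqlen * bs) / SECONDS_PER_DAY
```

Note that `eta_days_tokens` only holds while `seqlen` and `bs` stay fixed; under an active ramp-up the per-iteration token count changes, which is why a throughput-based estimate is more robust.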
Could also keep a running average for, say, the last 10-100 iterations or something of the sort. Note that the first iteration on resume is usually much slower than the rest. Also, during BS ramp-up things are quite inefficient, so history is not a good indicator of the future. But even estimating based on the last iteration alone is fine, since that number is usually very steady within the same run. It will, however, change from run to run since JZ is inconsistent.
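One way to implement the running average, as a rough sketch (the `IterTimer` class and its knobs are made up for illustration). It keeps a sliding window of iteration times and drops the first post-resume iteration, which the note above flags as an outlier:

```python
from collections import deque

# Hypothetical sketch: sliding-window average of iteration times,
# skipping the first iteration after resume (usually much slower).
class IterTimer:
    def __init__(self, window=100, skip_first=1):
        self.times = deque(maxlen=window)  # old entries fall off automatically
        self.skip = skip_first

    def record(self, seconds):
        if self.skip > 0:
            self.skip -= 1  # ignore the slow first iteration after resume
            return
        self.times.append(seconds)

    def avg(self):
        return sum(self.times) / len(self.times) if self.times else None

t = IterTimer(window=3)
for s in (30.0, 2.0, 2.2, 2.1, 1.9):  # first value is the resume outlier
    t.record(s)
print(t.avg())
```

This does not address the BS ramp-up inefficiency; for that, the token-throughput estimate is the safer basis.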
Could also log it separately only once every 50 or 100 iterations, since logging it on every iteration could be too noisy. Not sure.
Those are just different ideas; any of them would be better than manual calculation.
Hope someone will be interested in experimenting and making a PR.
Thank you!
p.s. if you have seen the current log, it looks like this (the token counters aren't logged and only go to TB, but they are there).