Open marctessier opened 7 months ago
I was re-thinking about this, and the 1300 "steps" in this example is correct when "steps" is the unit of measurement! It makes sense! I was mixing things up between epochs and steps / checkpoints and just making noise for nothing. I also need to run a few more side tests using multi-GPU/node to see if things are equivalent / proportional.
The only question really is why, when we use the universal-2500000.ckpt vocoder, we are displaying 5M on synth: --v_ckpt\=5000000?
I think this is just because there are effectively two optimizers for GANs - one for the discriminators and one for the generator - so Lightning probably just adds the steps from both. We could divide by 2, but I don't think it's really necessary.
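The doubling should be easy to reproduce outside EveryVoice. A minimal, hypothetical sketch (a toy module, not the actual HiFi-GAN code), assuming a recent PyTorch Lightning where global_step counts optimizer.step() calls under manual optimization:

# Toy GAN module: under manual optimization, Lightning's global_step
# advances once per optimizer.step() call, so with a generator and a
# discriminator optimizer it ends up at 2x the number of batches.
import torch
import pytorch_lightning as pl

class ToyGAN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # GANs step optimizers by hand
        self.gen = torch.nn.Linear(8, 8)
        self.disc = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()

        # Discriminator update (optimizer step 1 of 2 for this batch)
        opt_d.zero_grad()
        d_loss = self.disc(self.gen(batch).detach()).mean()
        self.manual_backward(d_loss)
        opt_d.step()

        # Generator update (optimizer step 2 of 2 for this batch)
        opt_g.zero_grad()
        g_loss = -self.disc(self.gen(batch)).mean()
        self.manual_backward(g_loss)
        opt_g.step()
        # self.global_step is now 2 * (batch_idx + 1)

    def configure_optimizers(self):
        return (
            torch.optim.Adam(self.gen.parameters()),
            torch.optim.Adam(self.disc.parameters()),
        )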
so can this be closed then?
I think Marc's point is that the discrepancy in the naming is confusing and not obvious. If you come back later to your synthesized output and rely on the name to figure out which vocoder was used, then this is not helping.
This should be easy to fix.
In [14]: m["loops"]["fit_loop"]
Out[14]:
{'state_dict': {},
'epoch_loop.state_dict': {'_batches_that_stepped': 2500000},
'epoch_loop.batch_progress': {'total': {'ready': 2500000,
'completed': 2500000,
'started': 2500000,
'processed': 2500000},
'current': {'ready': 12400,
'completed': 12400,
'started': 12400,
'processed': 12400},
'is_last_batch': False},
'epoch_loop.scheduler_progress': {'total': {'ready': 240, 'completed': 240},
'current': {'ready': 0, 'completed': 0}},
'epoch_loop.batch_loop.state_dict': {},
'epoch_loop.batch_loop.optimizer_loop.state_dict': {},
'epoch_loop.batch_loop.optimizer_loop.optim_progress': {'optimizer': {'step': {'total': {'ready': 5000000,
'completed': 5000000},
'current': {'ready': 24800, 'completed': 24800}},
'zero_grad': {'total': {'ready': 5000000,
'completed': 5000000,
'started': 5000000},
'current': {'ready': 24800, 'completed': 24800, 'started': 24800}}},
'optimizer_position': 2},
'epoch_loop.batch_loop.manual_loop.state_dict': {},
'epoch_loop.batch_loop.manual_loop.optim_step_progress': {'total': {'ready': 0,
'completed': 0},
'current': {'ready': 0, 'completed': 0}},
'epoch_loop.val_loop.state_dict': {},
'epoch_loop.val_loop.dataloader_progress': {'total': {'ready': 120,
'completed': 120},
'current': {'ready': 1, 'completed': 1}},
'epoch_loop.val_loop.epoch_loop.state_dict': {},
'epoch_loop.val_loop.epoch_loop.batch_progress': {'total': {'ready': 0,
'completed': 0,
'started': 0,
'processed': 0},
'current': {'ready': 36856,
'completed': 36856,
'started': 36856,
'processed': 36856},
'is_last_batch': True},
'epoch_progress': {'total': {'ready': 121,
'completed': 120,
'started': 121,
'processed': 121},
'current': {'ready': 121,
'completed': 120,
'started': 121,
'processed': 121}}}
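So the 2.5M vs 5M split is right there in the checkpoint: _batches_that_stepped is 2,500,000 while the optimizer step total is 5,000,000. A quick sketch for pulling those counters out (key paths copied from the dump above; dividing by the two GAN optimizers is the proposed display fix, not current behaviour):

import torch

ckpt = torch.load("universal-2500000.ckpt", map_location="cpu")
fit_loop = ckpt["loops"]["fit_loop"]

batches_stepped = fit_loop["epoch_loop.state_dict"]["_batches_that_stepped"]
optim_progress = fit_loop["epoch_loop.batch_loop.optimizer_loop.optim_progress"]
optimizer_steps = optim_progress["optimizer"]["step"]["total"]["completed"]

print(batches_stepped)       # 2500000 -> matches the checkpoint filename
print(optimizer_steps)       # 5000000 -> what synth puts in v_ckpt
print(optimizer_steps // 2)  # 2500000, dividing by the two GAN optimizers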
I agree, it's confusing that if I set max steps to 100k, it stops after a checkpoint equal to 50k steps.
Had a question on Slack; copying it here to keep track / assign to @SamuelLarkin from the thread (thank you).
I have a question about how we are doing checkpointing (ckpt_epochs: 1) in EV. For example, I have a test where I set max_epochs: 100 and training is done. I generated an audio file using that last.ckpt, and this was the filename created after synth:
synthesis_output/wav/made-certain-recomme-9229d5cf--default--eng--ckpt\=1300--v_ckpt\=5000000--pred.wav
Question: is this a bug in everyvoice synthesize, where the file name should have been something like this instead:
made-certain-recomme-9229d5cf--default--eng--ckpt\=100--v_ckpt\=2500000--pred.wav
where ckpt should be 100 (or 99, depending on how you count...) and v_ckpt should be 2.5M, since I used our Universal Vocoder (where I thought we took it at the 2.5M checkpoint...)? Or should "ckpt=" be changed to "step=" when creating the file name on synth, to be more precise?
Sam also had this comment: "so I feel it is more that we don't have the proper definition of epoch, ckpt and step"
I agree with that, and I am also wondering about those numbers when using multi-GPU / multi-node training.
Sam was also wondering how many GPUs were used while training our Vocoder, and whether we are doing gradient accumulation (because we are seeing 2.5M and 5M in the model and are not sure which one is the right one to use...).
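As for the ckpt\=1300 part, the arithmetic is at least consistent with 100 epochs if the run had 13 batches per epoch (my assumption; the batch count isn't stated in the thread):

# Hypothetical numbers: 13 batches/epoch is inferred from 1300 / 100,
# it is not reported anywhere in this thread.
max_epochs = 100
batches_per_epoch = 13
steps = max_epochs * batches_per_epoch
print(steps)  # 1300 -> the ckpt value in the filename counts steps, not epochs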