EveryVoiceTTS / EveryVoice

The EveryVoice TTS Toolkit - Text To Speech for your language
https://docs.everyvoice.ca

everyvoice synthesize default file name (ckpt or step?) #411

Open marctessier opened 4 months ago

marctessier commented 4 months ago

Had a question on Slack, copying it here to keep track / assign to @SamuelLarkin from the thread (thank you).


I have a question about how we are doing checkpointing (ckpt_epochs: 1) in EveryVoice. For example, I have a test where I set max_epochs: 100 and training has finished. The logs look like this:

```
(EveryVoice) [U20-GPSC7]:$ ll logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/
total 1.5G
-rw-r----- 1 tes001 nrc_ict 209M Apr 29 16:18 'epoch=87-step=1144.ckpt'
-rw-r----- 1 tes001 nrc_ict 209M Apr 29 16:18 'epoch=93-step=1222.ckpt'
-rw-r----- 1 tes001 nrc_ict 209M Apr 29 16:18 'epoch=96-step=1261.ckpt'
-rw-r----- 1 tes001 nrc_ict 209M Apr 29 16:18 'epoch=97-step=1274.ckpt'
-rw-r----- 1 tes001 nrc_ict 209M Apr 29 16:18 'epoch=99-step=1300.ckpt'
-rw-r----- 1 tes001 nrc_ict 209M Apr 29 16:18 'epoch=99-step=1300-v1.ckpt'
-rw-r----- 1 tes001 nrc_ict 209M Apr 29 16:18  last.ckpt
```
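
As an aside, the step counts in these checkpoint names are all consistent with 13 optimizer steps per epoch, which would mean epochs are 0-indexed and the step count is taken at the end of the epoch. A quick check, using the numbers from the listing above:

```python
# Checkpoint names above follow "epoch=N-step=S"; with 13 steps per epoch,
# 0-indexed epochs, and checkpoints saved at epoch end, S == (N + 1) * 13.
for epoch, step in [(87, 1144), (93, 1222), (96, 1261), (97, 1274), (99, 1300)]:
    assert step == (epoch + 1) * 13
```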

I generated an audio file using that last.ckpt, and this was the filename created after synthesis:

synthesis_output/wav/made-certain-recomme-9229d5cf--default--eng--ckpt\=1300--v_ckpt\=5000000--pred.wav

Question: is it a bug in everyvoice synthesize, where that file name should have been something like this instead: made-certain-recomme-9229d5cf--default--eng--ckpt\=100--v_ckpt\=2500000--pred.wav

That is, ckpt should be 100 (or 99, depending on how you count...), and v_ckpt should be 2.5M, since I used our Universal Vocoder (where I thought we took it at the 2.5M checkpoint...). Or should "ckpt=" be changed to "step=" instead when creating the file name at synthesis time, to be more precise?

Sam also had this comment: "so I feel it is more that we don't have the proper definition of epoch, ckpt and step".

I agree, and I am also wondering about those numbers when training with multiple GPUs / nodes.

Sam was also wondering how many GPUs were used while training our vocoder, and whether we are doing gradient accumulation (because we are seeing both 2.5M and 5M in the model, and we are not sure which is the right one to use...).
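
For reference, the two counters in question can be read straight off a checkpoint. A minimal sketch, assuming a standard PyTorch Lightning checkpoint (the path is just an example; "epoch" and "global_step" are the usual top-level keys):

```python
import torch

# Inspect the training counters stored in a Lightning checkpoint.
ckpt = torch.load("last.ckpt", map_location="cpu")
print(ckpt["epoch"])        # 0-indexed epoch, e.g. 99   -> "epoch=99" in the name
print(ckpt["global_step"])  # optimizer steps, e.g. 1300 -> "step=1300" in the name
```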

marctessier commented 4 months ago

I was re-thinking this: the 1300 in this example is correct when "steps" is the unit of measurement! It makes sense. I was mixing up epochs, steps, and checkpoints, and just making noise for nothing. I also need to run a few more side tests using multiple GPUs/nodes to see if things are equivalent / proportional.

The only remaining question is: why, when we use the universal-2500000.ckpt vocoder, does synthesis display 5M (--v_ckpt\=5000000)?

roedoejet commented 4 months ago

> The only remaining question is: why, when we use the universal-2500000.ckpt vocoder, does synthesis display 5M (--v_ckpt\=5000000)?

I think this is just because there are effectively two optimizers for GANs, one for the discriminators and one for the generator, so Lightning probably just adds them both. We could divide by 2, but I don't think it's really necessary.
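
For illustration, here is a minimal toy sketch (not EveryVoice's actual module) of the pattern the checkpoint's optimizer_loop counters suggest: pre-2.0 Lightning automatic optimization with two optimizers, where every optimizer step advances global_step, so it ends up at twice the number of batches:

```python
import torch
from torch import nn
import pytorch_lightning as pl


class TinyGAN(pl.LightningModule):
    """Hypothetical stand-in for a GAN module, for illustration only."""

    def __init__(self):
        super().__init__()
        self.generator = nn.Linear(8, 8)
        self.discriminator = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx, optimizer_idx):
        # With two optimizers, Lightning (< 2.0, automatic optimization)
        # calls this once per optimizer per batch and advances
        # trainer.global_step on every optimizer step, so global_step
        # reaches 2x the batch count (5M steps for 2.5M batches).
        if optimizer_idx == 0:
            return self.generator(batch).mean()   # stand-in generator loss
        return self.discriminator(batch).mean()   # stand-in discriminator loss

    def configure_optimizers(self):
        # One optimizer for the generator, one for the discriminator(s).
        return [
            torch.optim.Adam(self.generator.parameters()),
            torch.optim.Adam(self.discriminator.parameters()),
        ]
```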

roedoejet commented 4 months ago

so can this be closed then?

SamuelLarkin commented 3 months ago

I think Marc's point is that the discrepancy in the naming is confusing and not obvious. If you come back later to your synthesized output and rely on the name to figure out which vocoder was used, then this is not helping.

This should be easy to fix.
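
One possible shape for the fix, as a hypothetical helper (not EveryVoice's actual code): divide the vocoder's global_step by the number of optimizers before putting it in the filename, so it matches the number in the checkpoint's own name:

```python
def vocoder_steps_for_filename(global_step: int, num_optimizers: int = 2) -> int:
    """Convert Lightning's global_step (which counts every optimizer step)
    back to per-optimizer steps, so the synthesis filename would report
    2500000 for universal-2500000.ckpt instead of 5000000."""
    return global_step // num_optimizers


assert vocoder_steps_for_filename(5_000_000) == 2_500_000
```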

SamuelLarkin commented 3 months ago

Notes

In [14]: m["loops"]["fit_loop"]
Out[14]:
{'state_dict': {},
 'epoch_loop.state_dict': {'_batches_that_stepped': 2500000},
 'epoch_loop.batch_progress': {'total': {'ready': 2500000,
   'completed': 2500000,
   'started': 2500000,
   'processed': 2500000},
  'current': {'ready': 12400,
   'completed': 12400,
   'started': 12400,
   'processed': 12400},
  'is_last_batch': False},
 'epoch_loop.scheduler_progress': {'total': {'ready': 240, 'completed': 240},
  'current': {'ready': 0, 'completed': 0}},
 'epoch_loop.batch_loop.state_dict': {},
 'epoch_loop.batch_loop.optimizer_loop.state_dict': {},
 'epoch_loop.batch_loop.optimizer_loop.optim_progress': {'optimizer': {'step': {'total': {'ready': 5000000,
     'completed': 5000000},
    'current': {'ready': 24800, 'completed': 24800}},
   'zero_grad': {'total': {'ready': 5000000,
     'completed': 5000000,
     'started': 5000000},
    'current': {'ready': 24800, 'completed': 24800, 'started': 24800}}},
  'optimizer_position': 2},
 'epoch_loop.batch_loop.manual_loop.state_dict': {},
 'epoch_loop.batch_loop.manual_loop.optim_step_progress': {'total': {'ready': 0,
   'completed': 0},
  'current': {'ready': 0, 'completed': 0}},
 'epoch_loop.val_loop.state_dict': {},
 'epoch_loop.val_loop.dataloader_progress': {'total': {'ready': 120,
   'completed': 120},
  'current': {'ready': 1, 'completed': 1}},
 'epoch_loop.val_loop.epoch_loop.state_dict': {},
 'epoch_loop.val_loop.epoch_loop.batch_progress': {'total': {'ready': 0,
   'completed': 0,
   'started': 0,
   'processed': 0},
  'current': {'ready': 36856,
   'completed': 36856,
   'started': 36856,
   'processed': 36856},
  'is_last_batch': True},
 'epoch_progress': {'total': {'ready': 121,
   'completed': 120,
   'started': 121,
   'processed': 121},
  'current': {'ready': 121,
   'completed': 120,
   'started': 121,
   'processed': 121}}}
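
These counters can be pulled straight out of the checkpoint. A sketch of how m above would be obtained, and which keys carry the 2.5M vs. 5M numbers (the path is an example; the keys are taken from the dump above):

```python
import torch

# Load the vocoder checkpoint on CPU just to inspect its fit-loop counters.
m = torch.load("universal-2500000.ckpt", map_location="cpu")
fit_loop = m["loops"]["fit_loop"]

# Batches that actually stepped an optimizer: 2500000 (the checkpoint's name).
batches = fit_loop["epoch_loop.state_dict"]["_batches_that_stepped"]

# Optimizer steps summed over both GAN optimizers: 5000000 (what synthesis reports).
optim_steps = fit_loop["epoch_loop.batch_loop.optimizer_loop.optim_progress"][
    "optimizer"
]["step"]["total"]["completed"]

print(batches, optim_steps)  # 2500000 5000000
```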