rlenain opened this issue 1 month ago · Open
specify `--checkout`, but the step/epoch will start from 0
do you mean `--checkpoint`?
> do you mean `--checkpoint`?

I've tried that, but it doesn't work. I think the LR scheduler being different makes the training loss go kind of crazy, and in the end it doesn't actually return to the same place.
Hello, have you solved this problem? I also specified `--checkpoint`, and the learning rate also started from 0.
I think the author will update the code for this issue; in the meantime, here is a temporary workaround you can refer to.
1. When starting a new training session, change the path of the last checkpoint in `run.sh`. Example:

   `--checkpoint CosyVoice/examples/libritts/cosyvoice/exp/cosyvoice/llm/torch_ddp/epoch_2_whole.pt \`
2. Edit a little code in `CosyVoice/cosyvoice/bin/train.py`:

```python
# Save init checkpoints
info_dict = deepcopy(configs['train_conf'])
save_model(model, 'init', info_dict)
current_epoch = info_dict['current_epoch']  # add
current_step = info_dict['current_step']  # add

# Get executor
executor = Executor()

# Start training loop
for epoch in range(current_epoch, info_dict['max_epoch']):  # change
    executor.epoch = epoch
    executor.step = current_step  # add
    train_dataset.set_epoch(epoch)
    dist.barrier()
    group_join = dist.new_group(backend="gloo", timeout=datetime.timedelta(seconds=args.timeout))
    executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, group_join)
    dist.destroy_process_group(group_join)
```
3. Add the config in `CosyVoice/examples/libritts/cosyvoice/conf/cosyvoice.yaml`.

   Note: `current_epoch` and `current_step` must correspond to the checkpoint you set in the `run.sh` file.
Example:

```yaml
train_conf:
    optim: adam
    optim_conf:
        lr: 0.001 # change to 1e-5 during sft
    scheduler: warmuplr # change to constantlr during sft
    scheduler_conf:
        warmup_steps: 2500
    max_epoch: 200
    grad_clip: 5
    accum_grad: 2
    log_interval: 100
    save_per_step: 1500
    current_epoch: 2 # add
    current_step: 1311 # add
```
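For reference, a minimal sketch of how the two new keys could be read defensively in `train.py`; this is an assumption on my part, not code from the repo, so that configs which do not define `current_epoch`/`current_step` still start from zero:

```python
# Hypothetical variant of the snippet above: fall back to 0 when the keys
# are missing, so unmodified configs keep working.
current_epoch = info_dict.get('current_epoch', 0)
current_step = info_dict.get('current_step', 0)
```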
Thank you. After applying the changes you provided, the checkpoint loads normally.
I thought the above code would make `step` go back to `current_step` when starting a new epoch. I modified it a bit:
```python
# Save init checkpoints
info_dict = deepcopy(configs['train_conf'])
save_model(model, 'init', info_dict)

# ADD
current_epoch = info_dict['current_epoch']
current_step = info_dict['current_step']
start_session = True

# Get executor
executor = Executor()

# Start training loop
for epoch in range(current_epoch, info_dict['max_epoch']):  # change
    executor.epoch = epoch
    if start_session:
        executor.step = current_step  # add
    train_dataset.set_epoch(epoch)
    dist.barrier()
    group_join = dist.new_group(backend="gloo", timeout=datetime.timedelta(seconds=args.timeout))
    executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, group_join)
    dist.destroy_process_group(group_join)
    start_session = False
```
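An equivalent, slightly simpler variant (just a sketch of the same idea, not code from the repo) would be to restore the step counter once before the loop, since `executor.step` keeps advancing across epochs afterwards:

```python
# Hypothetical simplification: restore the step counter once, before the loop,
# instead of guarding with a start_session flag inside it.
executor.step = current_step

for epoch in range(current_epoch, info_dict['max_epoch']):
    executor.epoch = epoch
    ...
```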
Thanks, you are right. Do I need to modify the `current_epoch` and `current_step` values in `cosyvoice.fromscratch.yaml` every time I resume training from a checkpoint?
Yes, you need to modify the `current_epoch` and `current_step` values in `cosyvoice.fromscratch.yaml` every time you resume training from a checkpoint. You can see `current_epoch` and `current_step` in the checkpoint filename. Example: `epoch_1_step_33000.pt`
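If editing the YAML by hand every time gets tedious, the two values could also be parsed straight from the checkpoint filename. The helper below is hypothetical (it is not in the repo) and only assumes the `epoch_1_step_33000.pt` naming shown above:

```python
import re

def parse_checkpoint_name(path: str) -> tuple[int, int]:
    """Extract (epoch, step) from a name like 'epoch_1_step_33000.pt'."""
    m = re.search(r'epoch_(\d+)_step_(\d+)\.pt$', path)
    if m is None:
        # Fall back to starting from scratch when the name does not match.
        return 0, 0
    return int(m.group(1)), int(m.group(2))

# Example:
# parse_checkpoint_name('exp/cosyvoice/llm/torch_ddp/epoch_1_step_33000.pt') -> (1, 33000)
```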
Hello, sorry to bother you again. I want to confirm whether the learning rate decays normally after restoring the checkpoint with the code above. After resuming according to your code, the restored step count is correct, but the learning rate still starts warming up again.
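One likely cause is that only `executor.step` is restored, while the LR scheduler's internal counter still starts at 0, so the warmup phase replays. A possible fix, sketched here as an assumption and not verified against the repo, is to also advance the scheduler to the restored step right after it is built. The wenet-style `WarmupLR` scheduler exposes a `set_step()` method for this; if the scheduler class used here does not, stepping it `current_step` times has the same effect:

```python
# Hypothetical addition after the checkpoint is loaded and the scheduler is built:
# fast-forward the LR schedule so warmup does not restart from step 0.
if hasattr(scheduler, 'set_step'):
    scheduler.set_step(current_step)  # wenet-style WarmupLR supports this
else:
    for _ in range(current_step):
        scheduler.step()
```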
Hello,
I was wondering whether there was an easy way to restart training from a checkpoint, resuming the steps, epochs, optimizer state, etc. This is in case training dies but we want to restart training not from epoch 0.
Thanks