Hi, you can modify `every_n_train_steps` in config.yaml:
```yaml
metrics_over_trainsteps_checkpoint:
  target: pytorch_lightning.callbacks.ModelCheckpoint
  params:
    filename: '{epoch}-{step}'
    save_weights_only: True
    every_n_train_steps: 10000 # change this
```
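For reference, these params map one-to-one onto the standard `pytorch_lightning.callbacks.ModelCheckpoint`; a minimal sketch of the same settings in plain Python (the `dirpath` here is just a placeholder for illustration, not the repo's actual save path):

```python
# Minimal sketch of the same checkpoint settings in plain PyTorch Lightning.
# The dirpath below is a placeholder for illustration, not the repo's path.
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(
    dirpath="some_dir/trainstep_checkpoints",  # placeholder
    filename="{epoch}-{step}",
    save_weights_only=True,
    every_n_train_steps=10000,  # write a checkpoint every 10000 optimizer steps
)
# Pass it to the Trainer via callbacks=[checkpoint_cb].
```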
Yes, I did. I reduced the value so that training ended earlier. I'm still running tests, but the model still hasn't been saved.
```yaml
lightning:
  precision: 16
  trainer:
    benchmark: True
    accumulate_grad_batches: 2
    max_steps: 250
    log_every_n_steps: 50
    # val
    val_check_interval: 0.5
    gradient_clip_algorithm: 'norm'
    gradient_clip_val: 0.5
  callbacks:
    model_checkpoint:
      target: pytorch_lightning.callbacks.ModelCheckpoint
      params:
        every_n_train_steps: 100 #1000
        filename: "{epoch}-{step}"
        save_weights_only: True
    metrics_over_trainsteps_checkpoint:
      target: pytorch_lightning.callbacks.ModelCheckpoint
      params:
        filename: '{epoch}-{step}'
        save_weights_only: True
        every_n_train_steps: 1000 #20000 # 3s/step*2w=
```
You can set it to 10 and run a quick test:
```yaml
model_checkpoint:
  target: pytorch_lightning.callbacks.ModelCheckpoint
  params:
    every_n_train_steps: 100 #1000
    filename: "{epoch}-{step}"
    save_weights_only: True
metrics_over_trainsteps_checkpoint:
  target: pytorch_lightning.callbacks.ModelCheckpoint
  params:
    filename: '{epoch}-{step}'
    save_weights_only: True
    every_n_train_steps: 10 #20000 # change this to 10
```
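One plain-Lightning detail worth keeping in mind: `every_n_train_steps` counts optimizer steps (`global_step`), and with `accumulate_grad_batches: 2` each optimizer step consumes two batches. With a short run capped at `max_steps: 50` (as in the config further down) and `every_n_train_steps: 10`, the steps that should each write a checkpoint work out as:

```python
# Which global steps should trigger a save with the test settings above
# (plain arithmetic; every_n_train_steps counts optimizer steps).
max_steps = 50
every_n_train_steps = 10
save_steps = [s for s in range(1, max_steps + 1) if s % every_n_train_steps == 0]
print(save_steps)  # [10, 20, 30, 40, 50]
```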
I changed it and started training; the run ended, but unfortunately the model was not saved to the directory.
```yaml
lightning:
  precision: 16
  trainer:
    benchmark: True
    accumulate_grad_batches: 2
    max_steps: 50
    log_every_n_steps: 50
    # val
    val_check_interval: 0.5
    gradient_clip_algorithm: 'norm'
    gradient_clip_val: 0.5
```
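One way to rule out the environment: confirm that plain PyTorch Lightning writes step checkpoints at all, independently of the DynamiCrafter code. A self-contained sanity check (assumptions: Lightning ~1.9 as in the log below; `save_top_k=-1` is added here so every checkpoint is kept rather than only the most recent one):

```python
# Self-contained sanity check, independent of the DynamiCrafter repo:
# with max_steps=50 and every_n_train_steps=10 this should leave .ckpt
# files under ./sanity/trainstep_checkpoints.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class Tiny(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

data = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 1)), batch_size=4)
ckpt = pl.callbacks.ModelCheckpoint(
    dirpath="sanity/trainstep_checkpoints",
    filename="{epoch}-{step}",
    save_weights_only=True,
    every_n_train_steps=10,
    save_top_k=-1,  # keep every checkpoint instead of only the latest
)
trainer = pl.Trainer(max_steps=50, callbacks=[ckpt], logger=False, enable_progress_bar=False)
trainer.fit(Tiny(), data)
```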
```
2024-05-28 11:52:32,490-INFO: @lightning version: 1.9.3 [>=1.8 required]
2024-05-28 11:52:32,490-INFO: Configing Model
2024-05-28 11:52:32,848-INFO: LatentVisualDiffusion: Running in v-prediction mode
2024-05-28 11:52:58,321-INFO: >>> Load weights from pretrained checkpoint
2024-05-28 11:53:54,809-INFO: >>> Loaded weights from pretrained checkpoint: checkpoints/dynamicrafter_512_v1/model.ckpt
2024-05-28 11:53:54,822-INFO: Running on 1=1x1 GPUs
2024-05-28 11:53:54,823-INFO: Configing Data
2024-05-28 11:53:54,935-INFO: train, WebVid, 8
2024-05-28 11:53:54,935-INFO: Configing Trainer
2024-05-28 11:53:54,941-INFO: Caution: Saving checkpoints every n train steps without deleting. This might require some free space.
2024-05-28 11:53:55,028-INFO: Running the Loop
2024-05-28 11:53:55,028-INFO:
```
I think maybe you should also change lvdm/main/utils_train.py line 75 to save models. I'm not sure. @caslix
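I don't know what that line contains either, so take this with a grain of salt, but a common failure mode with configs like the one above is that the callback section is parsed but never instantiated and passed to the Trainer. A hedged sketch of that wiring (the `instantiate_from_config` helper and the config layout are assumptions modeled on the YAML above, not the verified contents of lvdm/main/utils_train.py):

```python
# Hedged sketch: how callback configs like the YAML above are typically
# turned into live objects and handed to the Trainer. The helper below is
# an assumption modeled on common latent-diffusion codebases, not the
# verified contents of lvdm/main/utils_train.py.
import importlib
import pytorch_lightning as pl

def instantiate_from_config(cfg):
    # Build an object from a {'target': ..., 'params': ...} dict.
    module_name, cls_name = cfg["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), cls_name)
    return cls(**cfg.get("params", {}))

callbacks_cfg = {
    "model_checkpoint": {
        "target": "pytorch_lightning.callbacks.ModelCheckpoint",
        "params": {
            "every_n_train_steps": 100,
            "filename": "{epoch}-{step}",
            "save_weights_only": True,
        },
    },
}
callbacks = [instantiate_from_config(c) for c in callbacks_cfg.values()]
# If this list is never passed to the Trainer, no checkpoints are written.
trainer = pl.Trainer(max_steps=50, callbacks=callbacks)
```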
Thanks for the advice, but unfortunately, after the edits, the model is still not saved.
The saved checkpoints should be in main/your_named_dir/trainstep_checkpoints/...ckpt. @caslix
Yes, I know; the directories are created, but no file is saved after training completes. DynamiCrafter\finetune\training_512_v1.0\checkpoints\trainstep_checkpoints
In my case, another directory was also created under main/, and the ckpt is saved in that one; the finetune directory only contains log files.
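If it is unclear where a run put its checkpoints, a quick recursive search from the repo root narrows it down (plain Python, nothing repo-specific):

```python
# List every checkpoint file under the current directory, recursively.
from pathlib import Path

for ckpt in sorted(Path(".").rglob("*.ckpt")):
    print(ckpt)
```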
You are absolutely right! Yes, I found it: the saving takes place in the main directory. Many thanks, everything is saved now!
Hello! I am training and fine-tuning your model, and I don't quite understand where the model is saved after training. The learning process completes, and the directories (\training_512_v1.0\checkpoints\trainstep_checkpoints) for saving the model appear, but the model itself does not. I tried different numbers of epochs and steps.
```
/home/user/.local/lib/python3.10/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
  warnings.warn(
Epoch 124: 100%|█| 4/4 [06:17<00:00, 94.48s/it, loss=0.083, v_num=18, train/loss_simple_step=0.102, train/loss_vlb_step=
`Trainer.fit` stopped: `max_steps=250` reached.
Epoch 124: 100%|█| 4/4 [06:17<00:00, 94.48s/it, loss=0.083, v_num=18, train/loss_simple_step=0.102, train/loss_vlb_step=
```
Thanks for any hint or information!