Hi, you can modify `every_n_train_steps` in config.yaml:
```yaml
metrics_over_trainsteps_checkpoint:
  target: pytorch_lightning.callbacks.ModelCheckpoint
  params:
    filename: '{epoch}-{step}'
    save_weights_only: True
    every_n_train_steps: 10000 # change this
```
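For reference, these params map one-to-one onto the standard `pytorch_lightning.callbacks.ModelCheckpoint`; a minimal sketch of the same settings in plain Python (the `dirpath` here is just a placeholder for illustration, not the repo's actual save path):

```python
# Minimal sketch of the same checkpoint settings in plain PyTorch Lightning.
# The dirpath below is a placeholder for illustration, not the repo's path.
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(
    dirpath="some_dir/trainstep_checkpoints",  # placeholder
    filename="{epoch}-{step}",
    save_weights_only=True,
    every_n_train_steps=10000,  # write a checkpoint every 10000 optimizer steps
)
# Pass it to the Trainer via callbacks=[checkpoint_cb].
```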
Yes, I did. I reduced the value so that training ended earlier. I'm still running tests, but the model still hasn't been saved.
```yaml
lightning:
  precision: 16
  trainer:
    benchmark: True
    accumulate_grad_batches: 2
    max_steps: 250
    log_every_n_steps: 50
    # val
    val_check_interval: 0.5
    gradient_clip_algorithm: 'norm'
    gradient_clip_val: 0.5
  callbacks:
    model_checkpoint:
      target: pytorch_lightning.callbacks.ModelCheckpoint
      params:
        every_n_train_steps: 100 #1000
        filename: "{epoch}-{step}"
        save_weights_only: True
    metrics_over_trainsteps_checkpoint:
      target: pytorch_lightning.callbacks.ModelCheckpoint
      params:
        filename: '{epoch}-{step}'
        save_weights_only: True
        every_n_train_steps: 1000 #20000 # 3s/step*2w=
```
You can set it to 10 and run a quick test:
```yaml
model_checkpoint:
  target: pytorch_lightning.callbacks.ModelCheckpoint
  params:
    every_n_train_steps: 100 #1000
    filename: "{epoch}-{step}"
    save_weights_only: True
metrics_over_trainsteps_checkpoint:
  target: pytorch_lightning.callbacks.ModelCheckpoint
  params:
    filename: '{epoch}-{step}'
    save_weights_only: True
    every_n_train_steps: 10 #20000 # change this to 10
```
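One plain-Lightning detail worth keeping in mind: `every_n_train_steps` counts optimizer steps (`global_step`), and with `accumulate_grad_batches: 2` each optimizer step consumes two batches. With a short run capped at `max_steps: 50` (as in the config further down) and `every_n_train_steps: 10`, the steps that should each write a checkpoint work out as:

```python
# Which global steps should trigger a save with the test settings above
# (plain arithmetic; every_n_train_steps counts optimizer steps).
max_steps = 50
every_n_train_steps = 10
save_steps = [s for s in range(1, max_steps + 1) if s % every_n_train_steps == 0]
print(save_steps)  # [10, 20, 30, 40, 50]
```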
I changed it and started training; the run ended, but unfortunately the model was not saved to the directory.
```yaml
lightning:
  precision: 16
  trainer:
    benchmark: True
    accumulate_grad_batches: 2
    max_steps: 50
    log_every_n_steps: 50
    # val
    val_check_interval: 0.5
    gradient_clip_algorithm: 'norm'
    gradient_clip_val: 0.5
```
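One way to rule out the environment: confirm that plain PyTorch Lightning writes step checkpoints at all, independently of the DynamiCrafter code. A self-contained sanity check (assumptions: Lightning ~1.9 as in the log below; `save_top_k=-1` is added here so every checkpoint is kept rather than only the most recent one):

```python
# Self-contained sanity check, independent of the DynamiCrafter repo:
# with max_steps=50 and every_n_train_steps=10 this should leave .ckpt
# files under ./sanity/trainstep_checkpoints.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class Tiny(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

data = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 1)), batch_size=4)
ckpt = pl.callbacks.ModelCheckpoint(
    dirpath="sanity/trainstep_checkpoints",
    filename="{epoch}-{step}",
    save_weights_only=True,
    every_n_train_steps=10,
    save_top_k=-1,  # keep every checkpoint instead of only the latest
)
trainer = pl.Trainer(max_steps=50, callbacks=[ckpt], logger=False, enable_progress_bar=False)
trainer.fit(Tiny(), data)
```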
```
2024-05-28 11:52:32,490-INFO: @lightning version: 1.9.3 [>=1.8 required]
2024-05-28 11:52:32,490-INFO: Configing Model
2024-05-28 11:52:32,848-INFO: LatentVisualDiffusion: Running in v-prediction mode
2024-05-28 11:52:58,321-INFO: >>> Load weights from pretrained checkpoint
2024-05-28 11:53:54,809-INFO: >>> Loaded weights from pretrained checkpoint: checkpoints/dynamicrafter_512_v1/model.ckpt
2024-05-28 11:53:54,822-INFO: Running on 1=1x1 GPUs
2024-05-28 11:53:54,823-INFO: Configing Data
2024-05-28 11:53:54,935-INFO: train, WebVid, 8
2024-05-28 11:53:54,935-INFO: Configing Trainer
2024-05-28 11:53:54,941-INFO: Caution: Saving checkpoints every n train steps without deleting. This might require some free space.
2024-05-28 11:53:55,028-INFO: Running the Loop
2024-05-28 11:53:55,028-INFO:
```
I think maybe you should also change lvdm/main/utils_train.py line 75 to save models. I'm not sure. @caslix
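I don't know what that line contains either, so take this with a grain of salt, but a common failure mode with configs like the one above is that the callback section is parsed but never instantiated and passed to the Trainer. A hedged sketch of that wiring (the `instantiate_from_config` helper and the config layout are assumptions modeled on the YAML above, not the verified contents of lvdm/main/utils_train.py):

```python
# Hedged sketch: how callback configs like the YAML above are typically
# turned into live objects and handed to the Trainer. The helper below is
# an assumption modeled on common latent-diffusion codebases, not the
# verified contents of lvdm/main/utils_train.py.
import importlib
import pytorch_lightning as pl

def instantiate_from_config(cfg):
    # Build an object from a {'target': ..., 'params': ...} dict.
    module_name, cls_name = cfg["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), cls_name)
    return cls(**cfg.get("params", {}))

callbacks_cfg = {
    "model_checkpoint": {
        "target": "pytorch_lightning.callbacks.ModelCheckpoint",
        "params": {
            "every_n_train_steps": 100,
            "filename": "{epoch}-{step}",
            "save_weights_only": True,
        },
    },
}
callbacks = [instantiate_from_config(c) for c in callbacks_cfg.values()]
# If this list is never passed to the Trainer, no checkpoints are written.
trainer = pl.Trainer(max_steps=50, callbacks=callbacks)
```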
Thanks for the advice, but unfortunately, after the edits, the model is still not saved.
The saved checkpoints should be in main/your_named_dir/trainstep_checkpoints/...ckpt. @caslix
Yes, I know; the directories are created, but no file is saved after training completes. DynamiCrafter\finetune\training_512_v1.0\checkpoints\trainstep_checkpoints
In my case, another directory was also created under main/, and the ckpt is saved in that one; the finetune directory only contains log files.
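If it is unclear where a run put its checkpoints, a quick recursive search from the repo root narrows it down (plain Python, nothing repo-specific):

```python
# List every checkpoint file under the current directory, recursively.
from pathlib import Path

for ckpt in sorted(Path(".").rglob("*.ckpt")):
    print(ckpt)
```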
You are absolutely right! Yes, I found it: the saving takes place in the main directory. Many thanks, everything is saved now!
Hello! I am training and fine-tuning your model, and I don't quite understand where the model is saved after training. The learning process completes, and the directories (\training_512_v1.0\checkpoints\trainstep_checkpoints) for saving the model appear, but the model itself does not. I tried different numbers of epochs and steps.
```
/home/user/.local/lib/python3.10/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
  warnings.warn(
Epoch 124: 100%|█| 4/4 [06:17<00:00, 94.48s/it, loss=0.083, v_num=18, train/loss_simple_step=0.102, train/loss_vlb_step=
`Trainer.fit` stopped: `max_steps=250` reached.
Epoch 124: 100%|█| 4/4 [06:17<00:00, 94.48s/it, loss=0.083, v_num=18, train/loss_simple_step=0.102, train/loss_vlb_step=
```
Thanks for any hint or information!