Trained checkpoints & README instructions

v-i-s-h commented 2 months ago

Hi, Thanks for open sourcing the code. Could you please also share the trained checkpoints and update the README on how to use the scripts? I'm trying to run the code, but without instructions, it is getting difficult to debug.

MoayedHajiAli commented 1 month ago

Hello @v-i-s-h, Apologies for the late reply and for not updating the instructions earlier. I will update the instructions this weekend. As for the pre-trained checkpoints, unfortunately, we will not be able to release them at the moment.

1851759 commented 3 weeks ago

Hi, Thanks for open sourcing the code. Could you please update the README on how to use the scripts? Thanks a lot.

MoayedHajiAli commented 3 weeks ago

@v-i-s-h @1851759 I sincerely apologize for the delay in adding the instructions. I have now added the instructions and tested the code. Please let me know if you face any issues running it.

To compensate for the delay, I am willing to meet with you and walk you through the code or answer any questions that you may have. To arrange for the meeting, please email me at mh155@rice.edu.

1851759 commented 2 weeks ago

Thanks for updating! I ran into a problem during training. The function log_images in src.models.vidstyleode (line 404) expects vid_bf to have shape (B x T x C x H x W), but it receives (B x C x H x W). sampleT and inversions_bf also lose the T dimension. Could you please help me solve this?

vid_bf torch.Size([15, 3, 256, 128])
sampleT (15,)
Summoning checkpoint.
Traceback (most recent call last):
  File "/home/zt/lxp/VidStyleODE-official/main.py", line 764, in <module>
    trainer.fit(model, data.datasets['train'], data.datasets['validation'])
  File "/home/zt/miniconda3/envs/vidstyleode/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/zt/miniconda3/envs/vidstyleode/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/zt/miniconda3/envs/vidstyleode/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/zt/miniconda3/envs/vidstyleode/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/home/zt/miniconda3/envs/vidstyleode/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_stage
    self._run_sanity_check()
  File "/home/zt/miniconda3/envs/vidstyleode/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1060, in _run_sanity_check
    val_loop.run()
  File "/home/zt/miniconda3/envs/vidstyleode/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/zt/miniconda3/envs/vidstyleode/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/home/zt/miniconda3/envs/vidstyleode/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 410, in _evaluation_step
    call._call_callback_hooks(trainer, hook_name, output, *hook_kwargs.values())
  File "/home/zt/miniconda3/envs/vidstyleode/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/home/zt/lxp/VidStyleODE-official/main.py", line 506, in on_validation_batch_end
    self.log_img(pl_module, batch, batch_idx, split="val")
  File "/home/zt/lxp/VidStyleODE-official/main.py", line 472, in log_img
    images = pl_module.log_images(batch, split=split)
  File "/home/zt/lxp/VidStyleODE-official/src/models/vidstyleode.py", line 442, in log_images
    ts = ts - ts[0]
IndexError: invalid index to scalar variable.

MoayedHajiAli commented 2 weeks ago

Hello @1851759, Are you sure you are using the up-to-date version? It seems that you have introduced a new line in main.py (line 764): "trainer.fit(model, data.datasets['train'], data.datasets['validation'])". This line fits the model on the datasets rather than on the data module, so the batch dimension is missing. Please fit the model on the data module, as in the provided code.
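To illustrate the shape difference with plain Python (a sketch only: `batched_shape` is a hypothetical stand-in, not the repository's dataset code or PyTorch's collate function), indexing a dataset yields a single video of shape (T x C x H x W), and only the batching step performed by a dataloader adds the leading B dimension:

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def batched_shape(samples: List[Shape]) -> Shape:
    """Shape obtained by stacking equally shaped samples along a new leading axis."""
    if not samples:
        raise ValueError("cannot batch an empty list of samples")
    first = samples[0]
    if any(s != first for s in samples):
        raise ValueError("all samples in a batch must share one shape")
    # The batch size B becomes the new first dimension.
    return (len(samples),) + first


# One video as a dataset item: (T, C, H, W), matching the vid_bf shape in the log above.
single_video: Shape = (15, 3, 256, 128)

# Four such videos after batching: (B, T, C, H, W).
batch = batched_shape([single_video] * 4)
```

Here `single_video` has four dimensions while `batch` has five, which is the form log_images is reported to expect.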

1851759 commented 2 weeks ago

I changed "trainer.fit(model, data)" to "trainer.fit(model, data.datasets['train'], data.datasets['validation'])" because the original call stopped with this message in the logs: "Trainer.fit stopped: No training batches."

1851759 commented 2 weeks ago

Could you please help me solve this problem?

84.3 M Trainable params
267 M Non-trainable params
351 M Total params
1,407.006 Total estimated model params size (MB)
Trainer.fit stopped: No training batches.
wandb: You can sync this run to the cloud by running:

v-i-s-h commented 2 weeks ago

@MoayedHajiAli Thanks a lot for updating the repo. I will go through this and surely get back to you if I have any doubts. Thanks for offering such wonderful support!

1851759 commented 2 weeks ago

Could you please help me solve this problem?

84.3 M Trainable params
267 M Non-trainable params
351 M Total params
1,407.006 Total estimated model params size (MB)
Trainer.fit stopped: No training batches.
wandb: You can sync this run to the cloud by running:

I solved it by making the batch size smaller than the number of videos.

MoayedHajiAli commented 1 week ago

@1851759 I am glad that you could solve it. Yes, in our code repo we drop the last batch in the dataloader, so if you do not have enough videos to fill even a single batch, you will get this message. Please let me know if you face any other issues.
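As a rough illustration of the arithmetic (a sketch only, not the repository's dataloader code; the helper name `num_batches` is hypothetical): when the last incomplete batch is dropped, the number of batches per epoch is the dataset size divided by the batch size, rounded down, which is zero whenever there are fewer videos than the batch size.

```python
import math


def num_batches(num_videos: int, batch_size: int, drop_last: bool = True) -> int:
    """Number of batches one epoch yields for a dataset of `num_videos` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # Incomplete trailing batch is discarded, so round down.
        return num_videos // batch_size
    # Incomplete trailing batch is kept, so round up.
    return math.ceil(num_videos / batch_size)


# Fewer videos than the batch size: no batches at all when the last batch is dropped.
too_few = num_batches(num_videos=10, batch_size=16)
# Lowering the batch size below the number of videos gives at least one batch.
enough = num_batches(num_videos=10, batch_size=8)
```

With these example numbers, `too_few` is 0 and `enough` is 1, which matches the fix of choosing a batch size smaller than the number of videos.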

1851759 commented 1 week ago

When I use "resume", it reports the following error:

Traceback (most recent call last):
  File "/xxx/VidStyleODE-official/main.py", line 794, in <module>
    trainer = Trainer(**vars(trainer_opt), **trainer_kwargs, resume_from_checkpoint=ckpt)
  File "/xxx/miniconda3/envs/vidstyleode/lib/python3.10/site-packages/pytorch_lightning/utilities/argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: Trainer.__init__() got an unexpected keyword argument 'resume_from_checkpoint'

I read the docs for pytorch_lightning 2.2.1. There is no 'resume_from_checkpoint' argument in Trainer.__init__().

Trainer.fit(model, train_dataloaders=None, val_dataloaders=None, datamodule=None, ckpt_path=None)

Parameters:
xxx

ckpt_path – Path/URL of the checkpoint from which training is resumed. Could also be one of two special keywords "last" and "hpc". If there is no checkpoint file at the path, an exception is raised.

Do I need to change "trainer.fit(model, data)" to "trainer.fit(model, data, ckpt_path=opt.ckpt)" when using opt.resume?

MoayedHajiAli commented 1 week ago

Hello @1851759, Yes. Since I updated the code to a newer PyTorch Lightning version, some legacy code remains. I only tested the main training functionality with the new PyTorch Lightning version. I have now updated the repo with the correct resume logic (i.e., passing the checkpoint to the fit function) and added instructions to the README on how to resume.
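A minimal sketch of that pattern, assuming the checkpoint path arrives through a command-line option such as the opt.ckpt mentioned above (the helper `fit_kwargs_for_resume` and the example path are hypothetical and not taken from the repository): in PyTorch Lightning 2.x the checkpoint is passed to `Trainer.fit` through `ckpt_path` rather than to the `Trainer` constructor.

```python
from typing import Dict, Optional


def fit_kwargs_for_resume(ckpt: Optional[str]) -> Dict[str, str]:
    """Extra keyword arguments for Trainer.fit: empty for a fresh run, ckpt_path when resuming."""
    return {"ckpt_path": ckpt} if ckpt else {}


# Intended use inside a training script (needs pytorch_lightning; `trainer`,
# `model`, `data`, and `opt` are assumed to be defined by that script):
#   trainer = Trainer(**trainer_kwargs)   # no resume_from_checkpoint argument
#   trainer.fit(model, data, **fit_kwargs_for_resume(opt.ckpt))

fresh_run = fit_kwargs_for_resume(None)
resumed_run = fit_kwargs_for_resume("logs/example/last.ckpt")  # hypothetical path
```

Since `ckpt_path` defaults to None in `Trainer.fit`, passing `ckpt_path=opt.ckpt` directly would behave the same when no checkpoint is given; the helper only makes the two cases explicit.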

MoayedHajiAli / VidStyleODE-official

Trained checkpoints & README instructions #2