Closed · julian-q closed this issue 3 months ago
Hello,
How do the training_vis images look? Do they look legit?
I am wondering whether you finished downloading the entire dataset, which can take a couple of days, or whether you accidentally used the mini subset shipped with the checkpoint file.
Some people have trained with the mini subset, and as a result they were only training on about 10 videos and overfit.
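One quick way to check is to count how many video files are actually in the dataset directory your run points at. A rough sketch (the data_root path is an assumption about your local layout):

```python
from pathlib import Path

# Rough sanity check: the mini subset shipped with the checkpoint only contains
# around 10 videos, so counting files per split shows which dataset is in use.
# The data_root path below is an assumption about the local layout.
data_root = Path("data/minecraft")
for split_dir in sorted(p for p in data_root.iterdir() if p.is_dir()):
    n_files = sum(1 for f in split_dir.rglob("*") if f.is_file())
    print(f"{split_dir.name}: {n_files} files")
```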
Here's the latest training_vis/video_2 from wandb - it looks pretty noisy:
I think I have the whole dataset because I symlinked the dataset I previously downloaded with the TECO bash script.
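For reference, the link was set up roughly like this; both paths are placeholders for wherever the TECO script wrote the data and wherever this repo expects to find it:

```python
from pathlib import Path

# Placeholder paths: where the TECO download script wrote the videos, and where
# this repo expects the dataset to live (both paths are assumptions).
teco_download = Path("/data/teco/minecraft")
repo_dataset_dir = Path("data/minecraft")

repo_dataset_dir.parent.mkdir(parents=True, exist_ok=True)
if not repo_dataset_dir.exists():
    repo_dataset_dir.symlink_to(teco_download, target_is_directory=True)
```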
Here's my config.yaml. I think the only modification I made was the batch size. Maybe I should retry on a machine with more memory so that I can use the original config.
Update: Oh, oops, it turns out I was not training on the whole dataset, because I was still using the metadata.json from the sample dataset. To create a metadata.json for my existing TECO dataset, I wrote this script. I'll train again with this metadata file and see how it goes!
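The script does roughly the following; this is a minimal sketch rather than the exact code, and it assumes the videos sit under data/minecraft/<split>/ as .npz files and that metadata.json simply maps each split to a list of filenames:

```python
import json
from pathlib import Path

# Minimal sketch of a metadata.json generator. It assumes the TECO videos live
# under data/minecraft/<split>/ as .npz files and that metadata.json maps each
# split to a list of filenames; both the layout and the format are assumptions.
data_root = Path("data/minecraft")
metadata = {
    split: sorted(p.name for p in (data_root / split).glob("*.npz"))
    for split in ("train", "validation")
}

with open(data_root / "metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

print({split: len(files) for split, files in metadata.items()})
```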
Even after training with the original config on the whole dataset, the loss still looks stagnant 🤔
And the training vis is still noisy:
I don't think it's a problem with the dataset, since when I switch to the paper branch and train the RNN-based video model, the loss goes down just fine.
"minecraft_video" is on the paper branch, "minecraft_unet_transformer" is on the main branch; same data for both runs.
Okay, I tried cloning the diffusion-forcing-transformer repo for the original implementation of the transformer UNet to see if there's a difference. I started training using this command, and I now see the loss going down:
After 9k steps the training vis looks good too:
Will try to see what the discrepancy is between main and diffusion-forcing-transformer.
Hi Julian, I will immediately start to debug this. Right now the v1.5 video diffusion is supposed to be exactly the same as diffusion-forcing-transformer.
Thank you for pointing this out; I found the bug. Can you pull this repo again?
The bug fix is this
Haha, nice catch :) Loss is going down now on main! Thank you 🙏
Hi, thanks for the nice repo! I'm trying to follow the instructions for training a video model on Minecraft, and it looks like the loss is staying stagnant for 60k steps.
Command:
python -m main +name=minecraft_unet algorithm=df_video dataset=video_minecraft
Is this normal? Any tips much appreciated!!