No problem. I will clean and share the evaluation code later (probably after the CVPR conference week).
Our results are obtained on the whole nuScenes validation set, which includes 5369 samples in total. The evaluation code for FID uses the FrechetInceptionDistance module of torchmetrics, and the evaluation code for FVD is modified from LVDM. Hope these details are helpful if you want to evaluate on your own before our release.
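For reference, here is a minimal sketch of how the torchmetrics FrechetInceptionDistance module is typically used; the tensor shapes and variable names are illustrative placeholders, not the actual evaluation script:

```python
# Minimal FID sketch with torchmetrics; placeholder tensors stand in for the
# real and generated nuScenes frames (uint8, shape (N, 3, H, W)).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

real_frames = torch.randint(0, 256, (8, 3, 256, 448), dtype=torch.uint8)  # placeholder
gen_frames = torch.randint(0, 256, (8, 3, 256, 448), dtype=torch.uint8)   # placeholder

fid.update(real_frames, real=True)   # accumulate Inception statistics for real frames
fid.update(gen_frames, real=False)   # accumulate statistics for generated frames
print(fid.compute())                 # FID over everything accumulated so far
```

In practice, the update calls would be made batch by batch while iterating over the validation clips, with compute called once at the end.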
Thank you very much for your helpful reply; I'm looking forward to your further updates.
Hi @Little-Podi,
First of all, thank you for your excellent work and insightful paper!
I'm attempting to replicate the results presented in Table 2 of your paper. Could you please clarify if you employed any form of action control when generating the videos, or if you used just the pre-trained video model without any action control?
Additionally, could you share any tips or common pitfalls to watch out for that might affect the replication of the results in this table?
Hi @ABaldrati, thanks for your interest.
> Could you please clarify if you employed any form of action control when generating the videos, or if you used just the pre-trained video model without any action control?
The results in Table 2 are evaluated in action-free mode.
> Additionally, could you share any tips or common pitfalls to watch out for that might affect the replication of the results in this table?
There is nothing special about our evaluation. Just make sure you evaluate on all samples from the nuScenes validation set, which may take days with a single GPU.
Hi @Little-Podi,
Thank you immensely for your availability and prompt response!
Before I start running the inference (as it takes quite some time), I just want to ensure that all the hyperparameters are set correctly. Specifically:
- `--n_rounds = 1`
- `--n_frames = 25`
- `--n_conds = 3`? The default value in the sample.py script is 1, but I believe you used 3, correct? Please correct me if I'm wrong.
- `--cfg_scale = 2.5`
I apologize for any inconvenience, but I want to be absolutely certain everything is set correctly.
Thank you again for your help!
Exactly, but I think I used `--n_conds = 1` for evaluation. Using `--n_conds = 3` may lead to similar or better results, just like the effect of using ground-truth action controls. You can also disable `--rand_gen` (by passing this argument) to automatically go through all validation samples. Besides, remember to take all predicted frames into account during the evaluation.
Thanks so much for your quick response and for your availability. I really appreciate you making the code open-source and releasing the weights.
Thanks again for your help!
No worries, feel free to contact us if you have any further questions.
Hi @Little-Podi,
First of all, thank you very much for your support.
I've successfully generated all the videos for the nuScenes validation set and can replicate the FID numbers reported in Table 2, achieving even slightly lower numbers. However, I'm having difficulty replicating the FVD numbers. Could you please provide more details on the specific parameters you used for computing the FVD? For instance, the resolution, number of frames, resizing strategy, and any other relevant details would be extremely helpful.
Thank you again for your help!
Hi @ABaldrati, thanks for your feedback. Sorry for the late reply. I have returned from CVPR, but I still have lots of things to deal with in the following days.
> Could you please provide more details on the specific parameters you used for computing the FVD?
All 25 frames in each clip are used for calculating FVD. I just checked our evaluation script: the frames are resized to (256, 448) when loading the generated images, and are eventually resized to (224, 224) before being sent to the I3D model. I don't remember why we resize twice, but I will check it.
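For context, FVD boils down to a Frechet distance between Gaussian fits of I3D clip features. A minimal sketch of that distance computation is below, assuming real_feats and gen_feats are (N, D) feature arrays already extracted (e.g., by the LVDM-based code mentioned above); the I3D feature extraction itself is not shown:

```python
# Frechet distance between two Gaussian fits, as used by FVD (and FID).
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, sigma_r = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
    mu_g, sigma_g = gen_feats.mean(axis=0), np.cov(gen_feats, rowvar=False)
    diff = mu_r - mu_g
    # Matrix square root of the covariance product; drop tiny imaginary parts.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```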
> I'm having difficulty replicating the FVD numbers.
May I ask what FID and FVD scores you got? In fact, we continued to tune the checkpoint for a few iterations under the phase2_stage2 setting before its release. I didn't retest it in terms of metrics, but I think it should be close. I will verify later to decide whether it is necessary to provide the older checkpoint. Based on the few samples I have seen, I think the current checkpoint is better from a perceptual perspective.
Hi @Little-Podi,
Thank you for your response!
Could you please clarify if the resizing from (256, 448) to (224, 224) is done using non-proportional resizing or a center crop?
For reference, I obtained an FID score of 6.7, which is very close to the 6.9 reported in the paper, indicating that our results are comparable. However, my FVD score is 139, which is significantly different, leading me to believe there might be an issue with my evaluation script.
Thanks again!
> Could you please clarify if the resizing from (256, 448) to (224, 224) is done using non-proportional resizing or a center crop?
Oh, now I know why we conduct the resizing separately. We did center cropping before resizing from (576, 1024) to (256, 448) via the Pillow package. We didn't use cropping when resizing from (256, 448) to (224, 224) via `F.interpolate`. Did you evaluate on all 5369 video clips? The FVD score seems to be too high. I will retest the checkpoint and also provide the cleaned evaluation code later.
Hi @Little-Podi,
Thank you for the clarification.
I’m a bit confused about the center cropping before resizing from (576, 1024) to (256, 448) since the proportions seem to be maintained in this step. Could you please provide a brief step-by-step description of each resizing step?
Yes, I have evaluated on all 5369 video clips.
Thank you for your assistance!
> I’m a bit confused about the center cropping before resizing from (576, 1024) to (256, 448) since the proportions seem to be maintained in this step. Could you please provide a brief step-by-step description of each resizing step?
The aspect ratios are almost the same, but some pixels will leak without center cropping. The implementation is identical to our data preprocessing here. The latter resizing step looks like this:
output_frames = F.interpolate(input_frames, size=(224, 224), mode="bilinear", align_corners=False)
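Putting the two steps together, here is a rough sketch of the frame preprocessing as described in this thread; the crop margins and the bicubic filter are assumptions, since the exact Pillow call lives in the referenced data-preprocessing code:

```python
# Sketch of the two-step resizing: center crop + Pillow resize to (256, 448),
# then F.interpolate to (224, 224) right before the I3D model.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import to_tensor

def load_frame(path: str) -> torch.Tensor:
    img = Image.open(path).convert("RGB")        # e.g. 1024x576 (W x H)
    w, h = img.size
    target_ratio = 448 / 256                     # 1.75, vs. 1024/576 ~= 1.78
    crop_w = min(w, round(h * target_ratio))     # 1008 for a 1024x576 frame
    left = (w - crop_w) // 2
    img = img.crop((left, 0, left + crop_w, h))  # center crop along the width
    img = img.resize((448, 256), Image.BICUBIC)  # resampling filter is an assumption
    return to_tensor(img)                        # (3, 256, 448), float in [0, 1]

def to_i3d_input(frames: torch.Tensor) -> torch.Tensor:
    # frames: (T, 3, 256, 448) -> (T, 3, 224, 224); no cropping in this step
    return F.interpolate(frames, size=(224, 224), mode="bilinear", align_corners=False)
```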
Hi @Little-Podi,
Thank you for your availability and the detailed information.
Despite following the provided details, I still can't replicate the FVD results. I'll wait for the release of the evaluation code.
Thanks again, and great work on the project!
Thank you very much for your exciting work. Do you have any plans to release the evaluation code corresponding to Table 2?