OpenDriveLab / Vista

[NeurIPS 2024] A Generalizable World Model for Autonomous Driving
https://opendrivelab.com/Vista
Apache License 2.0

Any plan to release evaluation code? #9

Open · ILOFI opened this issue 4 months ago

ILOFI commented 4 months ago

Thank you very much for your exciting work. Do you have any plan to release evaluation code corresponding to Table 2?

Little-Podi commented 4 months ago

No problem. I will clean and share the evaluation code later (probably after the CVPR conference week).

Our results are obtained on the whole nuScenes validation set, which includes 5369 samples in total. The evaluation code for FID uses the FrechetInceptionDistance module of torchmetrics, and the evaluation code for FVD is modified from LVDM. Hope these are helpful if you want to evaluate on your own before our release.
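
For reference, a minimal sketch of such an FID computation could look like the snippet below (the function shape and batching here are illustrative, not our exact script):

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Illustrative only, not the exact evaluation script.
# real_frames / gen_frames: uint8 tensors of shape (N, 3, H, W).
def compute_fid(real_frames: torch.Tensor, gen_frames: torch.Tensor) -> float:
    fid = FrechetInceptionDistance(feature=2048)  # Inception-v3 pool features
    fid.update(real_frames, real=True)
    fid.update(gen_frames, real=False)
    return fid.compute().item()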

ILOFI commented 4 months ago

Thank you very much for your helpful reply, looking forward to your further updates.

ABaldrati commented 4 months ago

Hi @Little-Podi,

First of all, thank you for your excellent work and insightful paper!

I'm attempting to replicate the results presented in Table 2 of your paper. Could you please clarify if you employed any form of action control when generating the videos, or if you used just the pre-trained video model without any action control?

Additionally, could you share any tips or common pitfalls to watch out for that might affect the replication of the results in this table?

Little-Podi commented 4 months ago

Hi @ABaldrati, thanks for your interest.

Could you please clarify if you employed any form of action control when generating the videos, or if you used just the pre-trained video model without any action control?

The results in Table 2 are evaluated in action-free mode.

Additionally, could you share any tips or common pitfalls to watch out for that might affect the replication of the results in this table?

There is nothing special about our evaluation. Just make sure you evaluate on all samples from the nuScenes validation set, which may take days with a single GPU.

ABaldrati commented 4 months ago

Hi @Little-Podi,

Thank you immensely for your availability and prompt response!

Before I start running the inference (as it takes quite some time), I just want to ensure that all the hyperparameters are set correctly. Specifically:

[screenshot of the hyperparameter settings not preserved in this transcript]

I apologize for any inconvenience, but I want to be absolutely certain everything is set correctly.

Thank you again for your help!

Little-Podi commented 4 months ago

Exactly, but I think I used --n_conds = 1 for evaluation. Using --n_conds = 3 may lead to similar or better results, much like the effect of using ground-truth action controls. You can also disable random generation by passing the --rand_gen flag, which makes the script automatically go through all validation samples. Besides, remember to take all predicted frames into account during the evaluation.

ABaldrati commented 4 months ago

Thanks so much for your quick response and for your availability. I really appreciate you making the code open-source and releasing the weights.

Thanks again for your help!

Little-Podi commented 4 months ago

No worries, feel free to contact us if you have any further questions.

ABaldrati commented 4 months ago

Hi @Little-Podi,

First of all, thank you very much for your support.

I've successfully generated all the videos for the nuScenes validation set and can replicate the FID numbers reported in Table 2, achieving even slightly lower numbers. However, I'm having difficulty replicating the FVD numbers. Could you please provide more details on the specific parameters you used for computing the FVD? For instance, the resolution, number of frames, resizing strategy, and any other relevant details would be extremely helpful.

Thank you again for your help!

Little-Podi commented 4 months ago

Hi @ABaldrati, thanks for your feedback. Sorry for the late reply. I have returned from CVPR, but I still have lots of things to deal with in the following days.

Could you please provide more details on the specific parameters you used for computing the FVD?

All 25 frames in each clip are used for calculating FVD. I just checked our evaluation script. The frame resolution is resized to (256, 448) when loading the generated images, and is eventually resized to (224, 224) before being sent to the I3D model. I don't remember why we conduct the resizing twice, but I will check.
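
For context, once the I3D features are extracted, FVD reduces to the standard Fréchet distance between two Gaussian fits. A generic sketch of that final step (not our LVDM-derived code, which also handles the feature extraction) is:

import numpy as np
from scipy import linalg

# Generic Fréchet distance between Gaussian fits of two feature sets.
# real_feats / gen_feats: (N, D) arrays of per-clip I3D features.
def frechet_distance(real_feats, gen_feats):
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))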

I'm having difficulty replicating the FVD numbers.

May I ask what FID and FVD scores you got? In fact, we continued to tune the checkpoint for a few iterations under the phase2_stage2 setting before its release. I didn't retest it in terms of metrics, but I think it should be close. I will verify later to decide whether it is necessary to provide the older checkpoint. Based on the few samples I have seen, I think the current checkpoint is better from a perceptual perspective.

ABaldrati commented 4 months ago

Hi @Little-Podi,

Thank you for your response!

Could you please clarify if the resizing from (256, 448) to (224, 224) is done using non-proportional resizing or a center crop?

For reference, I obtained an FID score of 6.7, which is very close to the 6.9 reported in the paper, indicating that our results are comparable. However, my FVD score is 139, which is significantly different, leading me to believe there might be an issue with my evaluation script.

Thanks again!

Little-Podi commented 4 months ago

Could you please clarify if the resizing from (256, 448) to (224, 224) is done using non-proportional resizing or a center crop?

Oh, now I know why we conduct the resizing separately. We did center cropping before resizing from (576, 1024) to (256, 448) via the Pillow package. We didn't use cropping when resizing from (256, 448) to (224, 224) via F.interpolate. Did you evaluate on all 5369 video clips? The FVD score seems to be too high. I will retest the checkpoint and also provide the cleaned evaluation code later.
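
To make this concrete, the loading-time crop-and-resize is roughly like the sketch below (the file name and resampling filter are illustrative, not our exact code):

from PIL import Image

# Illustrative loading-time preprocessing. 1024/576 is about 1.778 while
# 448/256 = 1.75, so a centered crop to 1008x576 removes the extra pixels
# before the proportional resize.
img = Image.open("generated_frame.png")        # illustrative file name, 1024x576
w, h = img.size
crop_w = int(h * 448 / 256)                    # 1008 when h == 576
left = (w - crop_w) // 2
img = img.crop((left, 0, left + crop_w, h))    # center crop to 1008x576
img = img.resize((448, 256), Image.BICUBIC)    # resampling filter is a guess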

ABaldrati commented 4 months ago

Hi @Little-Podi,

Thank you for the clarification.

I’m a bit confused about the center cropping before resizing from (576, 1024) to (256, 448) since the proportions seem to be maintained in this step. Could you please provide a brief step-by-step description of each resizing step?

Yes, I have evaluated on all 5369 video clips.

Thank you for your assistance!

Little-Podi commented 4 months ago

I’m a bit confused about the center cropping before resizing from (576, 1024) to (256, 448) since the proportions seem to be maintained in this step. Could you please provide a brief step-by-step description of each resizing step?

The aspect ratios are almost the same, but a few extra pixels would leak in without center cropping. The implementation is identical to our data preprocessing here. For the latter resizing step, it looks like this:

import torch.nn.functional as F  # resize (256, 448) -> (224, 224), non-proportional bilinear
output_frames = F.interpolate(input_frames, size=(224, 224), mode="bilinear", align_corners=False)

ABaldrati commented 4 months ago

Hi @Little-Podi,

Thank you for your availability and the detailed information.

Despite following the provided details, I still can't replicate the FVD results. I'll wait for the release of the evaluation code.

Thanks again, and great work on the project!