Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

About the perception test set #49

Open mary-0830 opened 1 month ago

mary-0830 commented 1 month ago

Hello authors, thanks for sharing this fantastic work. Could you tell me where this dataset came from? Can you share a link or the data? "/lustre/fsw/portfolios/nvr/projects/nvr_elm_llm/dataset/video_datasets_v2/perception_test/"

Efficient-Large-Language-Model commented 1 month ago

https://github.com/google-deepmind/perception_test

Specifically https://storage.googleapis.com/dm-perception-test/zip_data/valid_videos.zip
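For convenience, here is a minimal sketch of downloading and extracting the validation videos with the Python standard library. The local destination directory is an assumption; point the evaluation's dataset path at wherever you unpack the archive.

```python
# Download and extract the Perception Test validation videos.
# The destination directory below is an assumption, not a required path.
import urllib.request
import zipfile
from pathlib import Path

URL = "https://storage.googleapis.com/dm-perception-test/zip_data/valid_videos.zip"
dest = Path("data/perception_test")  # hypothetical local path
dest.mkdir(parents=True, exist_ok=True)

archive = dest / "valid_videos.zip"
if not archive.exists():
    urllib.request.urlretrieve(URL, archive)  # large download; this may take a while

with zipfile.ZipFile(archive) as zf:
    zf.extractall(dest)
    print(f"Extracted {len(zf.namelist())} files to {dest}")
```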

mary-0830 commented 1 month ago

> https://github.com/google-deepmind/perception_test
>
> Specifically https://storage.googleapis.com/dm-perception-test/zip_data/valid_videos.zip

Thank you for the quick reply! I have two follow-up questions.

  1. Does the Perception Test evaluation not require GPT assistance?
  2. Why does the input in model_vqa_videoperception.py differ from the other VQA inference evaluations? Why does get_model_option return a loss?

Efficient-Large-Language-Model commented 1 month ago

  1. Yes, it does not require GPT assistance.
  2. We followed the official repo to implement the evaluation; please refer to the official repo for details.
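For context on both answers: a multiple-choice track like the Perception Test can be scored without a GPT judge by ranking the candidate options with the model's own loss, which is presumably why get_model_option returns a loss. The sketch below shows this general likelihood-ranking technique with a generic Hugging Face causal LM; it illustrates the idea and is not VILA's actual implementation.

```python
# A minimal sketch of loss-based multiple-choice scoring, assuming a Hugging Face
# causal LM. Each candidate option is scored by the average negative log-likelihood
# of its tokens conditioned on the question, and the option with the lowest loss
# is chosen as the answer. No GPT judge is needed. Illustrative only.
import torch

def score_option(model, tokenizer, prompt: str, option: str) -> float:
    """Return the model's loss (average NLL) over the option tokens."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask the prompt; score only the option
    with torch.no_grad():
        out = model(input_ids=full_ids, labels=labels)
    return out.loss.item()

def pick_option(model, tokenizer, prompt: str, options: list[str]) -> int:
    """Return the index of the option with the lowest loss."""
    losses = [score_option(model, tokenizer, prompt, o) for o in options]
    return min(range(len(losses)), key=losses.__getitem__)
```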