TXH-mercury / VAST

Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
https://arxiv.org/abs/2305.18500
MIT License
231 stars 15 forks

Is there any plan to release the fine-tuned models for downstream tasks? #15

Open Wenju-Huang opened 6 months ago

Fanzy27 commented 6 months ago

README.md

carlinds commented 5 months ago

README.md

As far as I can see, the README only describes the process of fine-tuning. Would it be possible to share the weights for the models you have already fine-tuned? In particular, I am interested in the model weights for VQA.

DelusionalLogic commented 4 months ago

I've run the retrieval finetune on MSR-VTT and uploaded it to Hugging Face: https://huggingface.co/delusionallogic/vast_finetune_msrvtt_retrieval
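A minimal sketch for pulling the checkpoint down locally with huggingface_hub (assuming it is installed; snapshot_download fetches the whole repo, whatever files it contains, so no specific file name is assumed here):

```python
# Sketch: download the fine-tuned MSR-VTT retrieval checkpoint from the Hub.
# Assumes `pip install huggingface_hub`; the file layout inside the repo is not
# documented in this thread, so we grab everything rather than a named file.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="delusionallogic/vast_finetune_msrvtt_retrieval",
)
print("checkpoint files downloaded to:", local_dir)
# Point the VAST evaluation/fine-tune config's checkpoint path at this directory.
```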

Trained on 4x RTX 6000 Ada GPUs, using 177.7 GB of video memory, 100 GB of system memory, and 64 GB of storage, for 13.5 hours at around 3.8 dollars an hour.
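That works out to roughly 13.5 h × $3.8/h ≈ $51 for the whole fine-tune.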

04/28/2024 06:58:49 - INFO - __main__ -   ==== evaluation--ret%tvas--msrvtt_ret_ret_itc_tvas========
04/28/2024 06:58:49 - INFO - __main__ -   {'video_r1': 52.7, 'video_recall': '52.7/78.1/86.9', 'video_ravg': 72.6}
04/28/2024 06:58:49 - INFO - __main__ -   ==== evaluation--ret%tvas--msrvtt_ret_ret_itm_tvas========
04/28/2024 06:58:49 - INFO - __main__ -   {'video_r1': 63.2, 'video_recall': '63.2/83.3/89.3', 'video_ravg': 78.6}
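For reference, video_r1 is recall@1, video_recall lists R@1/R@5/R@10, and video_ravg is their mean; the itc and itm blocks are presumably the contrastive and matching scoring heads. A minimal sketch of how recall@K is typically computed from a text-video similarity matrix (the repo's actual evaluation code may differ):

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-video recall@K from a [num_texts, num_videos] similarity matrix,
    assuming the ground-truth video for text i is video i."""
    ranks = []
    for i, row in enumerate(sim):
        order = np.argsort(-row)                        # candidates, best first
        ranks.append(int(np.where(order == i)[0][0]))   # rank of the true video
    ranks = np.asarray(ranks)
    recalls = {f"r{k}": float(np.mean(ranks < k)) * 100 for k in ks}
    recalls["ravg"] = float(np.mean(list(recalls.values())))
    return recalls

# e.g. with a random similarity matrix standing in for the ITC/ITM scores
print(recall_at_k(np.random.randn(100, 100)))
```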

iamthephd commented 3 months ago

@DelusionalLogic Thanks for sharing! When I tested the checkpoint you shared, I got results 3-4% lower than what you reported. Do you know why that might be?