TXH-mercury / VALOR

Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
https://arxiv.org/abs/2304.08345
MIT License
256 stars 15 forks

Plan to release finetuned models? #11

Closed yt2639 closed 11 months ago

yt2639 commented 1 year ago

Hi authors,

Amazing paper, and thanks for providing this nice code base. I have a question regarding the finetuned models, specifically for the video-text retrieval task. Do you have plans to release those models? I do understand that we can use the pretrained VALOR checkpoints provided in the main README (shown below)

Download Checkpoints

to finetune the pretrained models for downstream tasks. But the implementation details in the paper suggest using 8 A100 GPUs, which I don't have, so I probably cannot reproduce the results reported in the paper. Therefore, I am wondering if you plan to release the finetuned models for the video-text retrieval task?

Thanks! Shane

kenhuang1964 commented 1 year ago

Hey, @yt2639 did you find an alternative model?

yt2639 commented 1 year ago

> Hey, @yt2639 did you find an alternative model?

No, I downloaded the pretrained weights and finetuned them myself. It seems to get similar results on 8 A5000 GPUs for the msrvtt dataset. But still, if the authors can release the finetuned models, that would be great and very much appreciated.

thechargedneutron commented 1 year ago

@yt2639 Hi, what's the performance after finetuning? I am getting significantly lower scores after finetuning on 8 32GB V100 GPUs. I also faced some AssertionErrors as mentioned in #15, and I had to comment out all the assert checks in all the metrics files (BLEU, ROUGE, METEOR, etc.). Did you also have to do this?

Here is the performance when I finetune

07/01/2023 02:10:55 - INFO - __main__ -   ====-evaluation--cap%tva%tv--msrvtt_cap_tva=====step 10089--==========

07/01/2023 02:10:55 - INFO - __main__ -   {'Bleu_1': 78.76, 'Bleu_2': 67.74, 'Bleu_3': 55.93, 'Bleu_4': 44.78, 'METEOR': 28.8, 'ROUGE_L': 62.59, 'CIDEr': 55.79}
07/01/2023 02:10:55 - INFO - __main__ -   ======evaluation--cap%tva%tv--msrvtt_cap_tva====history best step: 4035==

07/01/2023 02:10:55 - INFO - __main__ -   {'Bleu_1': 79.48, 'Bleu_2': 67.83, 'Bleu_3': 55.77, 'Bleu_4': 44.78, 'METEOR': 29.13, 'ROUGE_L': 62.86, 'CIDEr': 56.34}
07/01/2023 02:10:55 - INFO - __main__ -   ====-evaluation--cap%tva%tv--msrvtt_cap_tv=====step 10089--==========

07/01/2023 02:10:55 - INFO - __main__ -   {'Bleu_1': 78.14, 'Bleu_2': 66.95, 'Bleu_3': 55.26, 'Bleu_4': 44.18, 'METEOR': 28.56, 'ROUGE_L': 62.32, 'CIDEr': 55.97}
07/01/2023 02:10:55 - INFO - __main__ -   ======evaluation--cap%tva%tv--msrvtt_cap_tv====history best step: 10089==

07/01/2023 02:10:55 - INFO - __main__ -   {'Bleu_1': 78.14, 'Bleu_2': 66.95, 'Bleu_3': 55.26, 'Bleu_4': 44.18, 'METEOR': 28.56, 'ROUGE_L': 62.32, 'CIDEr': 55.97}
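For what it's worth, instead of commenting out every assert inside the metric files, one option is to catch the AssertionError around each scorer call and fall back to NaN, so a single malformed caption doesn't abort the whole evaluation. This is a minimal sketch, not the repo's actual code; it only assumes the scorer follows the pycocoevalcap-style `compute_score(gts, res)` interface:

```python
import math

def safe_compute_score(scorer, gts, res):
    """Run a captioning scorer, degrading to NaN instead of crashing.

    `scorer` is assumed to expose the pycocoevalcap-style interface:
    compute_score(gts, res) -> (corpus_score, per_caption_scores).
    """
    try:
        return scorer.compute_score(gts, res)
    except AssertionError:
        # A malformed or empty caption tripped an internal sanity check;
        # report NaN for this metric rather than stopping the evaluation.
        return float("nan"), []
```

This keeps the sanity checks intact in the library while making the evaluation loop robust; a NaN in the log is also easier to spot later than a silently skipped assert.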
yt2639 commented 1 year ago

> @yt2639 Hi, what's the performance after finetuning? I am getting significantly lower scores after finetuning on 8 32GB V100 GPUs. I also faced some AssertionErrors as mentioned in #15 and I had to comment out all the assert checks in all the metrics files (BLEU, ROUGE, METEOR etc.). Did you also have to do this?

Here is the performance when I finetune

07/01/2023 02:10:55 - INFO - __main__ -   ====-evaluation--cap%tva%tv--msrvtt_cap_tva=====step 10089--==========

07/01/2023 02:10:55 - INFO - __main__ -   {'Bleu_1': 78.76, 'Bleu_2': 67.74, 'Bleu_3': 55.93, 'Bleu_4': 44.78, 'METEOR': 28.8, 'ROUGE_L': 62.59, 'CIDEr': 55.79}
07/01/2023 02:10:55 - INFO - __main__ -   ======evaluation--cap%tva%tv--msrvtt_cap_tva====history best step: 4035==

07/01/2023 02:10:55 - INFO - __main__ -   {'Bleu_1': 79.48, 'Bleu_2': 67.83, 'Bleu_3': 55.77, 'Bleu_4': 44.78, 'METEOR': 29.13, 'ROUGE_L': 62.86, 'CIDEr': 56.34}
07/01/2023 02:10:55 - INFO - __main__ -   ====-evaluation--cap%tva%tv--msrvtt_cap_tv=====step 10089--==========

07/01/2023 02:10:55 - INFO - __main__ -   {'Bleu_1': 78.14, 'Bleu_2': 66.95, 'Bleu_3': 55.26, 'Bleu_4': 44.18, 'METEOR': 28.56, 'ROUGE_L': 62.32, 'CIDEr': 55.97}
07/01/2023 02:10:55 - INFO - __main__ -   ======evaluation--cap%tva%tv--msrvtt_cap_tv====history best step: 10089==

07/01/2023 02:10:55 - INFO - __main__ -   {'Bleu_1': 78.14, 'Bleu_2': 66.95, 'Bleu_3': 55.26, 'Bleu_4': 44.18, 'METEOR': 28.56, 'ROUGE_L': 62.32, 'CIDEr': 55.97}

Hi @thechargedneutron, I didn't get the AssertionErrors. I only finetuned on the video-text retrieval task on the msrvtt dataset, and here is the log I got:

20:17:18 - INFO - __main__ -   ====-evaluation--ret%tva%tv--msrvtt_ret_t_v=====step 9789--==========

20:17:18 - INFO - __main__ -   {'video_recall': '50.6/77.6/85.9', 'video_ravg': 71.4, 'video_medianR': 1.0, 'video_meanR': 12.203125}
20:17:18 - INFO - __main__ -   ======evaluation--ret%tva%tv--msrvtt_ret_t_v====history best step: 4894==

20:17:18 - INFO - __main__ -   {'video_recall': '53.0/77.7/86.1', 'video_ravg': 72.3, 'video_medianR': 1.0, 'video_meanR': 11.34375}
20:17:18 - INFO - __main__ -   ====-evaluation--ret%tva%tv--msrvtt_ret_t_va=====step 9789--==========

20:17:18 - INFO - __main__ -   {'video_recall': '54.5/80.8/88.0', 'video_ravg': 74.4, 'video_medianR': 1.0, 'video_meanR': 11.1171875}
20:17:18 - INFO - __main__ -   ======evaluation--ret%tva%tv--msrvtt_ret_t_va====history best step: 9789==

20:17:18 - INFO - __main__ -   {'video_recall': '54.5/80.8/88.0', 'video_ravg': 74.4, 'video_medianR': 1.0, 'video_meanR': 11.1171875}
20:19:19 - INFO - __main__ -   {'loss_ret%tva%tv--msrvtt_ret/contra_loss': 0.2164306640625, 'loss_ret%tva%tv--msrvtt_ret/total_loss': 0.2164306640625}

So I am not sure whether Table 3 in the paper reports the t_va or the t_v number. If it is t_v, then I only got 50.6 (or 53.0), which is lower than the 54.4 reported in Table 3. But my t_va number, 54.5, is close, so I guess maybe Table 3 reports the t_va number?
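For anyone comparing their numbers against these logs, the retrieval metrics (recall@1/5/10, median rank, mean rank) can be reproduced from a text-video similarity matrix roughly like this. A hedged sketch, not the repo's exact evaluation code; it assumes the ground-truth pairing is along the diagonal (text i matches video i):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Text-to-video retrieval metrics from a [num_texts, num_videos]
    similarity matrix, assuming text i's ground-truth video is video i."""
    # Rank position of the ground-truth video for each text query (0 = best).
    order = np.argsort(-sim, axis=1)            # videos sorted by descending similarity
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks < k)) for k in ks}
    metrics["medianR"] = float(np.median(ranks) + 1)   # 1-indexed, as in the logs
    metrics["meanR"] = float(np.mean(ranks) + 1)
    return metrics
```

With a convention like this, the logged `video_recall '54.5/80.8/88.0'` corresponds to R@1/R@5/R@10, and `video_medianR`/`video_meanR` are the 1-indexed median and mean ranks.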

One slightly odd thing is that I can actually fit train_batch_size = 64 on my 8 x 24GB A5000 GPUs. I am not sure if this is expected; since the authors reported using A100 GPUs, I initially assumed train_batch_size = 64 would not fit on A5000s.
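In case the full batch does not fit on a given GPU, gradient accumulation over micro-batches gives the same effective batch size at the cost of more steps. A minimal PyTorch sketch; the tiny model, optimizer choice, and batch split here are illustrative stand-ins, not VALOR's actual training loop:

```python
import torch

# Illustrative model and optimizer, not VALOR's.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

accum_steps = 4  # 4 micro-batches of 16 -> effective batch size 64
optimizer.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(16, 8)
    y = torch.randn(16, 1)
    # Scale the loss so accumulated gradients average over the full batch.
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()          # gradients accumulate across micro-batches
optimizer.step()             # one optimizer update per effective batch
optimizer.zero_grad()
```

Note this is only numerically equivalent for losses that average over the batch, and batch-dependent layers (e.g. BatchNorm) will still see the smaller micro-batch statistics.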

thechargedneutron commented 1 year ago

Thanks for your comments. You did not get AssertionErrors because those come from the captioning metrics, and you ran retrieval. +1 to the request to release finetuned models for the captioning tasks.

TXH-mercury commented 1 year ago

> So I am not sure if they reported t_va number or t_v in Table 3 in the paper. [...] So I guess maybe they reported t_va number in Table 3?

The T-VA metric is reported.

TXH-mercury commented 1 year ago

@thechargedneutron @yt2639 @kenhuangsy Hey guys, the finetuned checkpoints of VALOR-base/large on the MSRVTT caption/retrieval datasets have been released now. Thanks for your attention.

Haawron commented 1 year ago

Could you please share your plans for releasing other versions of the finetuned models? I am eagerly anticipating the one trained on ActivityNet-QA.