WillDreamer / Aurora

[NeurIPS2023] Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
https://arxiv.org/abs/2305.08381

The details of the experiments look very solid. Has anyone reproduced them successfully? #12

Closed Arsiuuu closed 9 months ago

Arsiuuu commented 9 months ago

I tried several times, but got poor results.

BinhuiXie commented 9 months ago

+1

Arsiuuu commented 9 months ago

I only tried MSR-VTT for video-text retrieval and also got poor results. @BinhuiXie

xinlong-yang commented 9 months ago

@BinhuiXie Hi there. Because we ran a large number of ablation experiments and had limited storage, we unfortunately did not keep all checkpoint parameters. However, we did find the Flickr30K results from one version of the ablation experiments, and you can see that R@1 does not drop much. If R@1 drops too much (by more than 10 points), we think training may not have converged. (Flickr30K ablation results attached as an image.)

xinlong-yang commented 9 months ago

@Arsiuuu Hi there, we follow the same settings as UniAdapter, so please check that your pre-trained .pth and dataset are consistent with UniAdapter. Also, since the ITC loss uses a global queue, it is best to use roughly the same batch size and number of GPUs.
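For intuition, here is a rough sketch of a BLIP/ALBEF-style ITC loss with a global feature queue; this is not the repo's code, and the names (ITCQueue, itc_loss) and the dim/queue_size defaults are assumptions. Each step pushes per-GPU-batch x nGPUs features into the queue, so changing the effective batch size changes how many negatives are added per step and how stale the queued negatives are, which is why retrieval numbers can shift.

import torch
import torch.nn.functional as F

class ITCQueue:
    """Global queue of negative features, roughly in the ALBEF/BLIP style (illustrative only)."""
    def __init__(self, dim=256, queue_size=57600):
        # dim and queue_size are illustrative defaults, not the repo's values
        self.queue_size = queue_size
        self.image_queue = F.normalize(torch.randn(dim, queue_size), dim=0)
        self.text_queue = F.normalize(torch.randn(dim, queue_size), dim=0)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, image_feat, text_feat):
        # In DDP these features would first be all_gather-ed, so the number pushed
        # per step equals per-GPU batch size * nGPUs; queue_size is assumed to be
        # divisible by that effective batch size.
        bs = image_feat.shape[0]
        self.image_queue[:, self.ptr:self.ptr + bs] = image_feat.t()
        self.text_queue[:, self.ptr:self.ptr + bs] = text_feat.t()
        self.ptr = (self.ptr + bs) % self.queue_size

def itc_loss(image_feat, text_feat, queue, temp=0.07):
    # Positives are the in-batch pairs; negatives are the other in-batch items
    # plus everything currently sitting in the global queue.
    sim_i2t = image_feat @ torch.cat([text_feat.t(), queue.text_queue], dim=1) / temp
    sim_t2i = text_feat @ torch.cat([image_feat.t(), queue.image_queue], dim=1) / temp
    targets = torch.arange(image_feat.shape[0])
    return (F.cross_entropy(sim_i2t, targets) + F.cross_entropy(sim_t2i, targets)) / 2

# Toy usage: a smaller effective batch refreshes the queue more slowly,
# so the queued negatives are staler relative to the current model state.
queue = ITCQueue()
img = F.normalize(torch.randn(32, 256), dim=-1)
txt = F.normalize(torch.randn(32, 256), dim=-1)
loss = itc_loss(img, txt, queue)
queue.enqueue(img, txt)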

BinhuiXie commented 9 months ago

@xinlong-yang thank you for your response.

Actually, the large drop was caused by an incorrect command.

original

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_flickr.yaml --output_dir output/flickr --evaluate 

This only loads the BLIP pre-trained parameters :rofl:

The parameters fine-tuned by Aurora can be loaded as follows.

correct

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_flickr.yaml --output_dir output/flickr --evaluate --pretrained output/flickr/checkpoint_3.pth
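For anyone else hitting this, here is a hypothetical sketch of how BLIP-style scripts typically merge the YAML config with the --pretrained flag; the 'pretrained' config key and the exact logic are assumptions, not the actual train_retrieval.py. It just illustrates why omitting the flag silently evaluates the original BLIP weights.

import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument('--config', default='./configs/retrieval_flickr.yaml')
parser.add_argument('--output_dir', default='output/flickr')
parser.add_argument('--evaluate', action='store_true')
parser.add_argument('--pretrained', default='', help='fine-tuned checkpoint to evaluate')
args = parser.parse_args()

config = yaml.safe_load(open(args.config))
if args.pretrained:
    # e.g. output/flickr/checkpoint_3.pth from the command above
    config['pretrained'] = args.pretrained
# otherwise config['pretrained'] keeps its default, i.e. the BLIP pre-trained .pth
print('loading weights from:', config['pretrained'])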

Thanks again! Keep up the fantastic work :rocket: