BrandonHanx / mmf

[ECCV 2022] FashionViL: Fashion-Focused V+L Representation Learning
https://mmf.sh/

Pre-trained model has low recall on TGIR task #7

Closed pntt3011 closed 2 years ago

pntt3011 commented 2 years ago

❓ Questions and Help

In your paper, the average recall of the fixed encoder without fine-tuning is about 30% on the TGIR task. But when I run the code with your pre-trained model, the result is much lower (< 10%).

What I have done

1. Install dependencies:

```bash
conda create -n mmf python=3.7
conda activate mmf
git clone https://github.com/BrandonHanx/mmf.git
cd mmf
pip install --editable .
cd ..
pip install wandb einops
```

2. Prepare the dataset (see the note after step 4 on pointing MMF at this data):
[FashionIQ dataset](https://github.com/XiaoxiaoGuo/fashion-iq/issues/18)
[Your metadata](https://drive.google.com/drive/folders/1H6CodM5Bh9SxsIrrOca9MoqU_A6T6xL4)

3. Download the pre-trained models:
[VQVAE](https://drive.google.com/file/d/11QKoXEG1NeFqUyLg4kOjkJTgQsiYHpdu/view)
[Your pretrained checkpoint](https://drive.google.com/file/d/1G_RyxQNbmkQDN6xUjP-IH2D22jW8bPz3/view)

4. Run this command:
```bash
python mmf_cli/run.py \
config=projects/fashionvil/configs/e2e_composition.yaml \
model=fashionvil \
dataset=fashioniq \
run_type=test \
checkpoint.resume_file=save/fashionvil_e2e_pretrain_final/fashionvil_final.pth
```
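
A note on data location: if MMF cannot find FashionIQ, the data directory can be overridden on the same command. This is a hedged example assuming MMF's standard `env.data_dir` override; the exact layout expected under it comes from the fashioniq dataset config:

```bash
python mmf_cli/run.py \
config=projects/fashionvil/configs/e2e_composition.yaml \
model=fashionvil \
dataset=fashioniq \
run_type=test \
checkpoint.resume_file=save/fashionvil_e2e_pretrain_final/fashionvil_final.pth \
env.data_dir=/path/to/data
```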

Results

```
wandb: Run summary:
wandb:                  test/fashioniq/bbc_loss 5.17214
wandb:         test/fashioniq/r@k_fashioniq/avg 0.05547
wandb:  test/fashioniq/r@k_fashioniq/dress_r@10 0.02082
wandb:  test/fashioniq/r@k_fashioniq/dress_r@50 0.07685
wandb:  test/fashioniq/r@k_fashioniq/shirt_r@10 0.0314
wandb:  test/fashioniq/r@k_fashioniq/shirt_r@50 0.08391
wandb: test/fashioniq/r@k_fashioniq/toptee_r@10 0.02652
wandb: test/fashioniq/r@k_fashioniq/toptee_r@50 0.09332
wandb:                          test/total_loss 5.17214
wandb:                      trainer/global_step 0
```

I expected `test/fashioniq/r@k_fashioniq/avg` to be around 30%.

(You can check this Colab notebook for more information.) Thank you very much.

BrandonHanx commented 2 years ago

There are two reasons:

  1. The model you used is only a pre-trained model; you need to fine-tune it on FashionIQ (see the README). In other words, you are testing FashionViL in a zero-shot manner.
  2. The image encoder of the fixed-encoder method mentioned in Table 4 is an off-the-shelf ResNet152, so you need to extract image features with ResNet152 first. Please note that "fixed encoder" means the ResNet152 is not updated during pre-training or fine-tuning. This is for a fair comparison with other methods; it is not used as our final method. (A sketch of this kind of feature extraction is below.)
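
For reference, a minimal sketch of what off-the-shelf ResNet152 feature extraction could look like with torchvision. The folder and file names here are hypothetical, and the exact preprocessing used for Table 4 may differ; check the repo's own feature-extraction code.

```python
# Minimal sketch: extract pooled ResNet152 features for a folder of images.
# Assumes torchvision's ImageNet weights and standard preprocessing.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from pathlib import Path

device = "cuda" if torch.cuda.is_available() else "cpu"

# ResNet152 with the classifier head replaced by Identity -> 2048-d pooled features.
resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval().to(device)

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

features = {}
with torch.no_grad():
    for path in Path("images").glob("*.jpg"):  # hypothetical image folder
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        features[path.stem] = resnet(x).squeeze(0).cpu()  # one 2048-d vector per image

torch.save(features, "resnet152_features.pt")  # hypothetical output path
```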
BrandonHanx commented 2 years ago

I uploaded the fine-tuned model (tgir on FashionIQ) here: https://drive.google.com/file/d/1R8DLIHt0VazrJnZLA6jK-FfzP3m8OwFb/view?usp=share_link, just FYI.

Now you can run evaluation according to https://github.com/BrandonHanx/mmf#evaluation
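
For example, mirroring the test command above (the local checkpoint filename here is hypothetical; substitute whatever name the downloaded file was saved under):

```bash
python mmf_cli/run.py \
config=projects/fashionvil/configs/e2e_composition.yaml \
model=fashionvil \
dataset=fashioniq \
run_type=test \
checkpoint.resume_file=save/fashionvil_tgir_fashioniq_final.pth
```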

pntt3011 commented 2 years ago

Thank you for the reply, @BrandonHanx. I will try that model and let you know the result later.

BrandonHanx commented 2 years ago

No worries, please let me know if you have any problems.

pntt3011 commented 2 years ago

Hi @BrandonHanx, the fine-tuned model works just as reported in the paper. I am building a fashion search engine for my college graduation thesis, so your paper helps me a lot.

A little off-topic, but may I have your fine-tuned model for the OCIR task?

BrandonHanx commented 2 years ago

You are welcome.

Sorry, I only have the TGIR and ITR/TIR fine-tuned models at hand, since it has been quite a long time. You can fine-tune it yourself according to the instructions in the README (a sketch of the command shape is below).
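
A hedged sketch of that fine-tuning command, with placeholders since the exact OCIR config and dataset names should be looked up under projects/fashionvil/configs/ and the README (`checkpoint.resume_pretrained` is assumed to be MMF's standard flag for warm-starting from a pre-trained checkpoint):

```bash
python mmf_cli/run.py \
config=projects/fashionvil/configs/<ocir_config>.yaml \
model=fashionvil \
dataset=<ocir_dataset> \
run_type=train \
checkpoint.resume_pretrained=True \
checkpoint.resume_file=save/fashionvil_e2e_pretrain_final/fashionvil_final.pth
```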

pntt3011 commented 2 years ago

Thanks for considering my request. I'll close the issue now.

renrenzsbbb commented 1 year ago

> I uploaded the fine-tuned model (tgir on FashionIQ) here: https://drive.google.com/file/d/1R8DLIHt0VazrJnZLA6jK-FfzP3m8OwFb/view?usp=share_link, just FYI.
>
> Now you can run evaluation according to https://github.com/BrandonHanx/mmf#evaluation

Thanks for your great work. Could you also release the pre-trained model for the sub-category task? Thanks in advance.