CrossmodalGroup / GSMN

Implementation of our CVPR2020 paper, Graph Structured Network for Image-Text Matching

reproduction #8

Open gedaye11 opened 4 years ago

gedaye11 commented 4 years ago

Hi, first of all, I appreciate the paper you wrote; the content is very clear. However, using the code you released with the parameters you provided, it is very difficult to reproduce the results in the paper; my results are far from them. Would it be possible to release the pretrained model for us to use? I really want to build something innovative on top of your work. I hope to get your reply.

CrossmodalGroup commented 4 years ago

As we have stated before, the best performance is achieved by the sparse + dense ensemble, and we have uploaded the ensemble code "test_stack.py". A single model should be slightly lower (about 1% Recall@K) than the results in our paper, because some details went missing while we reorganized the paper. We will fix this issue as soon as possible. The pretrained model will also be released as soon as possible.
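The thread does not show what "test_stack.py" actually does. As a rough illustration only, one common way to ensemble two image-text matching models is to average their similarity matrices before ranking. The sketch below assumes that approach; the function names, the toy similarity values, and the one-caption-per-image setup are all hypothetical and are not taken from the GSMN code.

```python
# Hypothetical sketch: ensembling two matching models by averaging their
# image-by-caption similarity matrices. Not the repository's test_stack.py.

def average_similarities(sim_a, sim_b):
    """Element-wise mean of two similarity matrices of the same shape."""
    return [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(sim_a, sim_b)]

def recall_at_k(sim, k):
    """Image-to-text Recall@K in percent, assuming caption i matches image i."""
    hits = 0
    for i, row in enumerate(sim):
        # Caption indices sorted from most to least similar to image i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)

# Toy similarity scores (made up) for 3 images and 3 captions.
sim_sparse = [[0.9, 0.2, 0.1],
              [0.3, 0.4, 0.5],
              [0.2, 0.1, 0.8]]
sim_dense = [[0.4, 0.5, 0.1],
             [0.2, 0.9, 0.3],
             [0.1, 0.3, 0.7]]

sim_ensemble = average_similarities(sim_sparse, sim_dense)
print(recall_at_k(sim_sparse, 1), recall_at_k(sim_dense, 1), recall_at_k(sim_ensemble, 1))
```

In this toy case each single model ranks one image's caption incorrectly, while the averaged scores rank all three correctly, which is the kind of gain an ensemble can give over a single model.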

gedaye11 commented 4 years ago

I installed your released code exactly as provided and trained for 30 epochs. The single model still could not reach the performance reported in your paper.

CrossmodalGroup commented 4 years ago

A single model should be slightly lower (about 1% Recall@K). We will fix it!

gedaye11 commented 4 years ago

The best single model falls short by more than 1% Recall@K!

CrossmodalGroup commented 4 years ago

Which dataset do you use? Can you provide all your Recall values?

gedaye11 commented 4 years ago

I'm going to write up how I ran it and the results and post them here, and I hope you'll correct me.

Training command:

python train.py --data_path " /data" --data_name f30k_precomp --vocab_path "/vocab" --logger_name runs/log --model_name "Weights_coco" --bi_gru --max_violation --lambda_softmax 20 --num_epochs 30 --lr_update 15 --learning_rate 0.0002 --embed_size 1024 --batch_size 64

Best model results:

calculate similarity time: 391.327766895
rsum: 473.8
Average i2t Recall: 85.3
Image to text: 68.9 91.2 95.9 1.0 4.3
Average t2i Recall: 72.6
Text to image: 52.5 78.8 86.5 1.0 9.9

I haven't made any changes to the code except for the data path. If I understand correctly, this is the result of the dense model on f30k.
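For readers checking these numbers: the reported rsum and the two averages are consistent with the first three values on each "Image to text" / "Text to image" line being R@1, R@5 and R@10. That column reading is an assumption based on the numbers themselves; the thread does not label the columns. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
# Values copied from the results above. Reading the first three columns as
# R@1, R@5, R@10 is an assumption; the log itself does not label them.
i2t = [68.9, 91.2, 95.9]  # image-to-text
t2i = [52.5, 78.8, 86.5]  # text-to-image

def summarize(i2t, t2i):
    """Return (rsum, average i2t recall, average t2i recall), rounded to 0.1."""
    rsum = round(sum(i2t) + sum(t2i), 1)
    avg_i2t = round(sum(i2t) / len(i2t), 1)
    avg_t2i = round(sum(t2i) / len(t2i), 1)
    return rsum, avg_i2t, avg_t2i

print(summarize(i2t, t2i))  # (473.8, 85.3, 72.6)
```

The output matches the logged rsum of 473.8 and the averages 85.3 and 72.6, so rsum here is the sum of the six recall values.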

gedaye11 commented 4 years ago

I hope you can correct me and help me reproduce the results in your paper.

CrossmodalGroup commented 4 years ago

Thanks! We will check our code!

gedaye11 commented 4 years ago

I think you could first release your single pretrained model!

CrossmodalGroup commented 4 years ago

Hi, we have modified our code and uploaded the single pretrained model: https://drive.google.com/file/d/1kEi92w49Et5D2WVOv-Lc52HcpF2SPNNF/view?usp=sharing

The result of this model is:

rsum: 481.4
Average i2t Recall: 87.0
Image to text: 74.4 91.1 95.4 1.0 3.4
Average t2i Recall: 73.5
Text to image: 54.1 79.9 86.5 1.0 9.4
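To see where this released model differs from the run reported earlier in the thread, the two sets of recall values can be compared column by column. This is only a sketch: the helper names are hypothetical, and reading the first three columns as R@1, R@5 and R@10 is an assumption, since the logs do not label them.

```python
# Recall values copied from the two result listings in this thread.
# Column order (R@1, R@5, R@10) is assumed, not stated in the logs.
reproduced = {"i2t": [68.9, 91.2, 95.9], "t2i": [52.5, 78.8, 86.5]}
released = {"i2t": [74.4, 91.1, 95.4], "t2i": [54.1, 79.9, 86.5]}

def rsum(run):
    """Sum of all six recall values, rounded to 0.1."""
    return round(sum(run["i2t"]) + sum(run["t2i"]), 1)

def gaps(base, other):
    """Per-column difference (other minus base) for each retrieval direction."""
    return {key: [round(o - b, 1) for b, o in zip(base[key], other[key])]
            for key in base}

print(rsum(reproduced), rsum(released))
print(gaps(reproduced, released))
```

Under that reading, the rsum difference is 7.6 points, and most of it comes from the first image-to-text column (74.4 versus 68.9); the remaining columns differ by at most 1.6 points.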

gedaye11 commented 4 years ago

Can you tell me about the environment you ran in? For example, a requirements.txt file.

CrossmodalGroup commented 4 years ago

The requirements are listed on the homepage.