Open z27833009 opened 1 year ago
qrel.csv is in the downloaded dataset zip file. You can check document/MOCHEG_dataset_statement.pdf to understand the dataset structure and obtain the (text/image)_qrel.csv file.
You can specify the path to this qrel.csv file via a Python argument. Please check the retrieve script to find the corresponding path argument. Thanks.
There are indeed 3 (text/image)_qrel.csv files in each file, but which one should I use? According to some of your annotations, I think text_evidence_qrels_sentence_level.csv should be used. Besides, I would like to understand the purpose of cross_encoder, and where to find a query_result_txt.csv to use in place of the default top_candidate_corpus_path parameter in retrieve_train.py.
1. For text, it is text_evidence_qrels_sentence_level.csv. 2. For image, there is only one qrel file. 3. For the purpose of the cross-encoder, please check https://www.sbert.net/examples/applications/cross-encoder/README.html
Thanks for your explanation, but how can I get query_result_txt.csv for the top_candidate_corpus_path parameter in retrieve_train.py? And do I need to train a cross-encoder myself? I could not find the step for training the cross-encoder.
As we mentioned in the paper, "The BERT-based re-ranking model is pre-trained on the MS MARCO Passage Ranking dataset which is designed for text retrieval." You do not have to train cross-encoder since we used the pre-trained cross-encoder. See "cross_encoder_checkpoint" argument in retrieve_similarity_recall.py for the detail.
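For readers unfamiliar with the retrieve-then-re-rank pattern mentioned above, here is a minimal sketch of the re-ranking step. It is not the repository's code: the function names are hypothetical, and token_overlap_score is a toy stand-in for the pre-trained cross-encoder so that the example runs without downloading a model. With sentence-transformers installed, the scoring function could presumably be swapped for one that calls the predict method of a CrossEncoder loaded from a pre-trained MS MARCO checkpoint.

```python
from typing import Callable, List, Tuple


def token_overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder (assumption, not the real model):
    the fraction of query tokens that also appear in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float] = token_overlap_score,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair jointly and keep the top_k best.

    `candidates` would normally be the top results returned by the bi-encoder.
    """
    scored = [(candidate, score_pair(query, candidate)) for candidate in candidates]
    # Highest score first; Python's sort is stable, so ties keep bi-encoder order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    claim = "vaccine causes autism"
    bi_encoder_hits = [
        "the weather is nice",
        "study finds vaccine not linked to autism",
        "vaccine causes autism claim debunked",
    ]
    for passage, score in rerank(claim, bi_encoder_hits, top_k=2):
        print(f"{score:.2f}  {passage}")
```

The point of the design is that the bi-encoder narrows a large corpus to a short candidate list cheaply, and the more expensive pairwise scorer only has to look at that short list.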
Sorry for the delay. We were just catching up after the conference deadline.
I've also come across situations where files don't exist:
FileNotFoundError: [Errno 2] No such file or directory: '/data/Projects/Mocheg/data/images/00017-proof-06-GettyImages-1137888397.jpg'
when I run
python retrieve_similarity_recall.py --bi_encoder_checkpoint=/data/Projects/Mocheg/retrieval/output/runs_3/00005-train_bi-encoder-multi-qa-MiniLM-L6-cos-v1-2023-11-24_14-10-21 --image_encoder_checkpoint=/data/Projects/Mocheg/retrieval/output/runs_3/00004-train_bi-encoder-clip-ViT-B-32-2023-11-24_11-18-44 --media=img_txt --top_k=10 --csv_out_dir=/data/Projects/Mocheg/data/test/retrieval/retrieval_result_10.csv
@OPilgrim There is no 00017-proof-06-GettyImages-1137888397.jpg, however there is 00017-530390-06-GettyImages-1137888397.jpg. Could you debug the code to see why it searches for "00017-proof-06-GettyImages-1137888397.jpg"? Could you also share the complete error track, like in which function you encountered this error? The issue does not appear in my local running.
The problem occurred when retrieving images. At first, I thought there was a setting of content="proof", but after I printed out the value of content, there were only "all" and "img", so it was not clear where the proof in the image name came from.
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 96610/96610 [17:46<00:00, 90.59it/s]
0/2001: 0.3024411201477051, 0.2415764480829239
100/2001: 0.5788431167602539, 0.4707832932472229
200/2001: 0.5831706523895264, 0.4668898582458496
300/2001: 0.5832053422927856, 0.4636369049549103
400/2001: 0.5863810181617737, 0.48484358191490173
500/2001: 0.5832879543304443, 0.5186882019042969
600/2001: 0.5846362113952637, 0.5425591468811035
700/2001: 0.586365818977356, 0.5623708963394165
800/2001: 0.5859039425849915, 0.578498125076294
900/2001: 0.588605523109436, 0.5907866954803467
1000/2001: 0.5892195701599121, 0.6003651022911072
1100/2001: 0.5880477428436279, 0.5923115015029907
1200/2001: 0.5861196517944336, 0.6010985970497131
1300/2001: 0.5832314491271973, 0.603545069694519
1400/2001: 0.5815831422805786, 0.606799304485321
1500/2001: 0.5794385075569153, 0.6095340847969055
1600/2001: 0.5797507762908936, 0.6133111119270325
1700/2001: 0.5788807272911072, 0.6144909858703613
1800/2001: 0.5782501101493835, 0.6066485643386841
1900/2001: 0.5776910185813904, 0.5988103747367859
2000/2001: 0.5785987377166748, 0.5923917889595032
0.578887939453125, 0.5926878452301025,0.5857065916061401
ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.
Images: 122246
0: 0.0, 0.0
Traceback (most recent call last):
File "/data/Projects/Mocheg/retrieve_similarity_recall.py", line 48, in <module>
main()
File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/click/decorators.py", line 33, in new_func
return f(get_current_context(), *args, **kwargs)
File "/data/Projects/Mocheg/retrieve_similarity_recall.py", line 43, in main
training_loop.training_loop(args,rank=0)
File "/data/Projects/Mocheg/retrieval/training/training_loop.py", line 28, in training_loop
image_retrieve(args,relevant_document_img_list,dataloader,saver)
File "/data/Projects/Mocheg/retrieval/training/training_loop.py", line 116, in image_retrieve
cur_precision,cur_recall=scorer.precision_recall_by_similarity(semantic_results,relevant_document_img_list,img_evidence_list,image_corpus)
File "/data/Projects/Mocheg/retrieval/utils/metrics.py", line 67, in precision_recall_by_similarity
retrieved_document_list,evidence_document_list=get_images(retrieved_document_name_list,evidence_document_name_list,img_folder)
File "/data/Projects/Mocheg/retrieval/utils/metrics.py", line 76, in get_images
evidence_document_list=[Image.open(os.path.join(img_folder,filepath)) for filepath in evidence_document_name_list]
File "/data/Projects/Mocheg/retrieval/utils/metrics.py", line 76, in <listcomp>
evidence_document_list=[Image.open(os.path.join(img_folder,filepath)) for filepath in evidence_document_name_list]
File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/PIL/Image.py", line 3243, in open
fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/data/Projects/Mocheg/data/images/00017-proof-06-GettyImages-1137888397.jpg'
Sorry for the late reply. After checking the dataset, we find that that specific image is missing in the released dataset. Sorry for the inconvenience. We have updated the dataset (click to download mocheg_with_tweet_2023_03.tar.gz). Do you mind redownloading the updated dataset? In the updated dataset, you should be able to find the image "00017-proof-06-GettyImages-1137888397.jpg" under the "mocheg/images" folder. Thanks!
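After re-downloading, it may be worth checking up front that every evidence image name exists in the image folder, rather than waiting for Image.open to fail partway through evaluation. The sketch below is one way to do that with the standard library only. The helper names are hypothetical, and the idea that the text before the first "-" identifies related images is an assumption based on the file names quoted in this thread.

```python
import os
import tempfile
from typing import List


def find_missing_files(file_names: List[str], folder: str) -> List[str]:
    """Return the names in file_names that are not present in folder."""
    existing = set(os.listdir(folder))
    return [name for name in file_names if name not in existing]


def similar_names(missing_name: str, folder: str) -> List[str]:
    """List files in folder that share the leading prefix of missing_name.

    Assumption: the text before the first '-' (e.g. '00017') groups related images.
    """
    prefix = missing_name.split("-", 1)[0] + "-"
    return sorted(name for name in os.listdir(folder) if name.startswith(prefix))


if __name__ == "__main__":
    # Demo on a throwaway folder; point img_folder at the real image folder instead.
    with tempfile.TemporaryDirectory() as img_folder:
        present = "00017-530390-06-GettyImages-1137888397.jpg"
        open(os.path.join(img_folder, present), "wb").close()

        evidence_names = [present, "00017-proof-06-GettyImages-1137888397.jpg"]
        for name in find_missing_files(evidence_names, img_folder):
            print("missing:", name)
            print("  similar files:", similar_names(name, img_folder))
```

In real use, evidence_names would come from the image qrel file; the column holding the file name is not shown in this thread, so it is left out here.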
When I ran the inference command from train.sh (python retrieve_train.py --mode=test --train_config=CROSS_ENCODER), it failed because the qrels.csv file is missing from data/train.