VT-NLP / Mocheg

Dataset and Code for Multimodal Fact Checking and Explanation Generation (Mocheg)
Apache License 2.0

Missing file in data/train #6

Open z27833009 opened 1 year ago

z27833009 commented 1 year ago

When I ran the inference command from train.sh, python retrieve_train.py --mode=test --train_config=CROSS_ENCODER, it failed with an error because the file qrels.csv is missing from data/train.

Barry-Menglong-Yao commented 1 year ago

qrel.csv is in the downloaded dataset zip file. You can check document/MOCHEG_dataset_statement.pdf to understand the dataset structure and obtain the (text/image)_qrel.csv file.

You can pass the path to this qrel.csv file as a Python argument. Please check the retrieval script to find the corresponding path argument. Thanks.
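
As a rough illustration of the advice above, the following sketch lists every CSV file whose name contains "qrel" under an extracted dataset folder, so that the right path can be passed to the retrieval script. The folder name mocheg and the helper find_qrel_files are assumptions made for illustration; they are not part of the Mocheg code.

```python
from pathlib import Path


def find_qrel_files(dataset_root):
    """Return sorted paths of all CSV files under dataset_root whose name contains 'qrel'."""
    root = Path(dataset_root)
    if not root.is_dir():
        # Nothing to search if the folder does not exist.
        return []
    return sorted(p for p in root.rglob("*.csv") if "qrel" in p.name.lower())


if __name__ == "__main__":
    # "mocheg" is an assumed name for the folder the dataset archive was extracted to.
    for path in find_qrel_files("mocheg"):
        print(path)
```

Each printed path is a candidate value for the qrel path argument of the retrieval script.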

z27833009 commented 1 year ago

There are indeed 3 (text/image)_qrel.csv files in each folder, but which one should I use? Based on some of your annotations, I think text_evidence_qrels_sentence_level.csv should be used. Besides, I would like to understand the purpose of cross_encoder, and where to find the query_result_txt.csv file that is the default value of the top_candidate_corpus_path parameter in retrieve_train.py.

Barry-Menglong-Yao commented 1 year ago

  1. For text, it is text_evidence_qrels_sentence_level.csv.
  2. For image, there is only one qrel file.
  3. For the purpose of the cross-encoder, please check https://www.sbert.net/examples/applications/cross-encoder/README.html

z27833009 commented 1 year ago

Thanks for your explanation, but how can I get the query_result_txt.csv file for the top_candidate_corpus_path parameter in retrieve_train.py? And do I need to train a cross-encoder myself? I could not find a step for training the cross-encoder.

Barry-Menglong-Yao commented 1 year ago

As we mentioned in the paper, "The BERT-based re-ranking model is pre-trained on the MS MARCO Passage Ranking dataset which is designed for text retrieval." You do not have to train a cross-encoder, since we used a pre-trained one. See the "cross_encoder_checkpoint" argument in retrieve_similarity_recall.py for details.
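
To make the re-ranking step concrete, here is a minimal sketch of how a pre-trained cross-encoder can re-order retrieved candidates with the sentence-transformers library. The checkpoint name cross-encoder/ms-marco-MiniLM-L-6-v2 is an assumption (one publicly available MS MARCO model); the checkpoint Mocheg actually uses is whatever the cross_encoder_checkpoint argument points to. The helpers rerank and load_msmarco_scorer are hypothetical names.

```python
def rerank(query, candidates, score_fn, top_k=10):
    """Score every (query, candidate) pair and return the top_k candidates, best first."""
    pairs = [(query, text) for text in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def load_msmarco_scorer(checkpoint="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Load a pre-trained cross-encoder; no training is needed.

    The checkpoint name is an assumption, not necessarily the one Mocheg uses.
    """
    from sentence_transformers import CrossEncoder  # imported lazily: heavy dependency

    model = CrossEncoder(checkpoint)
    return model.predict  # predict() takes a list of (query, passage) pairs
```

For example, rerank(claim, passages, load_msmarco_scorer(), top_k=5) would download the checkpoint on first use and return the five highest-scoring passages with their scores.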

z27833009 commented 1 year ago

  1. I tried to run python retrieve_train.py --mode=test --train_config=CROSS_ENCODER, but it kept reporting errors, possibly because the media parameter was missing. After I set --media=txt, that problem seemed to be solved.
  2. As I mentioned above, another error is caused by the missing query_result_txt.csv. I searched through the files and could not find it. Could you tell me where I can find or generate this file?
  3. The last question is about retrieve_similarity_recall.py: how should I change the default path of in_dir? Its default path does not seem to be /data/test.

Barry-Menglong-Yao commented 1 year ago

Sorry for the delay. We were busy with a conference deadline.

  1. Thanks. Yes, you can specify the media argument.
  2. query_result_txt.csv is generated after you train the text biencoder. The corresponding generation code is in https://github.com/Barry-Menglong-Yao/misinformation_detection/blob/65e50107be3c444ae740d48df56f45c58f296014/retrieval/eval/evaluator.py#L55
  3. in_dir is the directory where you put the corresponding files. For example, if your file is data/test/Corpus2_for_retrieval.csv, then in_dir=data/test, not /data/test. If the path is still not correct, you can try debugging it.
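
The difference between data/test and /data/test is that the first is resolved relative to the current working directory, while the second starts at the filesystem root. A small sketch for checking this before running the script is shown here; check_in_dir is a hypothetical helper, and Corpus2_for_retrieval.csv is used only because it is the example file name in the reply above.

```python
from pathlib import Path


def check_in_dir(in_dir, expected_file):
    """Return the resolved path of expected_file inside in_dir, or raise if it is missing."""
    candidate = (Path(in_dir) / expected_file).resolve()
    if not candidate.is_file():
        raise FileNotFoundError(
            f"{expected_file} not found under {in_dir!r} (resolved to {candidate})"
        )
    return candidate


if __name__ == "__main__":
    # A relative in_dir is resolved against the directory the script is launched from.
    print(Path("data/test").is_absolute())   # False
    print(Path("/data/test").is_absolute())  # True on POSIX systems
```

Calling check_in_dir("data/test", "Corpus2_for_retrieval.csv") from the repository root either returns the absolute path that will be used or raises an error showing where it looked.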

OPilgrim commented 11 months ago

I've also come across situations where files don't exist:

FileNotFoundError: [Errno 2] No such file or directory: '/data/Projects/Mocheg/data/images/00017-proof-06-GettyImages-1137888397.jpg'

when I run

python retrieve_similarity_recall.py --bi_encoder_checkpoint=/data/Projects/Mocheg/retrieval/output/runs_3/00005-train_bi-encoder-multi-qa-MiniLM-L6-cos-v1-2023-11-24_14-10-21 --image_encoder_checkpoint=/data/Projects/Mocheg/retrieval/output/runs_3/00004-train_bi-encoder-clip-ViT-B-32-2023-11-24_11-18-44 --media=img_txt --top_k=10 --csv_out_dir=/data/Projects/Mocheg/data/test/retrieval/retrieval_result_10.csv

Barry-Menglong-Yao commented 11 months ago

@OPilgrim There is no 00017-proof-06-GettyImages-1137888397.jpg, but there is 00017-530390-06-GettyImages-1137888397.jpg. Could you debug the code to see why it searches for "00017-proof-06-GettyImages-1137888397.jpg"? Could you also share the complete error traceback, including the function in which the error occurred? I cannot reproduce the issue locally.
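
One way to investigate a mismatch like this is to list the files in the image folder whose names are closest to the one the code asked for. The sketch below uses Python's standard difflib module for that; closest_filenames is a hypothetical helper and not part of the Mocheg code.

```python
import difflib
import os


def closest_filenames(img_folder, wanted_name, n=3):
    """Return up to n file names in img_folder that most resemble wanted_name."""
    names = os.listdir(img_folder)
    return difflib.get_close_matches(wanted_name, names, n=n, cutoff=0.6)
```

On a folder with many images the comparison can take a while, since every file name is compared against the requested one.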

OPilgrim commented 11 months ago

The problem occurred when retrieving images. At first, I thought there was a setting content="proof", but when I printed the value of content, it only took the values "all" and "img", so it is not clear where the "proof" in the image name comes from. Here is the full output:

Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 96610/96610 [17:46<00:00, 90.59it/s]
0/2001: 0.3024411201477051, 0.2415764480829239
100/2001: 0.5788431167602539, 0.4707832932472229
200/2001: 0.5831706523895264, 0.4668898582458496
300/2001: 0.5832053422927856, 0.4636369049549103
400/2001: 0.5863810181617737, 0.48484358191490173
500/2001: 0.5832879543304443, 0.5186882019042969
600/2001: 0.5846362113952637, 0.5425591468811035
700/2001: 0.586365818977356, 0.5623708963394165
800/2001: 0.5859039425849915, 0.578498125076294
900/2001: 0.588605523109436, 0.5907866954803467
1000/2001: 0.5892195701599121, 0.6003651022911072
1100/2001: 0.5880477428436279, 0.5923115015029907
1200/2001: 0.5861196517944336, 0.6010985970497131
1300/2001: 0.5832314491271973, 0.603545069694519
1400/2001: 0.5815831422805786, 0.606799304485321
1500/2001: 0.5794385075569153, 0.6095340847969055
1600/2001: 0.5797507762908936, 0.6133111119270325
1700/2001: 0.5788807272911072, 0.6144909858703613
1800/2001: 0.5782501101493835, 0.6066485643386841
1900/2001: 0.5776910185813904, 0.5988103747367859
2000/2001: 0.5785987377166748, 0.5923917889595032
0.578887939453125, 0.5926878452301025,0.5857065916061401
ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.
Images: 122246
0: 0.0, 0.0
Traceback (most recent call last):
  File "/data/Projects/Mocheg/retrieve_similarity_recall.py", line 48, in <module>
    main()
  File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/data/Projects/Mocheg/retrieve_similarity_recall.py", line 43, in main
    training_loop.training_loop(args,rank=0)
  File "/data/Projects/Mocheg/retrieval/training/training_loop.py", line 28, in training_loop
    image_retrieve(args,relevant_document_img_list,dataloader,saver)
  File "/data/Projects/Mocheg/retrieval/training/training_loop.py", line 116, in image_retrieve
    cur_precision,cur_recall=scorer.precision_recall_by_similarity(semantic_results,relevant_document_img_list,img_evidence_list,image_corpus)
  File "/data/Projects/Mocheg/retrieval/utils/metrics.py", line 67, in precision_recall_by_similarity
    retrieved_document_list,evidence_document_list=get_images(retrieved_document_name_list,evidence_document_name_list,img_folder)
  File "/data/Projects/Mocheg/retrieval/utils/metrics.py", line 76, in get_images
    evidence_document_list=[Image.open(os.path.join(img_folder,filepath)) for filepath in evidence_document_name_list]
  File "/data/Projects/Mocheg/retrieval/utils/metrics.py", line 76, in <listcomp>
    evidence_document_list=[Image.open(os.path.join(img_folder,filepath)) for filepath in evidence_document_name_list]
  File "/data/miniconda3/envs/mmfc/lib/python3.9/site-packages/PIL/Image.py", line 3243, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/data/Projects/Mocheg/data/images/00017-proof-06-GettyImages-1137888397.jpg'

Barry-Menglong-Yao commented 9 months ago

Sorry for the late reply. After checking the dataset, we found that this specific image was missing from the released dataset. Sorry for the inconvenience. We have updated the dataset (mocheg_with_tweet_2023_03.tar.gz). Could you re-download the updated dataset? In the updated dataset, you should be able to find the image "00017-proof-06-GettyImages-1137888397.jpg" under the "mocheg/images" folder. Thanks!
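
After re-downloading, a quick check along the following lines can confirm that the expected images are present before running retrieval again. This is a minimal sketch: missing_images is a hypothetical helper, and the folder path mocheg/images is taken from the reply above.

```python
import os


def missing_images(img_folder, expected_names):
    """Return the names from expected_names that are not files in img_folder."""
    return [
        name for name in expected_names
        if not os.path.isfile(os.path.join(img_folder, name))
    ]


if __name__ == "__main__":
    wanted = ["00017-proof-06-GettyImages-1137888397.jpg"]
    print(missing_images("mocheg/images", wanted))  # an empty list means nothing is missing
```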