facebookresearch / tart

Code and model release for the paper "Task-aware Retrieval with Instructions" by Asai et al.

Missing Passage Embeddings for Cross-task Cross-domain #11

Open ParishadBehnam opened 1 year ago

ParishadBehnam commented 1 year ago

Dear Akari, Thank you for the great work and the detailed documentation on TART. I want to reproduce the cross-task cross-domain results. You said you have uploaded all the passage embeddings to Google Drive. However, I can only find the embeddings for Arguana, Climate-Fever, DBPedia, NQ, SciDocs (this directory is empty), Tourches, and Trec-Covid. I am looking for the passage embeddings of AmbigQA, WikiQA, SciFact, GooAQ-Technical, LinkSO-Python, and CodeSearchNet-Python for cross-task retrieval. Could you please provide them?

Thank you :)

ParishadBehnam commented 1 year ago

Hello again @AkariAsai,

Since I didn't get a response, I tried to run cross-task retrieval from scratch. However, I don't get the same results as the last row of Table 4 in the paper. Could you please tell me whether I am using an incorrect argument in the following steps?

Thank you :)

Embedding passages (do for all corpora in corss_task_cross_domain_final):

python generate_passage_embeddings.py \
    --model_name_or_path facebook/tart-full-flan-t5-xl \
    --output_dir embeddings/linkso \
    --passages data/corss_task_cross_domain_final/linkso_py/corpus.jsonl \
    --shard_id 0 --num_shards 1
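Repeating the embedding step "for all corpora" by hand is easy to get wrong, so a small command builder can help. The following is only a sketch: the function name `embedding_commands` and the `embeddings/<corpus>` output layout are assumptions made for illustration, while the flags and corpus directory names are the ones that appear in the commands in this thread. It prints the commands as a dry run rather than executing them.

```python
import shlex

# Corpus directory names as they appear in the commands in this thread.
CORPORA = [
    "nq", "scifact", "gooaq_med", "linkso_py",
    "ambig", "wikiqa", "gooaq_technical", "codesearch_py",
]


def embedding_commands(model, data_root, out_root, corpora=CORPORA,
                       num_shards=1, corpus_file="corpus.jsonl"):
    """Build one generate_passage_embeddings.py command per corpus and shard.

    Hypothetical helper: it only assembles argument lists from the flags
    used in this thread; it does not run anything.
    """
    commands = []
    for name in corpora:
        for shard_id in range(num_shards):
            commands.append([
                "python", "generate_passage_embeddings.py",
                "--model_name_or_path", model,
                "--output_dir", f"{out_root}/{name}",  # assumed layout
                "--passages", f"{data_root}/{name}/{corpus_file}",
                "--shard_id", str(shard_id),
                "--num_shards", str(num_shards),
            ])
    return commands


if __name__ == "__main__":
    # Model name copied from the command above; swap in whichever
    # checkpoint you actually use as the retriever.
    model = "facebook/tart-full-flan-t5-xl"
    for cmd in embedding_commands(model,
                                  "data/corss_task_cross_domain_final",
                                  "embeddings"):
        # Dry run: print only. Use subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Note that some corpora may use a different file name (the evaluation command in this thread refers to `corpus_new.jsonl` for `codesearch_py`), which is why `corpus_file` is a parameter.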

Running cross-task:

python eval_cross_task.py \
    --passages data/corss_task_cross_domain_final/nq/corpus.jsonl data/corss_task_cross_domain_final/scifact/corpus.jsonl data/corss_task_cross_domain_final/gooaq_med/corpus.jsonl data/corss_task_cross_domain_final/linkso_py/corpus.jsonl data/corss_task_cross_domain_final/ambig/corpus.jsonl data/corss_task_cross_domain_final/wikiqa/corpus.jsonl data/corss_task_cross_domain_final/gooaq_technical/corpus.jsonl data/corss_task_cross_domain_final/codesearch_py/corpus_new.jsonl \
    --passages_embeddings "embeddings/linkso/passages_*" "embeddings/ambig/passages_*" "embeddings/scifact/passages_*" "embeddings/nq/passages_*" "embeddings/gooaq/passages_*" "embeddings/codesearch/passages_*" "embeddings/wikiqa/passages_*" \
    --qrels data/corss_task_cross_domain_final/linkso/qrels/test_new.tsv \
    --output_dir logs/linkso_results \
    --model_name_or_path facebook/tart-full-flan-t5-xl \
    --projection_size 1024
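When results do not match, one cheap thing to rule out is an embedding glob that matches no files, which would leave that corpus out of the pooled index. The check below is a minimal sketch and not part of the TART code: `count_glob_matches` and `missing_patterns` are hypothetical helper names, and the example patterns assume the `passages_*` file naming used in the reply further down this thread.

```python
import glob


def count_glob_matches(patterns):
    """Map each glob pattern to the number of files it matches."""
    return {pattern: len(glob.glob(pattern)) for pattern in patterns}


def missing_patterns(patterns):
    """Return the patterns that match no file at all."""
    return [p for p, n in count_glob_matches(patterns).items() if n == 0]


if __name__ == "__main__":
    # Illustrative patterns; replace them with the ones you pass to
    # --passages_embeddings.
    patterns = [
        "embeddings/linkso/passages_*",
        "embeddings/ambig/passages_*",
        "embeddings/scifact/passages_*",
    ]
    for pattern, n_files in count_glob_matches(patterns).items():
        print(f"{n_files:4d} file(s)  {pattern}")
    print("no matches:", missing_patterns(patterns))
```

It may also be worth comparing the number of corpora passed to `--passages` with the number of embedding patterns, since the two lists are maintained by hand.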

hanseokOh commented 1 year ago

Hello ParishadBehnam, I reproduced the results in the last row of Table 4 (X2 setup) using the following arguments (although some results differ). As I understand it, you should first retrieve the first-stage results with the retriever (Contriever-MSMARCO), which is also the model used to generate the passage embeddings, and then use 'tart-full-flan-t5-xl' as a reranker in the second stage.

CKPT=./ckpt/tart-dual-contriever-msmarco
CE_CKPT=facebook/tart-full-flan-t5-xl

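# OUTPUT_DIR_NAME, DATA, and PROMPT are assumed to be set beforehand.
# Run once per shard, i.e. for i in 0..7, since --num_shards is 8.
python generate_passage_embeddings.py --model_name_or_path $CKPT --output_dir ${OUTPUT_DIR_NAME}/embeddings/${DATA} \
      --passages ../../../data/corss_task_cross_domain_final/${DATA}/corpus.jsonl --shard_id ${i}  --num_shards 8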
python eval_cross_task.py \
    --passages ../../../data/corss_task_cross_domain_final/nq/corpus.jsonl ../../../data/corss_task_cross_domain_final/scifact/corpus.jsonl ../../../data/corss_task_cross_domain_final/linkso_py/corpus.jsonl ../../../data/corss_task_cross_domain_final/ambig/corpus.jsonl ../../../data/corss_task_cross_domain_final/wikiqa/corpus.jsonl ../../../data/corss_task_cross_domain_final/gooaq_technical/corpus.jsonl ../../../data/corss_task_cross_domain_final/codesearch_py/corpus.jsonl \
    --passages_embeddings "${OUTPUT_DIR_NAME}/embeddings/nq/passages_*" "${OUTPUT_DIR_NAME}/embeddings/scifact/passages_*" "${OUTPUT_DIR_NAME}/embeddings/linkso_py/passages_*" "${OUTPUT_DIR_NAME}/embeddings/ambig/passages_*" "${OUTPUT_DIR_NAME}/embeddings/wikiqa/passages_*" "${OUTPUT_DIR_NAME}/embeddings/gooaq_technical/passages_*" "${OUTPUT_DIR_NAME}/embeddings/codesearch_py/passages_*" \
    --qrels ../../../data/corss_task_cross_domain_final/${DATA}/qrels/test.tsv \
    --output_dir ${OUTPUT_DIR_NAME}/pooled-${DATA} \
    --model_name_or_path $CKPT \
    --data ../../../data/corss_task_cross_domain_final/${DATA}/queries.jsonl \
    --prompt  "${PROMPT}" \
    --ce_model $CE_CKPT \
    --ce_prompt "${PROMPT}"
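To make the two-stage setup described above concrete, here is a toy sketch of retrieve-then-rerank. It is not the TART implementation: the hand-written vectors stand in for dual-encoder embeddings, and `overlap_score` is a deliberately simple stand-in for a cross-encoder such as the one passed via `--ce_model`. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Sequence[float],
             passage_vecs: Dict[str, Sequence[float]],
             k: int) -> List[str]:
    """First stage: rank all passages by inner product, keep the top k ids."""
    ranked = sorted(passage_vecs,
                    key=lambda pid: dot(query_vec, passage_vecs[pid]),
                    reverse=True)
    return ranked[:k]


def rerank(query: str,
           candidates: List[str],
           passages: Dict[str, str],
           score: Callable[[str, str], float]) -> List[str]:
    """Second stage: re-score only the retrieved candidates."""
    return sorted(candidates,
                  key=lambda pid: score(query, passages[pid]),
                  reverse=True)


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: count of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


if __name__ == "__main__":
    passages = {
        "p1": "sort a list in python",
        "p2": "python list comprehension",
        "p3": "covid vaccine trial",
    }
    # Hand-written 2-d vectors standing in for dual-encoder embeddings.
    passage_vecs = {"p1": [0.9, 0.1], "p2": [1.0, 0.0], "p3": [0.0, 1.0]}
    query, query_vec = "how to sort a python list", [1.0, 0.0]

    first_stage = retrieve(query_vec, passage_vecs, k=2)
    final = rerank(query, first_stage, passages, overlap_score)
    print(first_stage)  # ['p2', 'p1']
    print(final)        # ['p1', 'p2']
```

The point of the split is that the expensive scorer only sees the `k` candidates from the first stage, so the embeddings must come from the retriever checkpoint rather than from the reranker.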