castorini / pygaggle

a gaggle of deep neural architectures for text ranking and question answering, designed for Pyserini
http://pygaggle.ai/
Apache License 2.0

How to use monot5-large-msmarco on huggingface #226

Closed Elfsong closed 2 years ago

Elfsong commented 2 years ago

I found that you have uploaded 'monot5-large-msmarco' on Hugging Face. You said, "For more details on how to use it, check pygaggle.ai". However, I cannot find 'pygaggle.ai'...

Can you share a tutorial about how to use this model? Thank you.

rodrigonogueira4 commented 2 years ago

Thanks for pointing this out. I added instructions to the model card.

If you want the best zero-shot performance (i.e., on datasets different from MS MARCO), I suggest using the models trained for only 10k steps (e.g., monot5-base-msmarco-10k or monot5-large-msmarco-10k).
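
For reference, here is a rough sketch of how one of these checkpoints can be scored directly with transformers (the model card snippet is the authoritative version); it assumes monoT5's "Query: ... Document: ... Relevant:" prompt and reads relevance off the "true"/"false" token logits:

# Rough, unofficial sketch: scoring one query-document pair with monoT5
# through plain transformers.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'castorini/monot5-base-msmarco-10k'
tokenizer = T5Tokenizer.from_pretrained('t5-base')  # assumption: the checkpoint reuses the stock T5 tokenizer
model = T5ForConditionalGeneration.from_pretrained(model_name).eval()

query = 'who proposed the geocentric theory'
document = 'The geocentric model places Earth at the center of the universe ...'
prompt = f'Query: {query} Document: {document} Relevant:'

inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=512)
# Only the logits at the first decoded position are needed to compare "true" vs. "false".
decoder_input_ids = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.long)
with torch.no_grad():
    logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]

true_id = tokenizer.encode('true')[0]    # assumes 'true'/'false' are single sentencepiece tokens
false_id = tokenizer.encode('false')[0]
score = torch.nn.functional.log_softmax(logits[[false_id, true_id]], dim=0)[1].item()
print(score)  # higher = more relevant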

Elfsong commented 2 years ago

@rodrigonogueira4 Thank you for your reply.

To recreate the results of vert5erini, I followed the README instructions a few days ago. However, the performance on the SciFact dataset is not as good as expected. Here is my script; could you help me take a look?

from collections import defaultdict

from tqdm import tqdm

from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

# Retrieve the top 20 abstracts per claim with BM25
top_k = 20

retrieval_results = defaultdict(set)

for data in tqdm(scifact_claims["train"]):
    claim = data['claim']
    claim_id = data['id']
    for hit in searcher.search(claim, top_k):
        retrieval_results[claim_id].add(int(hit.docid))

evaluate(ground_truth_results, retrieval_results)
# 100%|██████████| 1261/1261 [00:00<00:00, 1355.01it/s]
# Hit one: 0.9802
# Hit all: 0.9716

# Loading the MonoT5 model
ranker = MonoT5()

# Rerank the above results with monoT5 and keep the top 3 abstracts
top_k = 3

final_retrieval_results = defaultdict(set)

for claim_id in tqdm(retrieval_results):
    abstract_candidates = list(retrieval_results[claim_id])

    claim = scifact_claim_dict[claim_id]
    claim_content = preprocess_sentence(claim["claim"])
    query = Query(claim_content)

    # Construct text
    texts = list()

    for doc_id in abstract_candidates:
        doc = scifact_corpus_dict[doc_id]
        doc_content = doc['title'] + ' '.join(doc['abstract'])
        doc_content = preprocess_sentence(doc_content)
        texts += [Text(doc_content, {'doc_id': doc_id}, 0)]

    ranked_results = ranker.rerank(query, texts)

    for ranked_doc in ranked_results[:top_k]:
        doc_id = ranked_doc.metadata["doc_id"]
        final_retrieval_results[claim_id].add(doc_id)
# 100%|██████████| 809/809 [04:44<00:00,  2.84it/s]

evaluate(ground_truth_results, final_retrieval_results)
# Hit one: 0.4722
# Hit all: 0.4561

From vert5erini's README, I see you got much higher results for the retrieval stage of the pipeline:

Hit one: 0.9567
Hit all: 0.9367

rodrigonogueira4 commented 2 years ago

I think @ronakice might be able to give better advice regarding the SciFact data. In the meantime, try this checkpoint instead of the default one: https://huggingface.co/castorini/monot5-base-msmarco-10k

Here are the steps to replace the default checkpoint with a different one: https://github.com/castorini/pygaggle/#reranking-with-a-different-checkpoint
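
As a rough sketch (assuming the MonoT5 constructor accepts a Hugging Face model identifier; the README link above has the authoritative steps), the swap looks like:

from pygaggle.rerank.transformer import MonoT5

# Sketch: load a non-default checkpoint by name instead of the default one.
ranker = MonoT5('castorini/monot5-base-msmarco-10k')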

Elfsong commented 2 years ago

Thank you so much. I replaced the checkpoint, but the results don't change noticeably.

Elfsong commented 2 years ago

I forgot to sort the output...
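
For anyone who hits the same numbers, this is roughly what was missing (assuming rerank() returns scored Text objects with a .score attribute but does not reorder them):

ranked_results = ranker.rerank(query, texts)
# The missing step: sort by the reranker score (descending) before taking the top-k.
ranked_results = sorted(ranked_results, key=lambda text: text.score, reverse=True)

for ranked_doc in ranked_results[:top_k]:
    final_retrieval_results[claim_id].add(ranked_doc.metadata['doc_id'])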