deepset-ai / haystack

:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Input to FarmRanker model #1258

Closed shrinivasait closed 3 years ago

shrinivasait commented 3 years ago

Question: Can you please tell me what input the ranker model expects and how it is passed in? I searched the web but found nothing on this. It would be helpful if you could show me the proper way to train the FARMRanker.

julian-risch commented 3 years ago

Hi @shrinivasait, a few minutes ago we merged this PR: https://github.com/deepset-ai/haystack/pull/1209 which also adds a SentenceTransformersRanker to Haystack. Maybe one of the models here is helpful for you, so that you don't need to train a new model? https://huggingface.co/cross-encoder

If you would like to train a new model for the FARMRanker, what you are looking for is TextPairClassification within FARM. Here is example code that shows how to train such a model: https://github.com/deepset-ai/FARM/blob/master/examples/text_pair_classification.py The data format is also shown in the example. You need pairs of text: a question and a document text. The label is "1" if the document text is relevant to the query. Otherwise the label is "0". Does that make sense? Happy to help if you have any other questions!
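To make the data format described above concrete, here is a minimal sketch that writes query/document pairs with a binary relevance label to tab-separated text. The example rows are made up, and the column names `text`, `text_b`, and `label` are an assumption; check them against the linked FARM example and adjust them to whatever your processor configuration expects.

```python
import csv
import io

# Hypothetical training examples: (query, document text, label).
# Label "1" means the document is relevant to the query, "0" means it is not.
examples = [
    ("what is haystack?", "Haystack is a framework for building search systems.", "1"),
    ("what is haystack?", "The weather in Berlin is mild in spring.", "0"),
]


def to_tsv(rows):
    """Serialize (query, document, label) triples as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Column names are an assumption; match them to your processor settings.
    writer.writerow(["text", "text_b", "label"])
    writer.writerows(rows)
    return buffer.getvalue()


tsv = to_tsv(examples)
print(tsv)
```

Each query usually appears in several rows, paired once with a relevant document and several times with irrelevant ones, so that the model sees both labels.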

shrinivasait commented 3 years ago

If I have a corpus of documents of, say, 1,000 to 10,000 lines, I need to find the similar texts within the given documents and rank them accordingly. Can this be done with the FARMRanker?

shrinivasait commented 3 years ago

And can you please point me to documentation or a file for the SentenceTransformersRanker you mentioned?

clarahohohoho commented 3 years ago

I have the same issue too! I wanted to include the FARMRanker in my pipeline and received the error: "Exception: Input does not have the expected format". I am not training my own FARMRanker; I just want to run inference with one of the pretrained models from Hugging Face. Is that possible, and if so, what is the input format? Thank you!

julian-risch commented 3 years ago

Hi @clarahohohoho, I replied to your issue here: https://github.com/deepset-ai/haystack/issues/1261

julian-risch commented 3 years ago

@shrinivasait If your documents are 1,000 to 10,000 lines, would you like to find similar shorter passages within these documents? Is this open issue here what you are looking for? https://github.com/deepset-ai/haystack/issues/1091

Here is a link to the SentenceTransformersRanker with some example code: https://github.com/deepset-ai/haystack/blob/dbb9efbd39b1acd136a62e54a5f7beefe9bb4fb5/haystack/ranker/sentence_transformers.py#L30 However, I am not sure whether I have understood your use case completely and whether SentenceTransformersRanker and FARMRanker can help.

shrinivasait commented 3 years ago

See, my question is simple; let me expand on it. First, given a query, I need to sort out the documents related to that query. Second, I need to check the similarity of one document to the other given documents. Can you please help me with this?

julian-risch commented 3 years ago

For the first step, you can definitely use FARMRanker or SentenceTransformersRanker in the following way:

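# Imports assume the Haystack 0.x module layout that was current for this thread.
from haystack.pipeline import Pipeline
from haystack.ranker import SentenceTransformersRanker
from haystack.retriever.sparse import ElasticsearchRetriever

...
retriever = ElasticsearchRetriever(document_store=document_store)
# More cross-encoder models: https://huggingface.co/cross-encoder
ranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")
# Alternative: ranker = FARMRanker(model_name_or_path="nboost/pt-tinybert-msmarco")  # more models: https://huggingface.co/nboost
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])  # the retriever fetches candidates
p.add_node(component=ranker, name="Ranker", inputs=["ESRetriever"])  # the ranker re-orders them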

Note that the Ranker is used for document re-ranking, as described in the documentation: https://haystack.deepset.ai/docs/latest/rankermd#Ranker. Before re-ranking, there is still a retriever, which could be an ElasticsearchRetriever as in the example here, or it could be, for example, a DPR Retriever or a TF-IDF Retriever.
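The retrieve-then-re-rank idea can be illustrated without any Haystack code. The sketch below is a toy, not how Haystack implements it: `retrieve`, `rerank`, and `toy_score` are hypothetical helpers, the word-overlap retriever stands in for something like Elasticsearch, and `toy_score` stands in for a cross-encoder model.

```python
def retrieve(query, documents, top_k):
    """Stage 1: cheap candidate selection by counting shared distinct words."""
    query_terms = set(query.lower().split())
    scored = [(len(query_terms & set(doc.lower().split())), doc) for doc in documents]
    # Drop documents that share no word with the query.
    candidates = [item for item in scored if item[0] > 0]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in candidates[:top_k]]


def toy_score(query, doc):
    """Stand-in for a cross-encoder: fraction of document words found in the query."""
    query_terms = set(query.lower().split())
    doc_terms = doc.lower().split()
    return sum(term in query_terms for term in doc_terms) / len(doc_terms)


def rerank(query, candidates, score_fn):
    """Stage 2: re-order only the retrieved candidates with the costlier scorer."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)


docs = [
    "a ranker can sort documents",
    "how to cook with a pan and how to sort a drawer with a lid",
    "bananas are yellow",
]
query = "how to sort documents with a ranker"
candidates = retrieve(query, docs, top_k=2)
ranked = rerank(query, candidates, toy_score)
print(candidates)
print(ranked)
```

In this toy data the first stage puts the long cooking sentence first because it shares more distinct words with the query, and the second stage moves the ranker sentence to the top. The expensive scorer only ever sees the `top_k` candidates, which is what keeps re-ranking affordable on a large corpus.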

julian-risch commented 3 years ago

However, for your second use case you need a different approach. The Ranker models are trained to score the relevance of a document to a query, not to compare the similarity of two documents. I think what makes sense here is to use the EmbeddingRetriever: https://haystack.deepset.ai/docs/latest/retrievermd#Embedding-Retrieval
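As a rough illustration of document-to-document similarity with embeddings, the sketch below ranks documents by cosine similarity to a chosen document. It does not use the EmbeddingRetriever API: the three-dimensional vectors are made-up placeholders for embeddings that a real model would produce, and `most_similar` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; a real embedding model returns much longer vectors.
embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}


def most_similar(doc_id, embeddings):
    """Rank all other documents by cosine similarity to the given document."""
    target = embeddings[doc_id]
    scores = [
        (other_id, cosine_similarity(target, vector))
        for other_id, vector in embeddings.items()
        if other_id != doc_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("doc_a", embeddings)
print(ranking)
```

Because every document is embedded once and then compared by vector arithmetic, this approach scales to comparing one document against many others, which a query-document ranker is not trained for.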

shrinivasait commented 3 years ago

OK, thank you for answering. It helped a lot.

julian-risch commented 3 years ago

You're welcome! I will close this issue for now. If there is still something unclear about FARMRanker, feel free to reopen it.