etalab-ia / piaf-ml

PIAF v2.0 repo for ML development. The main purpose of this repo is to automatically find the best configuration for a partner organisation's QA pipeline.
MIT License

Distiluse vs distilbert models #85

Closed · psorianom closed this issue 3 years ago

psorianom commented 3 years ago

Problem description: When doing the cache-like operation, the EmbeddingRetriever model was changed from distiluse-base-multilingual-cased (the "use" in distiluse stands for Universal Sentence Encoder) to distilbert-base-multilingual-cased, in order to be able to track the model version with HF's model hub. The problem is that this model seems less performant and, more importantly, distilbert does not seem to be a sentence-bert model at all.

Solution description: We should either find a workaround to keep using the distiluse language model, or keep going with distilbert, accept the small performance downgrade, and change the nomenclature used in our code.

Who can help? @Rob192

Note: distiluse has 512-dimensional embeddings and distilbert has 768-dimensional embeddings. See https://github.com/UKPLab/sentence-transformers/issues/247. Also, I am actually not aware of the performance difference between distiluse and distilbert :/
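
For reference, a minimal sketch to check the embedding sizes, assuming a recent sentence-transformers version (loading a plain HF checkpoint like distilbert this way silently wraps it with generic mean pooling, which is not the same as a properly trained sentence-bert model):

from sentence_transformers import SentenceTransformer

for name in ("distiluse-base-multilingual-cased",
             "distilbert-base-multilingual-cased"):
    model = SentenceTransformer(name)
    # encode returns a numpy array of shape (n_sentences, embedding_dim)
    emb = model.encode(["Quelle est la capitale de la France ?"])
    print(name, emb.shape)  # expected: (1, 512) for distiluse, (1, 768) for distilbert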

Rob192 commented 3 years ago

I understood your doubts come from the small drop in performance you observed on tiny.json when you updated the tests as part of #86.
We should evaluate on a larger dataset which model is the best to use, then make the appropriate change you proposed. In addition, we should update the tests for TitleQAPipeline.

psorianom commented 3 years ago

I just tested the performance of distilbert-base-multilingual-cased (current model) vs distiluse-base-multilingual-cased (previously used model) with the DILA dataset. Choosing one or the other is quite a manual process; basically I have to (see the sketch after this list):

  1. Change the dimension of the ES vectors (from 768 to 512 if using distiluse-base-multilingual-cased).
  2. Change the model_format parameter (in the instantiation of EmbeddingRetriever) to sentence_transformers if using distiluse-base-multilingual-cased.
  3. Change the embedding_model parameter to distiluse-base-multilingual-cased.
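
A minimal sketch of those three changes, assuming the haystack import paths and API of that era (any parameter not listed above is illustrative):

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.dense import EmbeddingRetriever

# 1. ES vectors sized for distiluse (512) instead of distilbert (768)
document_store = ElasticsearchDocumentStore(embedding_dim=512)

retriever = EmbeddingRetriever(
    document_store=document_store,
    # 3. point embedding_model back to the previously used model
    embedding_model="distiluse-base-multilingual-cased",
    # 2. model_format must be sentence_transformers for distiluse
    model_format="sentence_transformers",
)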

The results are quite surprising (quite a slapdash procedure):

parameters = {
    "k_retriever": [1],
    "k_title_retriever" : [1], # must be present, but only used when retriever_type == title_bm25
    "k_reader_per_candidate": [20],
    "k_reader_total": [5],
    "reader_model_version": ["053b085d851196110d7a83d8e0f077d0a18470be"],
    "retriever_model_version": ["1a01b38498875d45f69b2a6721bf6fe87425da39"],
    "retriever_type": ["sbert"],  # Can be bm25, sbert, dpr, title or title_bm25
    "squad_dataset": ["./clients/dila/knowledge_base/squad.json"],
    "filter_level": [None],
    "preprocessing": [False],
    "boosting" : [1], #default to 1
    "split_by": ["word"],  # Can be "word", "sentence", or "passage"
    "split_length": [1000],
    "experiment_name": ["test"]
}

Reader Accuracy: 0.29545454545454547 with distiluse-base-multilingual-cased (previously used)
Reader Accuracy: 0.00909090909090909 with distilbert-base-multilingual-cased (currently used)

Based on these results (to be explored further), I suggest going back to the previously used model (distiluse), checking the sentence-transformers library to see if they handle model versions, and exposing that in haystack. Also, we can try out this model already on HF's hub but fine-tuned for Estonian :shrug: https://huggingface.co/kiri-ai/distiluse-base-multilingual-cased-et (it is also quite horrible with this Estonian model: ~0.06 Acc).

psorianom commented 3 years ago

I will check if we can version distiluse on sentence-bert. If not, let's go back to sbert without versioning.
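
For what it's worth, a sketch of what version pinning looks like with plain transformers, using the retriever_model_version hash from the config above (whether haystack forwards such a revision to sentence-transformers is exactly what needs checking):

from transformers import AutoModel, AutoTokenizer

revision = "1a01b38498875d45f69b2a6721bf6fe87425da39"  # commit hash on HF's model hub
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased", revision=revision)
model = AutoModel.from_pretrained("distilbert-base-multilingual-cased", revision=revision)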

guillaumecherel commented 3 years ago

@Rob192 @psorianom I noticed that the sbert retriever in the file retriever_eval_squad.py uses distiluse while retriever_reader_eval_squad.py uses distilbert. Should we keep one over the other, or keep them different in each file?

psorianom commented 3 years ago

Hi @guillaumecherel, I am on it. Should be fixed soon :) @guillim, this code is run at deployment, right? https://github.com/etalab-ia/piaf-ml/blob/854b09515f80bbf966ff756a3789fc91cc107929/deployment/roles/prepare_data/templates/insert_data.py#L114

guillim commented 3 years ago

yes, when we want to inject the knowledge base into ES

psorianom commented 3 years ago

Is it possible to deploy a PR? Because I have to modify this file and I don't want to mess up production :)

guillim commented 3 years ago

> Is it possible to deploy a PR? Because I have to modify this file and I don't want to mess up production :)

What do you mean by deploy a PR? Like deploy in production from a branch? That's possible, yes.

psorianom commented 3 years ago

yes, I mean that. I believe this PR https://github.com/etalab-ia/piaf-ml/pull/149 is close to being merged, but I guess it would be wise to check that the changes made to insert_data.py do not break the deployment.

guillim commented 3 years ago

Sounds good to me. Tried it for DILA and it works perfectly fine.