deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Recommendation for German DPR / German Retriever/Reader Pair #198

Closed JulianGerhard21 closed 4 years ago

JulianGerhard21 commented 4 years ago

Hi guys,

first of all: awesome work with FARM and Haystack. I am currently exploring the possibilities in a private project - thanks for actually providing this toolset!

My situation is the following:

I have three "datasets", which are basically "large", domain-specific documents. The smallest has ~500 sentences, consisting of ~12500 tokens; the biggest has ~8500 sentences, consisting of ~130000 tokens. They aren't labeled in any form and are far from being formatted SQuAD-style. For now, I decided to simply run two of your recommended scenarios on those documents:

Scenario 1

  1. Skipping the preprocessing part for the moment
  2. Using an ElasticsearchRetriever with default configuration
  3. Using a FARMReader with deepset/bert-large-uncased-whole-word-masking-squad2

Scenario 2

  1. Skipping the preprocessing part for the moment
  2. Using a DensePassageRetriever with dpr-bert-base-nq
  3. Using a FARMReader with deepset/bert-large-uncased-whole-word-masking-squad2
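
For reference, here is a minimal sketch of how these two setups can be wired together; it assumes the Haystack 0.x API referenced in this thread, so import paths and parameter names may differ in other versions.

# Sketch of the two scenarios above, assuming the Haystack 0.x API of this era.
# Import paths and parameter names may differ in later versions
# (e.g. the document store later moved to haystack.document_store.elasticsearch).
from haystack import Finder
from haystack.database.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.retriever.dense import DensePassageRetriever
from haystack.reader.farm import FARMReader

document_store = ElasticsearchDocumentStore(host="localhost", index="document")

# Scenario 1: sparse BM25 retrieval via Elasticsearch
sparse_retriever = ElasticsearchRetriever(document_store=document_store)

# Scenario 2: dense retrieval with DPR
dense_retriever = DensePassageRetriever(document_store=document_store,
                                        embedding_model="dpr-bert-base-nq")

# Reader shared by both scenarios
reader = FARMReader(
    model_name_or_path="deepset/bert-large-uncased-whole-word-masking-squad2",
    use_gpu=False,
)

# Swap in dense_retriever to run Scenario 2
finder = Finder(reader=reader, retriever=sparse_retriever)
prediction = finder.get_answers(question="Wer ist der Autor des Dokuments?",  # placeholder German query
                                top_k_retriever=10, top_k_reader=5)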

The results are surprisingly good, given that everything was designed to work with the English language. However, I currently observe that the ElasticsearchRetriever performs better than the DPR retriever - which makes sense considering BM25 / tokenization.

I am convinced that it would be worth trying to adapt everything / fine-tune the steps for the German language. Therefore, I have the following questions:

  1. Are you aware of any pretrained German SQuAD-like transformer that is publicly available?
  2. If not, would you recommend using a multilingual model (if so, which one)?
  3. Can you briefly describe how to train my own DPR? German training data itself won't be a problem.
  4. Any recommendations in general for adapting your system to another language?

Thanks in advance - of course, I will share the results / process with the community.

Kind regards Julian

tholor commented 4 years ago

Hey @JulianGerhard21 ,

Glad that you like it and happy to hear that you get some decent results for German even with an "English Pipeline"!

To your questions:

  1. Are you aware of any pretrained German SQuAD-like transformer that is publicly available?

No, but we are already working on one ;). After some experiments on an auto-translated SQuAD dataset that gave okayish results, we'll create some high-quality human annotations to boost the model performance.

  2. If not, would you recommend using a multilingual model (if so, which one)?

Possible. It will be better than the purely English one that you are using right now. I would recommend training an XLM-R model.
Either train one yourself on SQuAD 2 (e.g. via this script in FARM) or try one of the existing models out there.
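
To illustrate, here is a hedged sketch of loading such a multilingual model in the FARMReader; the checkpoint name is the XLM-R model linked further down in this thread, and any other multilingual QA checkpoint from the Hugging Face hub should load the same way.

from haystack.reader.farm import FARMReader

# Multilingual XLM-R model fine-tuned on SQuAD 2.0 (linked later in this thread).
# Any other multilingual QA checkpoint from the Hugging Face hub can be plugged in here.
reader = FARMReader(
    model_name_or_path="deepset/xlm-roberta-large-squad2",
    use_gpu=False,
)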

  3. Can you briefly describe how to train my own DPR? German training data itself won't be a problem.

So far, we only have DPR inference implemented in Haystack. For training, you would need to use the original code base. Having said this, we believe the training there is a bit complicated and we'll work on a simpler way that you can ideally run directly from Haystack. In any case, you will need training samples of query + "related passage". Not sure if you already have this available?
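
To make the required training samples concrete, here is a hedged sketch of one example in the JSON format used by the original DPR code base; the field names follow the facebookresearch/DPR repository and should be double-checked against the version you use, and the German texts are placeholders.

# One DPR training sample in the format used by the original facebookresearch/DPR
# code base. Field names should be verified against the DPR version in use;
# the German texts are placeholders.
dpr_training_sample = {
    "question": "Wann wurde das Unternehmen gegründet?",
    "answers": ["1997"],
    "positive_ctxs": [
        {"title": "Geschichte", "text": "Das Unternehmen wurde 1997 in München gegründet."}
    ],
    # random passages that do not contain the answer
    "negative_ctxs": [],
    # lexically similar passages that do not contain the answer
    "hard_negative_ctxs": [
        {"title": "Produkte", "text": "Ein thematisch ähnlicher Absatz ohne die gesuchte Antwort."}
    ],
}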

  4. Any recommendations in general for adapting your system to another language?

If you use BM25, there are additional options in Elasticsearch to adjust stop words, stemming, etc. that can be useful for retrieval. Also, you might be interested in multilingual QA datasets that usually don't have enough samples for training but can be helpful for evaluation: XQuAD, MLQA
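
As one hedged example of such language-specific tuning, an Elasticsearch index for German documents could use the built-in "german" analyzer, which applies German stop words and stemming at index and query time; the index and field names below are placeholders that mirror Haystack's defaults.

from elasticsearch import Elasticsearch

# Create an index whose "text" field uses Elasticsearch's built-in "german"
# analyzer (German stop words, stemming, normalization).
# Index and field names are placeholders.
es = Elasticsearch(hosts=["localhost:9200"])
es.indices.create(
    index="document_de",
    body={
        "mappings": {
            "properties": {
                "text": {"type": "text", "analyzer": "german"},
                "name": {"type": "keyword"},
            }
        }
    },
)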

Hope this helps!

JulianGerhard21 commented 4 years ago

Hi @tholor ,

awesome - thanks for those recommendations. I will go through them in detail. Just one quick comment: could it be that the vocab file for xlmroberta-squadv2 is missing? I am getting an error in one of the transformers tools:

File "C:/Users//PycharmProjects/word_embeddings/projects/huethig/app/app/utils.py", line 69, in create_finder
    reader = FARMReader(model_name_or_path=model_name, use_gpu=False)
  File "C:\Users\\Envs\farm_env\lib\site-packages\haystack\reader\farm.py", line 90, in __init__
    doc_stride=doc_stride, num_processes=num_processes)
  File "C:\Users\\Envs\farm_env\lib\site-packages\farm\infer.py", line 212, in load
    tokenizer = Tokenizer.load(model_name_or_path)
  File "C:\Users\\Envs\farm_env\lib\site-packages\farm\modeling\tokenization.py", line 99, in load
    ret = RobertaTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\\Envs\farm_env\lib\site-packages\transformers\tokenization_utils_base.py", line 1140, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "C:\Users\\Envs\farm_env\lib\site-packages\transformers\tokenization_utils_base.py", line 1287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Users\\Envs\farm_env\lib\site-packages\transformers\tokenization_roberta.py", line 171, in __init__
    **kwargs,
  File "C:\Users\\Envs\farm_env\lib\site-packages\transformers\tokenization_gpt2.py", line 167, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType

Kind regards Julian

JulianGerhard21 commented 4 years ago

Good morning @tholor ,

the last comment was my mistake - I experimented a bit with the different types of readers, and everything worked fine with a Transformers reader.
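
For reference, a hedged sketch of the Transformers-based reader setup; the parameter names are assumptions based on the Haystack version of that era.

from haystack.reader.transformers import TransformersReader

# Reader built on the transformers QA pipeline instead of FARM.
# Parameter names are assumptions for this Haystack version;
# use_gpu is assumed to take a device id, with -1 meaning CPU.
reader = TransformersReader(
    model="deepset/bert-large-uncased-whole-word-masking-squad2",
    tokenizer="deepset/bert-large-uncased-whole-word-masking-squad2",
    use_gpu=-1,
)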

On another note, something came to mind. I am not quite familiar with QA yet, but in general document classification tasks, an unseen sample usually needs to pass through the same preprocessing pipeline as the training data did.

Currently, the architecture consists of the retriever and the finder, and I am wondering how the preprocessing influences the behaviour/quality of those two crucial parts if, let's say, we remove all stop words.

Kind regards Julian

tholor commented 4 years ago

Sorry, I missed your last two comments here @JulianGerhard21!

Currently, the architecture consists of the retriever and the finder, and I am wondering how the preprocessing influences the behaviour/quality of those two crucial parts if, let's say, we remove all stop words.

Stop word removal is usually not a good idea for transformer models as they were trained on natural language (incl. stopwords) and rely on them quite heavily to create meaningful contextual representations. The only preprocessing that I see:

If you have other preprocessing options in mind that would help, let me know :)

tholor commented 4 years ago

@JulianGerhard21 Hope that helped! Please close the issue if you think it is resolved. We will of course give an update once we have a German DPR model or German QA Reader.

Timoeller commented 4 years ago

We also trained a large XLM-R model on SQuAD and evaluated it on German QA data. You can find the details here: https://huggingface.co/deepset/xlm-roberta-large-squad2

tholor commented 4 years ago

Closing this for now. Feel free to re-open if further clarifications are needed :)

sb2202 commented 3 years ago

If you use BM25, there are additional options in Elasticsearch to adjust stop words, stemming, etc. that can be useful for retrieval. Also, you might be interested in multilingual QA datasets that usually don't have enough samples for training but can be helpful for evaluation: XQuAD, MLQA

Dear @tholor, in the case of German documents, how does one modify the Elasticsearch retriever for better results?

tholor commented 3 years ago

Hey @sb2202,

I believe this is a rather general question regarding the tuning of Elasticsearch. I'd suggest these directions:

You will probably find many further options for how to tune it - it's a bit of a rabbit hole ;)
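
As one hedged illustration of this kind of tuning, a custom German analyzer that combines stop-word removal, normalization, and light stemming could be defined like this; all filter, analyzer, index, and field names below are placeholders.

from elasticsearch import Elasticsearch

# Custom analyzer with German stop words, normalization and light stemming.
# All filter/analyzer/index/field names are placeholders.
es = Elasticsearch(hosts=["localhost:9200"])
es.indices.create(
    index="document_de_tuned",
    body={
        "settings": {
            "analysis": {
                "filter": {
                    "german_stop": {"type": "stop", "stopwords": "_german_"},
                    "german_stemmer": {"type": "stemmer", "language": "light_german"},
                },
                "analyzer": {
                    "german_custom": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "german_stop", "german_normalization", "german_stemmer"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {"text": {"type": "text", "analyzer": "german_custom"}}
        },
    },
)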

Hope this helps!