Integrate ORQA and REALM for Open Domain Question Answering

antoniolanza1996 commented 4 years ago

These two systems are very promising in the context of Open Domain Question Answering.

ORQA paper: https://arxiv.org/abs/1906.00300
REALM paper: https://arxiv.org/abs/2002.08909

Describe the solution you'd like They worked similarly to DPR (i.e. with two BERT encoders in the retriever step, one for questions and one for documents). Additional context Considering that the code has been released recently, unfortunately models are not uploaded in HF/Transformers. However, there are already 2 issues which request the models addition (https://github.com/huggingface/transformers/issues/6456 and https://github.com/huggingface/transformers/issues/3497).

┆Issue is synchronized with this Jira Task by Unito

Utomo88 commented 4 years ago

https://ai.googleblog.com/2020/08/realm-integrating-retrieval-into.html

https://github.com/google-research/language/tree/master/language/realm

Timoeller commented 4 years ago

REALM and ORQA are of course very influential papers in our domain and we liked the idea of joint end-to-end training of retriever and reader. Pretraining seems to improve zero shot recall for the retriever (See REAM ablation studies, table 2, row 2). This is of course very interesting when you want to develop retrievers that work on out of domain data (or even other languages) like we do. The actual engineering to do the training, especially creation and periodic updating of the document index for getting a training signal seems amazing, yet rather complex. I doubt we can get the training part into pytorch or haystack quickly.

We also see the retriever part of these papers similar to DPR and are interested in getting this into haystack. It seems as if the pretrained retriever weights (QueryEmbedder and DocumentEmbedder) could be easily plugged into the existing DPR code (we are currently changing from fb dpr code to HF dpr code in #308). We would only need conversion of tf Bert models to pytorch + add the last linear down projecting layer. My speculation of what the data in the GCP bucket (linked on REALM github) is:

cc_news_pretrained/embedder/ are the retriever files
cc_news_pretrained/embedder/saved_model.pb should be the pretrained downprojecting layers for question and document retrievers because the file is just 46MB large.
I only see one /realm-data/cc_news_pretrained/embedder/variables/variables.data-00000-of-00001 of size 420MB, which seems like the BERT they use as base encoder model. So since there is only one BERT model I believe they use tied weights for query and document embedder? From a quick scan of the papers I could not verify this hypothesis...

@antoniolanza1996 do you see it the same way or have other opinions, especially about the model files and conversion strategy? Would you like to work on this with out support?

antoniolanza1996 commented 4 years ago

Hi @Timoeller, thanks for considering this enhancement.

I totally agree with you on the training part. It could be really complex to integrate on the current Haystack version.

However, I also thought, like you, of something similar to DPR implementation. In particular, we can consider the retriever part composed by a BiEncoder (i.e. question encoder and document encoder). With this "abstraction", if one wants to use DPR as retriever component, he/she could use something similar to: biencoder = BiEncoder(question_encoder="dpr-question_encoder-single-nq-base", document_encoder="dpr-ctx_encoder-single-nq-base") Instead, if one wants to use ORQA: biencoder = BiEncoder(question_encoder="orqa-question_encoder", document_encoder="orqa-document_encoder") Similarly, for REALM: biencoder = BiEncoder(question_encoder="realm-question_encoder", document_encoder="realm-document_encoder")

If the models will be uploaded in HF, there will be no problems to quickly integrate ORQA e REALM in Haystack. In my main comment I have cited two HF issues which ask for models addition, however I don't know if someone will take them into account (I hope so).

Meanwhile, I have read ORQA and REALM code and I have come up with this speculations:

Files under realm-data/cc_news_pretrained/ are outputs of pre-training step. In particular, starting from ICT pre-trained model (introduced in ORQA paper), they apply the REALM pre-training and obtain 2 embedder models (1 for retrieval (i.e. realm-data/cc_news_pretrained/embedder/) and 1 for answer extraction (i.e. realm-data/cc_news_pretrained/bert/)).
At this point, I asked myself (and I noted that you had my same doubt) question encoder == document encoder? I have better investigated and I have also talked with one of the papers authors to understand something more. He said to me that, at the end of pre-training, indeed they are the same! However, after pre-training, we have to perform document indexing (thus using realm-data/cc_news_pretrained/embedder/) and then fine-tuning. During fine-tuning, document encoder is not considered (clearly because all the documents have been already indexed). But the question encoder (together with reader encoder) is trained. Thus, weights between document encoder (i.e. realm-data/cc_news_pretrained/embedder/) and question encoder (i.e. fine-training retriever output) may change. Indeed for this reason we can state that question encoder != document encoder.
Where should we find the question encoder and reader encoder? We have to see at the end of fine-tuning step (i.e. after orqa_experiment.py run). Here, TensorFlow saves both the models in one checkpoint (+++).

Thus, after this long (and boring) comment, I think that:

REALM document encoder could be downloaded from realm-data/cc_news_pretrained/embedder/
REALM question encoder and REALM reader encoder should be extracted from realm-data/orqa_nq_model_from_realm/export/best_default/checkpoint/ (or realm-data/orqa_wq_model_from_realm/export/best_default/checkpoint/ if we want the models fine-tuned on WebQuestions dataset).

Hope it helps. I will wait for your reply.

(+++) To confirm that TensorFlow saves both the models in one checkpoint, please consider authors example on NQ dataset. Their best checkpoint is stored in realm-data/orqa_nq_model_from_realm/export/best_default/checkpoint/. In terms of storage, this directory is approximately 6x realm-data/cc_news_pretrained/embedder (clearly don't consider realm-data/cc_news_pretrained/embedder/encoded/ which contains document embeddings). According to this issue, all it makes sense (indeed 3x for BERT question encoded + 3x for BERT reader encoder), thus in this checkpoint there are 2 BERT models.

antoniolanza1996 commented 4 years ago

I have also uploaded a notebook here if it can help to better understand what I have already tried.

Timoeller commented 4 years ago

Sorry for the late reply. This is indeed a detailed analysis of the pretrained models : )

During fine-tuning, document encoder is not considered (clearly because all the documents have been already indexed).

This I do not understand, since they have engineered this update-able index. Why not update the index for finetuning, too, to jointly train question AND document embedder.

Nevertheless, I think your suggestion of having a Biencoder, regardless of the underlying model, will be very useful and we will incorporate this in future releases. I would also prefer waiting for a transformers integration of REALM retrievers first. The finetuned models on NQ or WQ would make a nice comparison against DPR but the zero shot capabilities of the pretrained only retrievers are especially interesting to us.

Are you actively working on this? Getting the retriever weights into pytorch - and possibly wrap them around DPR transformers code should be doable - though we currently do not have one of our engineers planed for it.

antoniolanza1996 commented 4 years ago

Hi @Timoeller, I will try to better explain my findings.

This I do not understand, since they have engineered this update-able index. Why not update the index for finetuning, too, to jointly train question AND document embedder.

The update-able index is only used during pre-training. Indeed, the main goal of REALM paper is to introduce an outstanding pre-training step.

However, when the model should be fine-tuned for the task of Open-domain Question Answering, they states here that we have to use ORQA codebase. According to ORQA README, we have to run an orqa_experiment and this experiment uses orqa_model.py. Reading this code you can see that question embedder is initialised here using the BERT model stored after pre-training. Hence, during fine-tuning, its parameters could be fine-tuned accordingly. However, there is no document embedder loaded similarly to question embedder. Only here a ScaNN searcher is loaded. It contains the already indexed documents (i.e. document embeddings). This means that the document embedder will not fine-tuned (because they suppose that ALL the documents have already been indexed in ScaNN).

I have also found this comment in section 2.3 of this new paper:

"After pretraining, both ORQA and REALM freeze the passages index and encoder and subsequently fine-tune the question encoder to retrieve passages whose content helps a jointly-trained reader model extract the correct answer"

The meaning of this quoted sentence is really similar to what I have tried to explain.

Are you agree with me or is there something wrong?

antoniolanza1996 commented 4 years ago

I would also prefer waiting for a transformers integration of REALM retrievers first.

All right, it makes sense.

Are you actively working on this?

No, I have postponed this work as it is not strictly imminent to implement for my thesis. I will consider it in the coming weeks.

Timoeller commented 3 years ago

Are you agree with me or is there something wrong?

I agree with your explanation and findings. Not updating the doc index seems like a Google thing, you might want to find strategies to not update your whole internet index while improving performance through adjusting the query embedder only.

The ColBERT paper you referenced is also interesting. Really excited about so much progress in the field - on the other hand they possibly did not conduct thorough related work research. Have you seen this fb paper, it introduces a similar concept. Figure 1 gives a really good overview of bi encoders, crossencoders and poly encoders that could be seen as ColBERTs late interaction.

I will consider it in the coming weeks.

More than happy to collaborate to get these features into haystack once your thesis is finished.

antoniolanza1996 commented 3 years ago

I have never seen Poly-Encoders before but, after a first glance, I agree with you. Indeed, the main idea is similar.

As you said, ColBERT is really interesting. But, I think, it is more computational intensive compared to other dense approaches (e.g. DPR, ORQA, REALM) at inference time. Indeed, ColBERT requires a lot of memory and more math operations to compute MaxSim. As a simple example: For a single document, ORQA/REALM saves one 128-dim vector and DPR saves one 768-dim vector. Instead, ColBERT should save 120-150 128-dim vectors. Note: I have written 120-150 because ColBERT authors use the same wiki split as DPR. In this wiki split, each passage contains 100 words. Hence, in average, 100 words are approximately 120-150 tokens with DPRTokenizer.

I think that, after thorough tests, one could state if it is worth the risk. I hope so because the achieved results in the paper are really outstanding.

More than happy to collaborate to get these features into haystack once your thesis is finished.

As you previously advised, I have converted TF models in PyTorch ones and I have plugged them into DPR implementation. However, I had to use DPRTokenizer. In particular, I have used DPR checkpoint for query_tokenizer and passage_tokenizer and my converted checkpoint for query_encoder and passage_encoder (related to these lines).

I have tested both ORQA and REALM retriever models and they seems to work. In the coming weeks I'll run more intensive tests and I let you know if there are problems.

If you want, I can share PyTorch converted models, but this is not a real solution, it's just a rough "hack" that I have done to test these trained checkpoints. Hopefully, models will be published in HF. Hence, we can use them in a BiEncoder abstraction (as previously discussed).

antoniolanza1996 commented 3 years ago

About tokenizer, reading here, it seems that DPR uses exactly BertTokenizer.

I have better searched in GCP bucket and the only vocab.txt files found are here:

gs://orqa-data/ict/assets/vocab.txt
gs://realm-data/cc_news_pretrained/bert/assets/vocab.txt

And they are exactly the same.

To be 100% correct, should I feed this vocab.txt instead of using the Bert one?

Even though, I have compared these 2 vocab files and there are only some slightly differences (e.g. ##Â¢ in Bert vs ##¢ in ORQA/REALM).

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.

Timoeller commented 3 years ago

This seems to be an interesting use case. Unfortunately there is not much progress but I would like to keep this issue open as a reminder (especially if the models should be integrated in transformers).

mchari commented 3 years ago

@antoniolanza1996 , looks like you were able to convert the realm tf models into pytorch and get them into DPR encoders. Would you be able to share that code that converts the realm models into pytorch ? I used Huggingface's convert_bert_original_tf_checkpoint_to_pytorch.py to convert the question encoder checkpoint, but I got errors...

antoniolanza1996 commented 3 years ago

Hi @mchari, I've done this conversion some months ago but I didn't use any original conversion code. I've looked into Tensorflow and PyTorch variable names both for REALM and DPR and, according to some speculations (which clearly could be wrong), I've come up with this conversion notebook: https://github.com/antoniolanza1996/miscellaneous/blob/e2ba55badc1bf5751c73013a75c3ffa1bfc7fc03/haystack/issues/312/REALM_convert_TF_to_PyTorch_variables.ipynb. Hope this help...

However, consider that using ORQA and REALM only for retrieval is not good as DPR. Please see table 1 of ColBERT-QA paper which report these values on NQ dataset:

ORQA 13.9
REALM 38.5
DPR 67.1
ColBERT-QA 77.4

However, I've noted that ORQA and REALM can reduce this gap when you compare the entire pipeline (i.e. retriever+reader) on the EM metric (i.e. ORQA and REALM reader >> DPR reader???).

But, to be coherent with Haystack structure, I prefer to always use DPR as retriever and choose the reader between the numerous models fine-tuned on SQuAD 2.0 provided in HF model hub.

mchari commented 3 years ago

Thanks @antoniolanza1996 , for providing the code and the additional insight. My use case is not open QA. I just want to add a semantic component to a document search engine. As @Timoeller pointed out above, the zero-shot retrieval advertised by REALM is what we are trying to see if it helps in my case... I have seen that for my use case, zero-shot QA (given the right passage) returns the right answer most of the time. I have already tried DPR but the retrieval wasn't very good. I will also look into ColBERT-QA for completeness.

antoniolanza1996 commented 3 years ago

I have already tried DPR but the retrieval wasn't very good.

You can also try to fine-tune DPR with your data. I've noted some benefits doing it.

If you find something useful when you'll try REALM, please share your findings with the whole community :grinning:

mchari commented 3 years ago

Yes, that is the parallel plan of attack.

If you find something useful when you'll try REALM, please share your findings with the whole community 😀

Definitely. Will do.

mchari commented 3 years ago

When I create DPRContextEncoder and DPRQuestionEncoder from the pytorch converted checkpoints(many thanks to @antoniolanza1996 !) , I get the following messages from transformers/modeling_utils.py related to missing keys in the checkpoints.

Some weights of DPRQuestionEncoder were not initialized from the model checkpoint at question_checkpoint_REALM and are newly initialized: ['bert_model.embeddings.position_ids'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Some weights of DPRContextEncoder were not initialized from the model checkpoint at document_checkpoint_REALM and are newly initialized: ['bert_model.embeddings.position_ids'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Seems like that these warnings can be ignored. I do get embeddings for my questions, but none for the context... Investigating...

tholor commented 3 years ago

@mchari Looks like you are making great progress on the conversion! Great to see the community collaborating like this :)

['bert_model.embeddings.position_ids']

Yes, you can ignore this warning. It's due to a recent transformers change and has no impact on performance.

Let us know if you face any blockers. Happy to support!

mchari commented 3 years ago

@tholor , thanks for your support. Open source at its best :-)

I was debugging why Elasticsearch query_by_embedding gives an error, so I decided to sync up my code with the latest FARM and haystack repositories.

Using torch 1.6.0 tensorflow 2.3.1 transformers 3.1.0 farm-haystack 0.3.0

I am now getting KeyError['dpr'] when I try a documented invocation of DensePassageRetriever, for which I have filed a Issue ticket.

mchari commented 3 years ago

Thanks to the REALM tensorflow to pytorch conversion code provided by @antoniolanza1996. I followed Tutorial 6 to do semantic retrieval using dual encoders initialized with the converted checkpoints. I locally edited dense.DensePassageRetriever() to initialize the tokenizer models without REALM models: self.query_tokenizer = Tokenizer.load(pretrained_model_name_or_path="facebook/dpr-question_encoder-single-nq-base", do_lower_case=True, use_fast=use_fast_tokenizers)

self.passage_tokenizer = Tokenizer.load(pretrained_model_name_or_path="facebook/dpr-ctx_encoder-single-nq-base", do_lower_case=True, use_fast=use_fast_tokenizers)

The results are actually not very encouraging for my document set. I am getting much better results with the dpr-ctx... and dpr-question... models.

I am attaching the code so anyone can verify it for their use case. realm.zip

antoniolanza1996 commented 3 years ago

Hey @mchari, thanks for sharing your feedback.

The results are actually not very encouraging for my document set. I am getting much better results with the dpr-ctx... and dpr-question... models.

As I already mentioned in https://github.com/deepset-ai/haystack/issues/312#issuecomment-731652838, I think that ORQA and REALM aren't good as DPR on retrieval step. I also got your same results on other domains (e.g. on WikiMovies).

Probably it's required to run REALM pre-training and fine-tuning on our own data. But I didn't do that because I've obtained good results with:

DPR original checkpoints
DPR fine-tuned on my own data

If you think that could be promising in your use case, you can go through on ORQA/REALM pre-training/fine-tuning in order to improve results.

mchari commented 3 years ago

Yes, I have confirmed that the DPR checkpoints give way better results. Fine tuning DPR is my next step, before combining BM25 and DPR results. Good to see that FARM/Haystack provides the building blocks to try these out !

tholor commented 3 years ago

I locally edited dense.DensePassageRetriever() to initialize the tokenizer models without REALM models: self.query_tokenizer = Tokenizer.load(pretrained_model_name_or_path="facebook/dpr-question_encoder-single-nq-base", do_lower_case=True, use_fast=use_fast_tokenizers) ...

@mchari Just to make sure: Did you really use the DPR tokenizer ("facebook/dpr-question_encoder-single-nq-base") together with the REALM model? Haven't checked it, but I am pretty sure REALM has a completely different vocab and using DPR tokenizer will produce only gibberish here...

antoniolanza1996 commented 3 years ago

Hey @tholor , I've also used DPR tokenizer. There are only some slightly differences between DPR and REALM vocab files - please read here: https://github.com/deepset-ai/haystack/issues/312#issuecomment-685890745. However, I didn't investigate on the tokenizer output.

Do you think this could be a problem?

tholor commented 3 years ago

Ok, my bad. I was not aware that you already compared vocabs. Then this should be fine :+1: What were the slight differences you found (except ##Â¢ vs ##¢ )?

antoniolanza1996 commented 3 years ago

What were the slight differences you found (except ##Â¢ vs ##¢ )?

@tholor unfortunately I didn't go deeper on that but this happened only for weird characters like ¢ and every time BERT tokenizer puts Â before the character.

In my textual documents I didn't have these types of characters so I think this couldn't introduce noise in my results.

tholor commented 3 years ago

Got it - and order of vocab was also the same, right? We had sneaky bugs in the past where a vocab was shifted by one because of an extra token in vocab 1 compared to vocab 2.

antoniolanza1996 commented 3 years ago

and order of vocab was also the same, right?

I've analyzed the difference with git diff and, if I correctly remember (+), there wasn't that problem you're telling me.

(+) Consider that I've done this 3 months ago

mchari commented 3 years ago

Yet, it seemed like the REALM results were really bad, so I am now wondering whether I am missing anything... @antoniolanza1996 , were your REALM results much worse relative to the DPR models ?

antoniolanza1996 commented 3 years ago

@mchari yes, REALM results were much worse than DPR ones. In particular, I calculated the top-K accuracy for K=10 and K=100 in my use case:

top-10 accuracy REALM = 12.1
top-10 accuracy DPR = 52.8
top-100 accuracy REALM = 38.3
top-100 accuracy DPR = 68.3

I've two remarks:

I made some errors during conversion: this is possible considering that it's my first time working on that. Probably we can wait for an official PyTorch conversion in Tranformers (https://github.com/huggingface/transformers/issues/3497)
However, we have to consider that in the Natural Questions dataset benchmarks DPR obtained +28.6 on top-5 accuracy (i.e. REALM 38.5, DPR 67.1, I've reported here other info). So this makes me think getting such low results is a normal behavior.

antoniolanza1996 commented 3 years ago

And another important point is related to embeddings dimension: ORQA and REALM have 128-dimensional embeddings, DPR 768.

6x bigger embeddings could clearly help on giving more accurate results.

Timoeller commented 3 years ago

Hey @antoniolanza1996 thanks for adding your metrics here, that really helps comparing.

@mchari could you also add your evaluations metrics here, please? And please also do a quick sanity check: Are your REALM results significantly better than chance?

mchari commented 3 years ago

@Timoeller , right now my analysis is qualitative based on a handful of cases. I am in the process of creating training data to fine-tune DPR. Once that is done, I'll be in a position to have metrics for my results...

Timoeller commented 3 years ago

I was thinking of using: https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial5_Evaluation.py#L68 and benchmark your REALM embeddings with the official DPR ones with existing code and data.

But if you are creating domain train data and want to finetune DPR anyways, that is also fine.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.

anakin87 commented 6 months ago

These techniques have become obsolete today.

deepset-ai / haystack

Integrate ORQA and REALM for Open Domain Question Answering #312