deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Getting stuck during DPR training #822

Closed · voidful closed this issue 3 years ago

voidful commented 3 years ago

Describe the bug
I followed the DPR training example (Training DPR from Scratch) with all the default settings. However, when preprocessing the dataset data/dpr_training/train/biencoder-nq-train.json, it gets stuck and stops responding.

Error message
Preprocessing Dataset data/dpr_training/train/biencoder-nq-train.json: 11% 6288/58880 [00:19<00:29, 1799.01 Dicts/s]

Additional context
I have tried using the latest FARM version from GitHub and updating transformers, but no luck.

To Reproduce
Run the notebook https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial9_DPR_training.ipynb

Timoeller commented 3 years ago

Hey @voidful I cannot reproduce your issue:

Preprocessing Dataset data/dpr_training/train/biencoder-nq-train.json: 100%|██████████| 58880/58880 [00:44<00:00, 1313.35 Dicts/s]

Did you check your RAM usage? Getting stuck during preprocessing without a warning usually indicates that you are swapping or running out of memory. Be aware that the unpacked train dataset is 7.4 GB on disk, and I see peak RAM usage of about 30 GB during preprocessing.

Please do not use the latest FARM master when running haystack DPR; we have made changes to DPR preprocessing in FARM that are not yet tested in haystack. Please use FARM==0.6.2 as specified in the haystack requirements.

voidful commented 3 years ago

My RAM usage is 40.5/126 GB, so it seems fine. I have rolled back to FARM==0.6.2, and the problem still exists.

Nevertheless, when I set the DataSilo's max_processes=1 in DensePassageRetriever.train, it works. Maybe it's a multiprocessing problem.
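For reference, the workaround looks roughly like this at the FARM level (a minimal sketch, assuming the FARM==0.6.2 DataSilo signature; processor stands for the already configured DPR processor from the tutorial, and the batch size is a placeholder):

# Sketch: build the DataSilo with multiprocessing disabled so preprocessing
# runs in the main process. Assumes FARM==0.6.2.
from farm.data_handler.data_silo import DataSilo

data_silo = DataSilo(
    processor=processor,   # placeholder: the DPR processor configured earlier in the tutorial
    batch_size=4,          # placeholder value
    distributed=False,
    max_processes=1,       # single process avoids the multiprocessing hang
)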

Timoeller commented 3 years ago

Ah nice, you seem to have a beefy machine : ) and also found a proper workaround.

Do you have a dockerized environment? There we often observe mp issues. If so did you try setting --ipc="host" when running the docker?
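For reference, the flag goes directly on the docker run invocation, e.g. docker run --ipc=host -it your-haystack-image:latest (the image name and remaining options are placeholders for your usual setup).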

voidful commented 3 years ago

I have tried setting --ipc="host", and it works like a charm. Never see that coming...

tholor commented 3 years ago

@Timoeller How about we add a check in FARM and log an info message if we are running in a container and multiprocessing is enabled? We can check via something like this: https://stackoverflow.com/questions/20010199/how-to-determine-if-a-process-runs-inside-lxc-docker
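Something along these lines could work for the check (a sketch only; the /proc/1/cgroup heuristic comes from the linked Stack Overflow answer and is not bulletproof, and the function names here are hypothetical):

import logging
import os

logger = logging.getLogger(__name__)


def probably_inside_container() -> bool:
    # Heuristic: /.dockerenv exists in Docker images, and /proc/1/cgroup
    # mentions docker/lxc/kubepods when PID 1 runs inside a container.
    if os.path.exists("/.dockerenv"):
        return True
    try:
        with open("/proc/1/cgroup", "rt") as f:
            content = f.read()
        return any(marker in content for marker in ("docker", "lxc", "kubepods"))
    except FileNotFoundError:
        return False


def warn_if_containerized_multiprocessing(max_processes: int) -> None:
    # Hypothetical hook: FARM could call this right before spawning worker processes.
    if max_processes > 1 and probably_inside_container():
        logger.info(
            "Running multiprocessing inside a container can hang due to limited shared memory. "
            "Consider starting docker with --ipc=host or setting max_processes=1."
        )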

Timoeller commented 3 years ago

Nice one @tholor, since we have seen these problems quite often. Added https://github.com/deepset-ai/FARM/issues/717 in FARM.

xpatronum commented 3 years ago

I've tried running the DPR training notebook in Colab and also got stuck. The problem is that calc_chunksize, which is used to calculate the default multiprocessing chunk size, doesn't take the available RAM into account.

import multiprocessing as mp

import numpy as np


def calc_chunksize(num_dicts, min_chunksize=4, max_chunksize=2000, max_processes=128):
    if mp.cpu_count() > 3:
        num_cpus = min(mp.cpu_count() - 1 or 1, max_processes)  # -1 to keep a CPU core free for xxx
    else:
        num_cpus = min(mp.cpu_count(), max_processes)  # when there are few cores, we use all of them

    dicts_per_cpu = np.ceil(num_dicts / num_cpus)
    # automatic adjustment of multiprocessing chunksize
    # for small files (containing few dicts) we want small chunksize to utilize all available cores but never less
    # than 2, because we need it to sample another random sentence in LM finetuning
    # for large files we want to minimize processor spawning without giving too much data to one process, so we
    # clip it at 5k
    multiprocessing_chunk_size = int(np.clip((np.ceil(dicts_per_cpu / 5)), a_min=min_chunksize, a_max=max_chunksize))
    # This lets us avoid cases in lm_finetuning where a chunk only has a single doc and hence cannot pick
    # a valid next sentence substitute from another document
    if num_dicts != 1:
        while num_dicts % multiprocessing_chunk_size == 1:
            multiprocessing_chunk_size -= -1
    dict_batches_to_process = int(num_dicts / multiprocessing_chunk_size)
    num_processes = min(num_cpus, dict_batches_to_process) or 1

    return multiprocessing_chunk_size, num_processes

I'm assuming not everyone has the opportunity to fine-tune DPR on their own machine with 40+ GB of RAM. So I suggest to:

  1. Improve calc_chunksize by taking the available RAM into account (a rough sketch of this follows after the list).
  2. As a quick fix in the tutorial, pass max_multiprocessing_chunksize=300 manually when creating the DataSilo object:
    # 3. Create a DataSilo that loads several datasets (train/dev/test), provides DataLoaders for them and calculates a few descriptive statistics of our datasets
    # NOTE: In FARM, the dev set metrics differ from test set metrics in that they are calculated on a token level instead of a word level
    data_silo = DataSilo(processor=processor, 
                     batch_size=batch_size, 
                     distributed=distributed,
                     max_multiprocessing_chunksize=300)

    The full notebook that I used to successfully train DPR from scratch with the xlm-roberta-base model is here
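For suggestion 1, one possible direction as a rough sketch only: cap the chunk size by the RAM that is actually free, e.g. via psutil. The bytes_per_dict estimate below is a made-up placeholder that would need to be measured on real DPR data:

import multiprocessing as mp

import numpy as np
import psutil  # assumption: psutil would be an acceptable extra dependency


def calc_chunksize_ram_aware(num_dicts, bytes_per_dict=50_000,
                             min_chunksize=4, max_chunksize=2000, max_processes=128):
    # Same idea as calc_chunksize above, but the upper bound on the chunk size
    # is additionally limited by the memory that is currently available.
    num_cpus = min(max(mp.cpu_count() - 1, 1), max_processes)

    # How many dicts can all workers hold at once while staying below ~50% of free RAM?
    available = psutil.virtual_memory().available
    ram_limited = max(int(0.5 * available / (bytes_per_dict * num_cpus)), min_chunksize)

    dicts_per_cpu = np.ceil(num_dicts / num_cpus)
    chunksize = int(np.clip(np.ceil(dicts_per_cpu / 5),
                            a_min=min_chunksize,
                            a_max=min(max_chunksize, ram_limited)))
    num_processes = min(num_cpus, max(int(num_dicts / chunksize), 1))
    return chunksize, num_processes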

Shall we discuss it here, or should I open a new issue?

voidful commented 3 years ago

In my case, lowering max_multiprocessing_chunksize had no effect, but I do agree with reducing max_multiprocessing_chunksize to fit more cases.

A straightforward solution might be to make both multiprocessing_chunk_size and num_processes configurable arguments.

By the way, using xlm-roberta-base causes a vocabulary mismatch in the haystack version; it may be related to issue https://github.com/deepset-ai/haystack/issues/783#issue-795818540

xpatronum commented 3 years ago

@voidful, you're right to point out the error related to the tokenizer being fixed to DPRTokenizer. For that reason I cloned the FARM repo rather than haystack, where the tokenizer is fixed.
Could you please share the error you're getting? In my case, using the PRO version of Colab with 25 GB of RAM solved the RAM overflow issue. Maybe you could try lowering it to, say, 100?

Timoeller commented 3 years ago

Nice @thenewera-ru! Any plans to put these multilingual DPR models on the HF model hub? We are also looking into multilingual DPR and would love to base our work on those models.

Did you already try any evaluation on non-English data, e.g. by transforming the MLQA SQuAD-style dataset into DPR format with our script?


About the RAM usage: I see that your Colab has 3 or, I guess, 4 CPUs. Did you attach a custom instance to it? Normally you could just set max_processes to 1 to disable multiprocessing (you still get Rust multithreading from the tokenizers lib) and reduce memory requirements that way. Though we are happy to discuss any proposals on how to incorporate RAM into the chunk size calculation.

voidful commented 3 years ago

Either setting --ipc="host" on docker run or setting max_processes=1 solves my problem. Lowering max_multiprocessing_chunksize has no effect in my case, even when reduced to 1.

voidful commented 3 years ago

> Did you already try any evaluation on non-English data, e.g. by transforming the MLQA SQuAD-style dataset into DPR format with our script?

That's what I'm trying to accomplish; I have already merged MLQA and all the DPR data for training and evaluation (30 GB train, 3 GB eval). I will update here when I have further results.

xpatronum commented 3 years ago

> Any plans to put these multilingual DPR models on the HF model hub? Did you already try any evaluation on non-English data?

@Timoeller yes, my diploma thesis is dedicated to researching the best combination of DPR query and passage encoders for multilingual tasks. For now I can say that, at least for Russian, the xlm-roberta-base version of RoBERTa outperforms bert-base-multilingual-cased by more than 20% in top-k retrieved passages (k=10) on this data (a Russian version of SQuAD). More information about this dataset is in this paper. Both models have seen the same number of training samples. This result somewhat surprised me, and an in-depth analysis of the embeddings generated at each layer led me to this paper (see page 4, plots (e) and (g)), which makes the answer to the question "Why is RoBERTa better than BERT for DPR?" more or less intuitive. P.S. I'm planning to have a paper ready by June this year; as soon as it's ready I'd be glad to share the results and the model, of course (still in the fine-tuning phase).

P.S. Right now I'm still in the process of further inspection, training, and evaluation of the model (I'm planning to fine-tune it more on Russian classical literature).

P.P.S. Let's start this "one DPR for all languages" marathon together :-)

voidful commented 3 years ago

I have uploaded my training results to Hugging Face:

https://huggingface.co/voidful/dpr-question_encoder-bert-base-multilingual
https://huggingface.co/voidful/dpr-ctx_encoder-bert-base-multilingual

Timoeller commented 3 years ago

Hey @voidful thanks for checking in with the mBert models.

Did you already evaluate those embedders? I see you used 73,710 QA pairs for dev. I guess this was without a full evaluation against many negative passages, but could you still post the dev results here?

voidful commented 3 years ago

Here is the evaluation result; it would be better to fine-tune further on a specific dataset.

\\|//       \\|//      \\|//       \\|//     \\|//
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
***************************************************
***** EVALUATION | TEST SET | AFTER 48318 BATCHES *****
***************************************************
\\|//       \\|//      \\|//       \\|//     \\|//
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

INFO - farm.eval -   
 _________ text_similarity _________
INFO - farm.eval -   loss: 0.004132559135671326
INFO - farm.eval -   task_name: text_similarity
INFO - farm.eval -   acc: 0.9985889216783691
INFO - farm.eval -   f1: 0.9435626102292769
INFO - farm.eval -   acc_and_f1: 0.971075765953823
INFO - farm.eval -   average_rank: 0.0921041921041921
INFO - farm.eval -   report: 
                precision    recall  f1-score   support

hard_negative     0.9993    0.9993    0.9993   5822490
     positive     0.9436    0.9436    0.9436     73710

     accuracy                         0.9986   5896200
    macro avg     0.9714    0.9714    0.9714   5896200
 weighted avg     0.9986    0.9986    0.9986   5896200

dg4271 commented 3 years ago

I've tried DPR training with KoBERT and the KorQuAD QA dataset.

  1. I also had the hang problem, and I solved it through "--ipc=host".

  2. But "kobert" transfomers model cause vocabulary mismatch.

    btw, using xlm-roberta-base will cause vocabulary mismatch in the haystack version, it may related to issue #783 (comment)

(@voidful seems to have the same issue.)

Traceback (most recent call last):
  File "dpr_train.py", line 92, in <module>
    tutorial9_dpr_training()
  File "dpr_train.py", line 89, in tutorial9_dpr_training
    reloaded_retriever = DensePassageRetriever.load(load_dir=save_dir, document_store=None)
  File "/workspace/haystack/haystack/retriever/dense.py", line 397, in load
    dpr = cls(
  File "/workspace/haystack/haystack/retriever/dense.py", line 139, in __init__
    self.passage_encoder = LanguageModel.load(pretrained_model_name_or_path=passage_embedding_model,
  File "/opt/conda/lib/python3.8/site-packages/farm/modeling/language_model.py", line 142, in load
    language_model = cls.subclasses[config["name"]].load(pretrained_model_name_or_path)
  File "/opt/conda/lib/python3.8/site-packages/farm/modeling/language_model.py", line 1542, in load
    dpr_context_encoder.model = transformers.DPRContextEncoder.from_pretrained(farm_lm_model, config=dpr_config, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1154, in from_pretrained
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for DPRContextEncoder:
        size mismatch for ctx_encoder.bert_model.embeddings.word_embeddings.weight: copying a param with shape torch.Size([8002, 768]) from checkpoint, the shape in current model is torch.Size([30522, 768]).

This is probably because, when training DPR from scratch, the FARM code below is fixed to the default transformers.DPRConfig (vocab_size==30522). Please tell me if I am misunderstanding anything.

else:
    # Pytorch-transformer Style
    model_type = AutoConfig.from_pretrained(pretrained_model_name_or_path).model_type
    if model_type == "dpr":
        # "pretrained dpr model": load existing pretrained DPRContextEncoder model
        dpr_context_encoder.model = transformers.DPRContextEncoder.from_pretrained(
            str(pretrained_model_name_or_path), **kwargs)
    else:
        # "from scratch": load weights from different architecture (e.g. bert) into DPRContextEncoder
        dpr_context_encoder.model = transformers.DPRContextEncoder(config=transformers.DPRConfig(**kwargs))
        dpr_context_encoder.model.base_model.bert_model = AutoModel.from_pretrained(
            str(pretrained_model_name_or_path), **kwargs)
    dpr_context_encoder.language = cls._get_or_infer_language_from_name(language, pretrained_model_name_or_path)

tholor commented 3 years ago

@dg4271 Please see https://github.com/deepset-ai/haystack/issues/840#issuecomment-780720829 and set infer_tokenizer_classes=True when initializing the DensePassageRetriever.
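For illustration, a minimal sketch of what that initialization could look like; the model names below are placeholders, and the exact set of arguments depends on your haystack version:

from haystack.retriever.dense import DensePassageRetriever

# Sketch only: replace the placeholder model names with your own encoders (e.g. a KoBERT checkpoint).
retriever = DensePassageRetriever(
    document_store=None,                                     # not needed while training
    query_embedding_model="bert-base-multilingual-cased",    # placeholder query encoder
    passage_embedding_model="bert-base-multilingual-cased",  # placeholder passage encoder
    infer_tokenizer_classes=True,  # infer tokenizer classes from the model instead of forcing DPRTokenizer
)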