Closed: voidful closed this issue 3 years ago.
Hey @voidful I cannot reproduce your issue:
Preprocessing Dataset data/dpr_training/train/biencoder-nq-train.json: 100%|██████████| 58880/58880 [00:44<00:00, 1313.35 Dicts/s]
Did you check your RAM usage? Getting stuck during preprocessing without warning indicates you are swapping or are simply OOM. Be aware that the unpacked train dataset is 7.4 GB on disk, and my peak RAM usage during preprocessing is 30 GB.
Please do not use the latest FARM master when running haystack DPR; we have made changes to DPR preprocessing in FARM that are not yet tested in haystack. Please use FARM==0.6.2 as specified in the haystack requirements.
My RAM usage is 40.5/126 GB, which seems fine. I have rolled back to FARM==0.6.2, and the problem still exists.
Nevertheless, when I set DataSilo max_processes=1 in DensePassageRetriever.train, it works. Maybe it's a multiprocessing problem.
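For anyone else hitting this, a minimal sketch of the workaround in FARM terms (a sketch only; in haystack the DataSilo is created inside DensePassageRetriever.train, so you may need to adapt that call, and the processor/batch_size below are placeholders):
from farm.data_handler.data_silo import DataSilo

# max_processes=1 disables Python multiprocessing during preprocessing;
# tokenization still benefits from the Rust tokenizers' own multithreading.
data_silo = DataSilo(processor=processor,   # a previously created FARM Processor
                     batch_size=16,         # placeholder value
                     distributed=False,
                     max_processes=1)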
Ah nice, you seem to have a beefy machine : ) and you also found a proper workaround.
Do you have a dockerized environment? We often observe multiprocessing issues there. If so, did you try setting --ipc="host" when running the docker container?
I have tried setting --ipc="host", and it works like a charm. Never saw that coming...
@Timoeller How about we add a check in FARM and log an info message if we are running in a container and multiprocessing is enabled? We can check via something like this: https://stackoverflow.com/questions/20010199/how-to-determine-if-a-process-runs-inside-lxc-docker
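A rough sketch of the kind of check that StackOverflow answer describes (just an illustration, not the FARM implementation): look for /.dockerenv or for docker/lxc entries in /proc/1/cgroup.
import os

def likely_running_in_container() -> bool:
    # /.dockerenv exists inside Docker containers; /proc/1/cgroup mentions
    # docker or lxc when PID 1 runs inside a container (Linux only).
    if os.path.exists("/.dockerenv"):
        return True
    try:
        with open("/proc/1/cgroup", "rt") as f:
            return any("docker" in line or "lxc" in line for line in f)
    except OSError:
        return False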
Nice one @tholor since we have seen these problems quite often. Added https://github.com/deepset-ai/FARM/issues/717 in FARM
I've tried running the DPR training notebook in Colab and also got stuck.
The problem is that calc_chunksize, which is used to calculate the default max_chunksize, doesn't take the available RAM into account (for example, on a 2-CPU Colab instance with the 58,880 training dicts it returns the maximum chunksize of 2000, no matter how much RAM is free):
def calc_chunksize(num_dicts, min_chunksize=4, max_chunksize=2000, max_processes=128):
    if mp.cpu_count() > 3:
        num_cpus = min(mp.cpu_count() - 1 or 1, max_processes)  # -1 to keep a CPU core free for xxx
    else:
        num_cpus = min(mp.cpu_count(), max_processes)  # when there are few cores, we use all of them
    dicts_per_cpu = np.ceil(num_dicts / num_cpus)
    # automatic adjustment of multiprocessing chunksize
    # for small files (containing few dicts) we want small chunksize to ulitize all available cores but never less
    # than 2, because we need it to sample another random sentence in LM finetuning
    # for large files we want to minimize processor spawning without giving too much data to one process, so we
    # clip it at 5k
    multiprocessing_chunk_size = int(np.clip((np.ceil(dicts_per_cpu / 5)), a_min=min_chunksize, a_max=max_chunksize))
    # This lets us avoid cases in lm_finetuning where a chunk only has a single doc and hence cannot pick
    # a valid next sentence substitute from another document
    if num_dicts != 1:
        while num_dicts % multiprocessing_chunk_size == 1:
            multiprocessing_chunk_size -= -1
    dict_batches_to_process = int(num_dicts / multiprocessing_chunk_size)
    num_processes = min(num_cpus, dict_batches_to_process) or 1
    return multiprocessing_chunk_size, num_processes
I'm assuming not everyone has the opportunity to fine-tune DPR on their own machine with 40+ GB of RAM. So I suggest to:
- Improve calc_chunksize by taking the available RAM into account (a rough sketch of what I mean follows after the DataSilo snippet below).
- As a quick fix in the tutorial: pass max_multiprocessing_chunksize=300 manually when creating the DataSilo object.
# 3. Create a DataSilo that loads several datasets (train/dev/test), provides DataLoaders for them and calculates a few descriptive statistics of our datasets
# NOTE: In FARM, the dev set metrics differ from test set metrics in that they are calculated on a token level instead of a word level
data_silo = DataSilo(processor=processor,
                     batch_size=batch_size,
                     distributed=distributed,
                     max_multiprocessing_chunksize=300)
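For the first point, something along these lines is what I have in mind. This is only a sketch (untested, using psutil as an extra dependency), and BYTES_PER_DICT is a made-up per-dict memory estimate that would need to be measured for DPR preprocessing:
import multiprocessing as mp
import numpy as np
import psutil

BYTES_PER_DICT = 2 * 1024 ** 2  # hypothetical: ~2 MB peak memory per dict during preprocessing

def calc_chunksize_ram_aware(num_dicts, min_chunksize=4, max_chunksize=2000, max_processes=128):
    # same sizing logic as calc_chunksize above ...
    num_cpus = min(max(mp.cpu_count() - 1, 1), max_processes)
    dicts_per_cpu = np.ceil(num_dicts / num_cpus)
    chunksize = int(np.clip(np.ceil(dicts_per_cpu / 5), a_min=min_chunksize, a_max=max_chunksize))
    # ... plus an extra cap: num_cpus workers each holding a full chunk
    # should fit into the RAM that is currently available.
    available_bytes = psutil.virtual_memory().available
    ram_capped = max(int(available_bytes / (num_cpus * BYTES_PER_DICT)), min_chunksize)
    chunksize = min(chunksize, ram_capped)
    num_processes = min(num_cpus, int(num_dicts / chunksize)) or 1
    return chunksize, num_processes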
The full notebook that I used to successfully train DPR from scratch using the xlm-roberta-base version is here.
Let's discuss it here, or should I open a new issue?
In my case, lowering max_multiprocessing_chunksize had no effect, but I do agree with reducing max_multiprocessing_chunksize to fit more cases.
A straightforward solution may be to make both multiprocessing_chunk_size and num_processes arguments.
Btw, using xlm-roberta-base will cause a vocabulary mismatch in the haystack version; it may be related to issue https://github.com/deepset-ai/haystack/issues/783#issue-795818540
@voidful, you're right to point out the error related to the tokenizer being fixed to DPRTokenizer. For that reason I cloned the farm repo and not haystack, where the tokenizer is fixed.
Could you please share the error you're getting? In my case, using the PRO version of Colab with 25 GB of RAM solved the RAM overflow issue. Maybe you could try lowering the chunksize down to, say, 100?
Nice @thenewera-ru! Any plans to put these multilingual DPR models on the HF model zoo? We are also looking into multilingual DPR and would love to base our work on those models.
Did you already try any evaluation on non-English data, e.g. by transforming the MLQA SQuAD-style dataset into DPR format with our script?
About the RAM usage: I see that your Colab has 3 or, I guess, 4 CPUs. Did you attach a custom instance to it? Normally you could just set max_processes to 1 to disable multiprocessing (you still get Rust multithreading from the tokenizers lib) and reduce memory requirements that way. Though we are happy to discuss any proposals on how to incorporate RAM into the chunksize calculation.
Either setting --ipc="host" on docker run or max_processes=1 solves my problem. Lowering max_multiprocessing_chunksize has no effect in my case, even when reduced to 1.
That's what I'm trying to accomplish. I have already merged MLQA and all the DPR data for training and evaluation (30 GB train, 3 GB eval). I will update here when I have further results.
@Timoeller yes, my diploma work is dedicated to researching the best combination of DPR query-passage encoders in multilingual tasks.
For now I can say that, at least for Russian, the xlm-roberta-base version of RoBERTa outperforms bert-base-multilingual-cased in terms of top-k retrieved passages on this data (a Russian version of SQuAD) by more than 20% when k=10. More information about this dataset is in this paper. Both models have seen the same number of training samples.
This result kind of surprised me, and an in-depth analysis of the embeddings generated at each layer led me to this paper (see page 4, plots (e) and (g)), from which the answer to the question "Why is RoBERTa better than BERT in DPR?" is more or less intuitive.
P.S. I'm planning to have a paper ready by June this year. Right now I'm still in the process of further training and evaluating the model (I'm planning to fine-tune it more on Russian classical literature). As soon as it's ready I'd be glad to share the results and the model, of course.
P.P.S. Let's start this "one DPR for all languages" marathon together :-)
I have uploaded my training results to Hugging Face:
https://huggingface.co/voidful/dpr-question_encoder-bert-base-multilingual https://huggingface.co/voidful/dpr-ctx_encoder-bert-base-multilingual
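In case it helps, roughly how they can be loaded (a sketch using the standard transformers DPR classes; I'm assuming these checkpoints work with the stock DPR tokenizer/encoder classes):
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

# Question encoder
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("voidful/dpr-question_encoder-bert-base-multilingual")
q_encoder = DPRQuestionEncoder.from_pretrained("voidful/dpr-question_encoder-bert-base-multilingual")

# Context/passage encoder
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("voidful/dpr-ctx_encoder-bert-base-multilingual")
ctx_encoder = DPRContextEncoder.from_pretrained("voidful/dpr-ctx_encoder-bert-base-multilingual")

# Embed a question; passages are embedded the same way with the context encoder.
inputs = q_tokenizer("how to train a multilingual DPR model?", return_tensors="pt")
question_embedding = q_encoder(**inputs).pooler_output  # shape: (1, 768)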
Hey @voidful, thanks for checking in with the mBERT models.
Did you evaluate those embedders already? I see you used 73,710 QA pairs for dev. I guess without a full evaluation with many negative passages, but still, could you post the dev results here?
Here is the evaluation result; it should be better to fine-tune further on a specific dataset.
***** EVALUATION | TEST SET | AFTER 48318 BATCHES *****
INFO - farm.eval - _________ text_similarity _________
INFO - farm.eval - loss: 0.004132559135671326
INFO - farm.eval - task_name: text_similarity
INFO - farm.eval - acc: 0.9985889216783691
INFO - farm.eval - f1: 0.9435626102292769
INFO - farm.eval - acc_and_f1: 0.971075765953823
INFO - farm.eval - average_rank: 0.0921041921041921
INFO - farm.eval - report:
               precision    recall  f1-score    support
hard_negative     0.9993    0.9993    0.9993    5822490
positive          0.9436    0.9436    0.9436      73710
accuracy                              0.9986    5896200
macro avg         0.9714    0.9714    0.9714    5896200
weighted avg      0.9986    0.9986    0.9986    5896200
I've tried DPR training with kobert and the KorQuAD QA dataset.
I also had the hang problem, and I solved it through "--ipc=host".
But the "kobert" transformers model causes a vocabulary mismatch.
As mentioned above ("using xlm-roberta-base will cause a vocabulary mismatch in the haystack version, it may be related to issue #783 (comment)"), @voidful seems to have the same issue.
Traceback (most recent call last):
  File "dpr_train.py", line 92, in <module>
    tutorial9_dpr_training()
  File "dpr_train.py", line 89, in tutorial9_dpr_training
    reloaded_retriever = DensePassageRetriever.load(load_dir=save_dir, document_store=None)
  File "/workspace/haystack/haystack/retriever/dense.py", line 397, in load
    dpr = cls(
  File "/workspace/haystack/haystack/retriever/dense.py", line 139, in __init__
    self.passage_encoder = LanguageModel.load(pretrained_model_name_or_path=passage_embedding_model,
  File "/opt/conda/lib/python3.8/site-packages/farm/modeling/language_model.py", line 142, in load
    language_model = cls.subclasses[config["name"]].load(pretrained_model_name_or_path)
  File "/opt/conda/lib/python3.8/site-packages/farm/modeling/language_model.py", line 1542, in load
    dpr_context_encoder.model = transformers.DPRContextEncoder.from_pretrained(farm_lm_model, config=dpr_config, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1154, in from_pretrained
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for DPRContextEncoder:
    size mismatch for ctx_encoder.bert_model.embeddings.word_embeddings.weight: copying a param with shape torch.Size([8002, 768]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
This is probably because, when training DPR from scratch, the FARM code below fixes the config to the default transformers.DPRConfig (vocab_size == 30522). Please tell me if there is anything I am misunderstanding.
else:
    # Pytorch-transformer Style
    model_type = AutoConfig.from_pretrained(pretrained_model_name_or_path).model_type
    if model_type == "dpr":
        # "pretrained dpr model": load existing pretrained DPRContextEncoder model
        dpr_context_encoder.model = transformers.DPRContextEncoder.from_pretrained(
            str(pretrained_model_name_or_path), **kwargs)
    else:
        # "from scratch": load weights from different architecture (e.g. bert) into DPRContextEncoder
        dpr_context_encoder.model = transformers.DPRContextEncoder(config=transformers.DPRConfig(**kwargs))
        dpr_context_encoder.model.base_model.bert_model = AutoModel.from_pretrained(
            str(pretrained_model_name_or_path), **kwargs)
    dpr_context_encoder.language = cls._get_or_infer_language_from_name(language, pretrained_model_name_or_path)
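Just to illustrate what I mean, a sketch of the kind of change that would avoid the hard-coded default (not a tested patch against FARM, only a modification of the "from scratch" branch above): derive the DPRConfig from the base model's config so that vocab_size matches the checkpoint.
# Sketch only: build the DPRConfig from the underlying model's config instead of
# the transformers defaults, so e.g. vocab_size (8002 for kobert) is preserved.
base_config = AutoConfig.from_pretrained(pretrained_model_name_or_path)
dpr_config = transformers.DPRConfig(vocab_size=base_config.vocab_size, **kwargs)
dpr_context_encoder.model = transformers.DPRContextEncoder(config=dpr_config)
dpr_context_encoder.model.base_model.bert_model = AutoModel.from_pretrained(
    str(pretrained_model_name_or_path), **kwargs)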
@dg4271 Please see https://github.com/deepset-ai/haystack/issues/840#issuecomment-780720829 and set infer_tokenizer_classes=True when initializing DPR.
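For reference, roughly what that looks like when constructing the retriever (a sketch; the exact signature depends on your haystack version, and the model names and document store here are placeholders):
from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,              # your existing document store
    query_embedding_model="xlm-roberta-base",   # placeholder non-DPR base models
    passage_embedding_model="xlm-roberta-base",
    infer_tokenizer_classes=True,               # infer the tokenizer class from the model instead of forcing DPRTokenizer
)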
Describe the bug: I follow the DPR training example with all the default settings (Training DPR from Scratch). However, when preprocessing the dataset data/dpr_training/train/biencoder-nq-train.json, it gets stuck with no response.
Error message: Preprocessing Dataset data/dpr_training/train/biencoder-nq-train.json: 11% 6288/58880 [00:19<00:29, 1799.01 Dicts/s]
Additional context: I have tried using the latest FARM version from GitHub and updating transformers, but had no luck.
To Reproduce: Run the notebook https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial9_DPR_training.ipynb