Open yu-rovikov opened 1 year ago
Hello, I have the same problem, except that my dataset contains hard_neg_examples but no hard_neg.
The problem was in the Hydra configuration. Although my biencoder_default.yaml had batch_size: 2, the script (train_dense_encoder.py) still ran with batch_size=1. I did not find a regular way to set batch_size=2 (or any other value). A temporary workaround is to run the script from the command line as follows:
python train_dense_encoder.py train_datasets=[lean_questions_one_lemma_train] dev_datasets=[lean_questions_one_lemma_dev] train=biencoder_local output_dir=outputs train.batch_size=<NEW BATCH_SIZE>
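A possible explanation of why the batch size matters here, as a minimal sketch rather than DPR's actual code: with in-batch negatives, the negatives for each question are the passages that belong to the other questions in the same batch. If the batch holds a single question and the dataset has no explicit negatives, the softmax runs over one candidate passage, so the loss is exactly 0. The function name in_batch_nll and the toy vectors below are hypothetical and only illustrate the arithmetic.

```python
import math


def in_batch_nll(q_vecs, p_vecs):
    """Mean negative log-likelihood with in-batch negatives.

    q_vecs[i] is the embedding of question i and p_vecs[i] is the embedding of
    its positive passage. Every other passage in the batch acts as a negative.
    (Hypothetical helper for illustration; not taken from the DPR repository.)
    """
    losses = []
    for i, q in enumerate(q_vecs):
        # Dot-product similarity between question i and every passage in the batch.
        scores = [sum(a * b for a, b in zip(q, p)) for p in p_vecs]
        # Numerically stable log-sum-exp over all candidate passages.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-softmax probability of the positive passage.
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


# Batch of one question, one positive, no negatives: a single candidate passage.
loss_single = in_batch_nll([[0.3, -1.2]], [[0.7, 0.1]])

# Batch of two: each question sees the other question's positive as a negative.
loss_pair = in_batch_nll(
    [[0.3, -1.2], [1.0, 0.5]],
    [[0.7, 0.1], [0.2, 0.9]],
)

print(loss_single, loss_pair)
```

In this sketch the single-example batch always gives a loss of 0 regardless of the embeddings, while the two-example batch gives a positive loss. That would be consistent with training starting to work once batch_size is greater than 1.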
Hi! I've been trying to train DPR using the in-batch negatives schema on a custom dataset with no negative_ctxs or hard_negative_ctxs, using the default configs. It appears that the network does not train properly on such datasets. In particular, the loss on every training step is 0:
It seems that the issue is indeed due to the absence of negative examples in the dataset: when I add random positive paragraphs from other questions as negatives, the retriever seems to train properly:
However, I don't want any fixed random paragraphs as negatives in my dataset. It seems that either the in-batch negatives schema does not apply when there are no negative_ctxs, or it does not apply in the default settings at all. I was not able to find the reason in _calc_loss (train_dense_encoder.py).
Is it possible to train the retriever on such datasets? Or do I need at least one negative_ctxs entry for each data point? Thank you!
P.S. The dataset I am using looks like this (two examples):
It is designed to search relevant lemmas for automated theorem proving.
The only thing I changed in the repo is the encoder_train_default.yaml config, where I added my custom dataset: