deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

Issue with the Data Processor using the FarmReader to retrain a model #546

Closed F95GIT closed 3 years ago

F95GIT commented 3 years ago

Question: I am fine-tuning a pre-trained multilingual model that was trained on SQuAD ("salti/bert-base-multilingual-cased-finetuned-squad") with the German MLQA dataset, which is about 4 MB in size. I am using the FARMReader with the option to retrain the model. Everything works fine, but during data preprocessing only 20% of the data seems to be loaded and used.

[Screenshot, 2020-09-17: preprocessing progress bar appearing to stop at roughly 20% of the dataset]

Here is the code for the reader:

```python
# Import path for the Haystack version current at the time of this issue
from haystack.reader.farm import FARMReader

reader = FARMReader(
    model_name_or_path="salti/bert-base-multilingual-cased-finetuned-squad",
    use_gpu=True,
)

train_data = "/content/gdrive/My Drive/Colab Notebooks/Data_NLP/MQA/"
reader.train(
    data_dir=train_data,
    train_filename="test-context-de-question-de Kopie.json",
    use_gpu=True,
    n_epochs=4,
    save_dir="/content/gdrive/My Drive/1/6",
)
```

Could you help me with this problem so that I can use the entire training set?

Thanks in advance, Felix

Timoeller commented 3 years ago

I understand why you opened the issue here, but could you ask in the Haystack repo next time, since this is Haystack code?

We have trained on MLQA before without problems. To me it rather looks as if Haystack did not update the progress bar correctly but still converted everything. Let me walk through the numbers to show why I believe so: an epoch has 852 batches and the default batch size is 10, so each epoch handles 8520 text snippets/passages. The German MLQA test set has 4517 QA pairs, each with a corresponding context. If a context is longer than max_seq_len, it gets split into multiple parts. So some of these 4517 QA pairs are split and some are not, resulting in the 8520 text passages that need to be handled.
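
In case it helps to see the arithmetic spelled out, here is a small sketch. The batch numbers come from the training run above; the `n_passages` helper and its `max_seq_len`/`doc_stride` defaults are only illustrative assumptions about how SQuAD-style sliding-window splitting works, not the exact FARM implementation:

```python
import math

# Numbers from the training run above
batches_per_epoch = 852
batch_size = 10                                # FARM's default train batch size
passages = batches_per_epoch * batch_size      # 8520 text passages per epoch
qa_pairs = 4517                                # QA pairs in the German MLQA test set
print(f"{passages / qa_pairs:.2f} passages per QA pair")  # ~1.89

# Illustrative sketch of sliding-window splitting (assumed parameter values):
# a context longer than max_seq_len is covered by windows of max_seq_len
# tokens, each window shifted by doc_stride tokens from the previous one.
def n_passages(context_tokens: int, max_seq_len: int = 256, doc_stride: int = 128) -> int:
    if context_tokens <= max_seq_len:
        return 1
    return 1 + math.ceil((context_tokens - max_seq_len) / doc_stride)

print(n_passages(200))  # 1 passage: fits in a single window
print(n_passages(400))  # 3 passages: split across overlapping windows
```

So roughly every other QA pair producing one extra passage is enough to get from 4517 pairs to 8520 passages, which matches the batch count in the log.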

Sorry for the many numbers, but I quite like them :)

Timoeller commented 3 years ago

Seems resolved, closing now.