dbmdz / berts

Italian BERT #39

Closed: NeuroinformaticaFBF closed this issue 2 years ago

NeuroinformaticaFBF commented 2 years ago

Dear dbmdz team

I would like to use one of the Italian BERT models that you pre-trained to create a model trained on a specific topic (medical language). I would like to ask a few things that are not entirely clear to me:

Sorry if these are trivial questions. Thanks a lot!

stefan-it commented 2 years ago

Hi @NeuroinformaticaFBF ,

thanks for your interest in our model :hugs:

To answer your questions:

As we use the official BERT implementation :)
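
For reference, the released checkpoints can also be loaded directly with Hugging Face Transformers as a starting point for further domain-specific masked-LM training. A minimal sketch, assuming the public `dbmdz/bert-base-italian-xxl-cased` model id (any of the Italian variants can be substituted):

```python
# Minimal sketch (assumption: the public dbmdz/bert-base-italian-xxl-cased
# checkpoint; swap in whichever Italian variant you actually use).
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "dbmdz/bert-base-italian-xxl-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Sanity check: the MLM head is present, so the checkpoint can be used as a
# starting point for continued pre-training on domain-specific text.
inputs = tokenizer("Il paziente presenta una [MASK] acuta.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (batch_size, sequence_length, vocab_size)
```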

NeuroinformaticaFBF commented 2 years ago

Many thanks for the reply, it was surprisingly quick!

So, since you did not use WWM, further training of the model should not use it either, in order to stay consistent, right? And would it be possible to apply WWM to your model in some way, or would it require a new training from scratch?

Many thanks for the support

stefan-it commented 2 years ago

Unfortunately, using WWM would require a complete re-training, which includes:

Depending on the pre-training corpus size, the creation of TFRecords (= pre-training data) could take longer than the actual pre-training on TPU (when using e.g. a v3-32 TPU, pre-training takes ~66 hours for 3M steps).
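
As a side note for readers who continue pre-training with Hugging Face Transformers rather than the original TensorFlow pipeline: whole-word masking can also be applied at batch-collation time via `DataCollatorForWholeWordMask`. This is a different workflow from the TFRecord-based one described above and is not how the released checkpoints were trained; a minimal sketch, with a hypothetical corpus file name:

```python
# Minimal sketch: continued masked-LM training with whole-word masking applied
# at collation time. This is an alternative to regenerating TFRecords with the
# official BERT scripts, NOT the pipeline used for the released dbmdz models.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

model_name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical plain-text corpus, one document per line.
dataset = load_dataset("text", data_files={"train": "medical_corpus_it.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Whole-word masking: all sub-word pieces of a selected word are masked together.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="italian-medical-bert",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

To stay consistent with the masking used for the released checkpoints (standard sub-word masking), `DataCollatorForLanguageModeling` can be used in place of the whole-word collator.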

NeuroinformaticaFBF commented 2 years ago

Dear @stefan-it

We are working intensively with your model to create our Italian medical BERT model.

One question that I would like to ask: as I see from the Hugging Face page of the model, "The source data for the Italian BERT model consists of a recent Wikipedia dump and various texts from the OPUS corpora collection... For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the OSCAR corpus". Were these datasets originally written in Italian, or were they English texts that were translated?

I'm asking this because my medical corpus was originally written in English, and I used the Google Translate API to translate it. So I would like to estimate the bias introduced by this operation.

Many thanks, Cheers

See https://github.com/dbmdz/berts/issues/43 for the answer

stefan-it commented 2 years ago

Hi @NeuroinformaticaFBF , sorry that I missed that question! I hope the answer in the linked issue clarifies it a bit more :)