dbmdz / berts

Italian BERT #39

Closed: NeuroinformaticaFBF closed this issue 2 years ago

NeuroinformaticaFBF commented 2 years ago

Dear dbmdz team

I would like to use one of the Italian BERT models that you pre-trained to create a model trained on a specific topic (medical language). I would like to ask a few things that are not entirely clear to me:

Sorry if these are trivial questions. Thanks a lot!

stefan-it commented 2 years ago

Hi @NeuroinformaticaFBF ,

thanks for your interest in our model :hugs:

To answer your questions:

As we use the official BERT implementation :)
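
For reference, the released checkpoints can also be loaded directly with Hugging Face Transformers as a starting point for further domain-specific masked-LM training. A minimal sketch, assuming the public `dbmdz/bert-base-italian-xxl-cased` model id (any of the Italian variants can be substituted):

```python
# Minimal sketch (assumption: the public dbmdz/bert-base-italian-xxl-cased
# checkpoint; swap in whichever Italian variant you actually use).
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "dbmdz/bert-base-italian-xxl-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Sanity check: the MLM head is present, so the checkpoint can be used as a
# starting point for continued pre-training on domain-specific text.
inputs = tokenizer("Il paziente presenta una [MASK] acuta.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (batch_size, sequence_length, vocab_size)
```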

NeuroinformaticaFBF commented 2 years ago

Many thanks for the reply, it was surprisingly quick!

So, since you did not use WWM, further training of the model should not use it either, in order to stay consistent, right? And would it be possible to apply WWM to your model in some way, or would it require a new training from scratch?

Many thanks for the support

stefan-it commented 2 years ago

Unfortunately, using WWM would require a complete re-training, which includes:

Depending on the pre-training corpus size, the creation of TFRecords (= pre-training data) could take longer than the actual pre-training on TPU (when using e.g. a v3-32 TPU, pre-training takes ~66 hours for 3M steps).
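
As a side note for readers who continue pre-training with Hugging Face Transformers rather than the original TensorFlow pipeline: whole-word masking can also be applied at batch-collation time via `DataCollatorForWholeWordMask`. This is a different workflow from the TFRecord-based one described above and is not how the released checkpoints were trained; a minimal sketch, with a hypothetical corpus file name:

```python
# Minimal sketch: continued masked-LM training with whole-word masking applied
# at collation time. This is an alternative to regenerating TFRecords with the
# official BERT scripts, NOT the pipeline used for the released dbmdz models.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

model_name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical plain-text corpus, one document per line.
dataset = load_dataset("text", data_files={"train": "medical_corpus_it.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Whole-word masking: all sub-word pieces of a selected word are masked together.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="italian-medical-bert",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

To stay consistent with the masking used for the released checkpoints (standard sub-word masking), `DataCollatorForLanguageModeling` can be used in place of the whole-word collator.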

NeuroinformaticaFBF commented 2 years ago

Dear @stefan-it

We are working intensively with your model to create our Italian medical BERT model.

One question that I would like to ask: as I see from the Hugging Face page of the model, "The source data for the Italian BERT model consists of a recent Wikipedia dump and various texts from the OPUS corpora collection... For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the OSCAR corpus". Were these datasets originally written in Italian, or were they English texts that were translated?

I'm asking this because my medical corpus was originally written in English, and I used the Google Translate API to translate it. So I would like to estimate the bias introduced by this operation.

Many thanks, Cheers

See https://github.com/dbmdz/berts/issues/43 for the answer

stefan-it commented 2 years ago

Hi @NeuroinformaticaFBF , sorry that I missed that question! I hope the answer in the linked issue clarifies it a bit more :)