Hi @NeuroinformaticaFBF ,
thanks for your interest in our model :hugs:
To answer your questions: we did not use whole word masking (WWM), as we use the official BERT implementation, where it is not enabled by default :)
Many thanks for the reply, it was surprisingly quick!
So, since you did not use WWM, further training of the model should not use it either, to stay consistent with the original pre-training, right? And would it be possible to apply WWM to your model in some way, or would it require a new training from scratch?
Many thanks for the support
Unfortunately, using WWM would require a complete re-training, which includes:
- using the `--do_whole_word_mask` parameter for the `create_pretraining_data.py` script
- depending on the pre-training corpus size, the creation of TFRecords (= pre-training data) could take longer than the actual pre-training on TPU (when using e.g. a v3-32 TPU, pre-training takes ~66 hours for 3M steps)
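For illustration only, here is a minimal Python sketch of what WWM changes, assuming the `dbmdz/bert-base-italian-xxl-cased` tokenizer from the HuggingFace Hub and a made-up sentence (neither taken from this thread): WordPiece splits words into `##`-prefixed sub-tokens, and whole word masking masks all sub-tokens of a selected word together, whereas the default masking selects sub-tokens independently.

```python
# Minimal sketch (not from this thread): group WordPiece sub-tokens back into
# whole words to show which units whole word masking (WWM) would mask together.
# The tokenizer name is assumed to be the XXL Italian model on the HuggingFace Hub.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
tokens = tokenizer.tokenize("Il paziente presenta una broncopneumopatia cronica ostruttiva")

words, current = [], []
for tok in tokens:
    if tok.startswith("##"):
        current.append(tok)      # continuation of the previous word
    else:
        if current:
            words.append(current)
        current = [tok]          # start of a new word
if current:
    words.append(current)

# Default masking picks sub-tokens independently; WWM would instead mask all
# sub-tokens within one of these groups at the same time.
print(words)
```

Long domain-specific words (common in medical text) are split into many sub-tokens, which is where WWM behaves most differently from the default masking.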
Dear @stefan-it
We are working intensively with your model to create our Italian medical BERT model.
One question that I would like to ask: as I see from the HuggingFace page of the model, "The source data for the Italian BERT model consists of a recent Wikipedia dump and various texts from the OPUS corpora collection... For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the OSCAR corpus". Were these datasets originally written in Italian, or were they English texts that were translated?
I'm asking this because my medical corpus was originally written in English, and I used the Google Translate API to translate it, so I would like to estimate the bias introduced by this operation.
Many thanks, Cheers
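As a side note on the OSCAR part of the training data mentioned above, one way to get a feel for that corpus is to skim a few documents from its Italian split. Below is a hypothetical sketch using the HuggingFace `datasets` library; the dataset and configuration names are assumptions, not taken from this thread.

```python
# Hypothetical sketch: skim a few documents from the Italian part of OSCAR
# (dataset and configuration names are assumptions, not taken from this thread).
from datasets import load_dataset

# Stream the corpus so that nothing large is downloaded up front.
oscar_it = load_dataset("oscar", "unshuffled_deduplicated_it", split="train", streaming=True)

for i, doc in enumerate(oscar_it):
    print(doc["text"][:200])
    if i == 2:
        break
```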
See https://github.com/dbmdz/berts/issues/43 for the answer
Hi @NeuroinformaticaFBF, sorry that I missed that question! I hope the answer in the linked issue clarifies it a bit more :)
Dear dbmdz team
I would like to use one of the Italian BERT models that you pre-trained to create a model trained on a specific topic (medical language). I would like to ask a few things that are not entirely clear to me:
Sorry if these may be trivial questions. Thanks a lot
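For reference, domain-adaptive training of this kind is often done by simply continuing masked-language-model training on the in-domain corpus. Below is a minimal sketch with the HuggingFace Transformers `Trainer`, assuming the `dbmdz/bert-base-italian-xxl-cased` checkpoint and a placeholder corpus file; this is an alternative to the official TensorFlow BERT scripts referenced earlier in the thread, not the maintainers' own pipeline.

```python
# Hypothetical sketch of continuing masked-language-model pre-training on a
# domain-specific corpus with HuggingFace Transformers. This is NOT the
# official TensorFlow pipeline mentioned by the maintainers above; file names
# and hyper-parameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# "medical_corpus.txt" is a placeholder: plain text, one document per line.
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard sub-token-level masking, i.e. the same scheme the model was
# originally trained with, so no WWM is introduced here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="italian-medical-bert",   # placeholder output directory
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```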