Closed · jbellic closed this issue 3 years ago
Hi, thanks for the BERT models. Is there any chance of seeing an uncased version of convbert-base-turkish anytime?
BR.
Hi @jbellic,
that's a good question. I do have TPU access for training an uncased version. However, after some lessons learnt with uncased vocabs, I would disable accent stripping and create a new vocab: the uncased vocab of the original BERT models used accent stripping, and I'm firmly convinced that this option harms performance on downstream tasks. For Turkish in particular, accent stripping collapses characters such as ş, ğ and ü into s, g and u, destroying distinctions the model needs.
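To make this concrete, here's a minimal sketch of what the strip_accents flag does to Turkish text, assuming the Hugging Face transformers library (not named above) and its BasicTokenizer:

```python
from transformers.models.bert.tokenization_bert import BasicTokenizer

# Two basic tokenizers that differ only in the strip_accents flag.
strip = BasicTokenizer(do_lower_case=True, strip_accents=True)
keep = BasicTokenizer(do_lower_case=True, strip_accents=False)

text = "Şeker ağaç ÜZÜM"
print(strip.tokenize(text))  # ['seker', 'agac', 'uzum']  <- ş, ğ, ç, ü are lost
print(keep.tokenize(text))   # ['şeker', 'ağaç', 'üzüm'] <- diacritics survive
```

This runs before WordPiece, so with stripping enabled the model never even sees the original characters.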
I'll keep you up to date in this issue on the progress of an uncased ConvBERT model :)
Update: the vocab of the uncased BERT model was actually created with no accent stripping option. However, the BERT pretraining data used the accent stripping option, so the vocab and the pretraining preprocessing did not match.
I'm going to train an uncased ConvBERT model on the large mC4 corpus now - with no accent stripping.
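For reference, a quick way to check whether an existing vocab was built with accent stripping is to count tokens that contain Turkish-specific characters. A rough sketch (the model id is just an example, substitute the checkpoint you want to inspect):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dbmdz/convbert-base-turkish-cased")

# A vocab built with accent stripping would contain almost no
# tokens with Turkish-specific characters.
turkish_chars = set("çğıöşüÇĞİÖŞÜ")
hits = [t for t in tok.get_vocab() if turkish_chars & set(t)]
print(f"{len(hits)} of {tok.vocab_size} tokens contain Turkish characters")
```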
Yeah, tokenization can be very complicated 😅
The training of the uncased ConvBERT model (on the mC4 corpus) has completed. I'm currently running some evaluations on downstream tasks, and then I'll upload the model (it should be ready by the end of the week).
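Once uploaded, loading it should look like the sketch below; the model id is a placeholder (the final name isn't stated in this thread), and strip_accents=False is passed explicitly in case the tokenizer config doesn't carry the flag:

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder id: the final repository name wasn't announced yet.
model_id = "dbmdz/convbert-base-turkish-mc4-uncased"

# Pass the flags explicitly so the tokenizer matches the pretraining
# setup (lowercasing, but no accent stripping).
tokenizer = AutoTokenizer.from_pretrained(
    model_id, do_lower_case=True, strip_accents=False
)
model = AutoModel.from_pretrained(model_id)
```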
Thank you very much, Stefan! Great work.