Closed · jbellic closed this issue 3 years ago
Hi, thanks for the BERT models. Is there any chance of seeing an uncased version of convbert-base-turkish anytime?
BR.
Hi @jbellic,
that's a good question. I do have TPU access for training an uncased version. However, after some lessons learnt with uncased vocabs, I would disable accent stripping and create a new vocab: the uncased vocab of the original BERT models used accent stripping, and I'm firmly convinced that this option harms performance on downstream tasks. For Turkish in particular, accent stripping collapses characters such as ş, ğ and ü into s, g and u, destroying distinctions the model needs.
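To make this concrete, here's a minimal sketch of what the strip_accents flag does to Turkish text, assuming the Hugging Face transformers library (not named above) and its BasicTokenizer:

```python
from transformers.models.bert.tokenization_bert import BasicTokenizer

# Two basic tokenizers that differ only in the strip_accents flag.
strip = BasicTokenizer(do_lower_case=True, strip_accents=True)
keep = BasicTokenizer(do_lower_case=True, strip_accents=False)

text = "Şeker ağaç ÜZÜM"
print(strip.tokenize(text))  # ['seker', 'agac', 'uzum']  <- ş, ğ, ç, ü are lost
print(keep.tokenize(text))   # ['şeker', 'ağaç', 'üzüm'] <- diacritics survive
```

This runs before WordPiece, so with stripping enabled the model never even sees the original characters.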
I'll keep you up to date in this issue on the progress of an uncased ConvBERT model :)
Update: the vocab of the uncased BERT model was actually created with no accent stripping option. However, the BERT pretraining data used the accent stripping option, so the vocab and the pretraining preprocessing did not match.
I'm going to train an uncased ConvBERT model on the large mC4 corpus now - with no accent stripping.
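For reference, a quick way to check whether an existing vocab was built with accent stripping is to count tokens that contain Turkish-specific characters. A rough sketch (the model id is just an example, substitute the checkpoint you want to inspect):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dbmdz/convbert-base-turkish-cased")

# A vocab built with accent stripping would contain almost no
# tokens with Turkish-specific characters.
turkish_chars = set("çğıöşüÇĞİÖŞÜ")
hits = [t for t in tok.get_vocab() if turkish_chars & set(t)]
print(f"{len(hits)} of {tok.vocab_size} tokens contain Turkish characters")
```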
Yeah, tokenization can be very complicated 😅
The training of the uncased ConvBERT model (on the mC4 corpus) has completed. I'm currently running some evaluations on downstream tasks, and then I'll upload the model (it should be ready by the end of the week).
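Once uploaded, loading it should look like the sketch below; the model id is a placeholder (the final name isn't stated in this thread), and strip_accents=False is passed explicitly in case the tokenizer config doesn't carry the flag:

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder id: the final repository name wasn't announced yet.
model_id = "dbmdz/convbert-base-turkish-mc4-uncased"

# Pass the flags explicitly so the tokenizer matches the pretraining
# setup (lowercasing, but no accent stripping).
tokenizer = AutoTokenizer.from_pretrained(
    model_id, do_lower_case=True, strip_accents=False
)
model = AutoModel.from_pretrained(model_id)
```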
Thank you very much, Stefan! Great work.