Vocab.txt may be wrongly encoded?

Hey! Thanks for updating this model; the community was lacking something like this.

I am trying to use BERT for a NER task for document anonymization in Spanish. I managed to do so using bert-multilingual. Next, I wanted to try for better results with a version of BERT trained specifically on Spanish. I came across your model and trained a NER model for my task.

The problem arises when I try to tokenize a sentence, where I get strange results. For example, the sentence: " Datos del paciente. Nombre: María. Apellidos: Rico Pedroza. País: España."

When tokenized, it returns: ['Ċ', 'Datos', 'Ġdel', 'Ġpaciente', '.', 'Ċ', 'Nombre', ':', 'Ġ', 'ĠMarÃŃa', '.', 'Ċ', 'A', 'pell', 'idos', ':', 'ĠRico', 'ĠPedro', 'za', '.', 'Ċ', 'PaÃŃs', ':', 'ĠEspaÃ±a', '.']

After some searching, I found that the vocab.txt and merges files (which the tokenizer uses directly) are saved with the wrong encoding, which produces strange characters in place of typical Spanish characters. For example, 'mayoría' becomes "ĠmayorÃŃa": 2143 in vocab.json, and 'comunicación' appears as "ĠcomunicaciÃ³n": 2670.

I have tried changing the encoding from UTF-8 to ANSI and others, and the result never changes. This makes me think that maybe the vocab.json file was simply uploaded with the wrong encoding.

Just wanted to let you know and see if we can find a solution to be able to use SpanBERTa.

Hope to hear from you soon!

Regards.

Hi @bichomartiano,

Thank you for your interest in SpanBERTa. I also observed this when using the model. However, it doesn't really cause any trouble in practice, because when we use the same tokenizer to decode the output tokens, it returns the correct output text.

Here is the Colab notebook showing how I trained the tokenizer and pretrained SpanBERTa, and here is how I applied SpanBERTa to do NER. As you can see, even though the tokens' representations look inaccurate, this doesn't really affect the training process and the output text is correct.
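To illustrate why decoding recovers the text, here is a minimal sketch in plain Python. It assumes the tokenizer is a byte-level BPE in the GPT-2/RoBERTa style, where every UTF-8 byte is mapped to a printable character before merges are applied. The `bytes_to_unicode` table reproduces that mapping as GPT-2 defines it, and `decode_tokens` is a hypothetical helper written only for this example, not a library function. Under that assumption, 'Ġ' stands for a space, 'Ċ' for a newline, and 'ÃŃ' for the two bytes that encode 'í'.

```python
def bytes_to_unicode():
    """Byte -> printable character table, as defined for GPT-2 style byte-level BPE."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (space, newline, control bytes, ...) are shifted to 256+.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


# Inverse table: printable character -> original byte.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_tokens(tokens):
    """Hypothetical helper: map token characters back to bytes, then decode as UTF-8."""
    raw = bytes(BYTE_DECODER[ch] for ch in "".join(tokens))
    return raw.decode("utf-8")


# Tokens copied from the example in this thread.
tokens = ['Ċ', 'Datos', 'Ġdel', 'Ġpaciente', '.', 'Ċ', 'Nombre', ':', 'Ġ',
          'ĠMarÃŃa', '.', 'Ċ', 'A', 'pell', 'idos', ':', 'ĠRico', 'ĠPedro',
          'za', '.', 'Ċ', 'PaÃŃs', ':', 'ĠEspaÃ±a', '.']

print(repr(decode_tokens(["ĠmayorÃŃa"])))       # ' mayoría'
print(repr(decode_tokens(["ĠcomunicaciÃ³n"])))  # ' comunicación'
print(decode_tokens(tokens))
```

If that assumption holds, the odd-looking entries in vocab.json are the expected byte-level form rather than a sign of a corrupted file, and re-saving the file with another encoding would not change them.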

For real use cases, I suggest fine-tuning Beto, which is also a BERT model for Spanish but was pre-trained for longer on a different large corpus. I observed that Beto achieved slightly better scores on downstream tasks.

chriskhanhtran / spanish-bert
