chriskhanhtran / spanish-bert

Pretrain RoBERTa for Spanish from scratch and perform NER on Spanish documents

Vocab.txt may be wrongly encoded? #1

Open martianbug opened 4 years ago

martianbug commented 4 years ago

Hey! Thanks for releasing this model, the community was lacking something like this.

I am trying to use BERT for a NER task for document anonymization in Spanish. I managed to do this with bert-multilingual. Then it was time to try for better results using a version of BERT trained specifically on Spanish. I came across your model and trained a NER model for my task.

The problem arises when I try to tokenize a sentence: I get strange results. For example, the sentence: " Datos del paciente. Nombre: María. Apellidos: Rico Pedroza. País: España."

When tokenized, returns: ['Ċ', 'Datos', 'Ġdel', 'Ġpaciente', '.', 'Ċ', 'Nombre', ':', 'Ġ', 'ĠMarÃŃa', '.', 'Ċ', 'A', 'pell', 'idos', ':', 'ĠRico', 'ĠPedro', 'za', '.', 'Ċ', 'PaÃŃs', ':', 'ĠEspaña', '.']
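For reference, this is a minimal sketch of how I'm producing those tokens (the model path below is only a placeholder for wherever the SpanBERTa checkpoint and tokenizer files live on my machine):

```python
# Minimal sketch of how the tokens above were produced.
# "path/to/spanberta" is a placeholder for the local checkpoint/tokenizer directory.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/spanberta")  # placeholder path

text = " Datos del paciente. Nombre: María. Apellidos: Rico Pedroza. País: España."
print(tokenizer.tokenize(text))
# tokens come out looking like 'ĠMarÃŃa', 'ĠPaÃŃs', ...
```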

After some searching, I have found that the vocab.json and merges files (which are used directly by the tokenizer) are saved with the wrong encoding, which produces strange characters in place of typical Spanish characters. For example, 'mayoría' becomes "ĠmayorÃŃa": 2143 in the vocab.json, and 'comunicación' appears as "ĠcomunicaciÃ³n": 2670.

I have tried changing the encoding from UTF-8 to ANSI and others, and nothing changes. This makes me think that maybe the vocab.json file was simply uploaded with the wrong encoding.
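Roughly, this is the check I've been doing (the file path is a placeholder for my local copy of the repo files, and the list of encodings is just the ones I tried); the entries look the same no matter which encoding I read the file with:

```python
# Rough sketch of the encoding check: read vocab.json with a few encodings and
# inspect the same entries. "vocab.json" is a placeholder for the local file path.
import json

for enc in ("utf-8", "latin-1", "cp1252"):
    try:
        with open("vocab.json", encoding=enc) as f:
            vocab = json.load(f)
        sample = [(tok, idx) for tok, idx in vocab.items() if "mayor" in tok]
        print(enc, sample[:5])
    except (UnicodeDecodeError, json.JSONDecodeError) as e:
        print(enc, "failed:", e)
```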

Just wanted to let you know and see if we can find a solution to be able to use SpanBERTa.

Hope to hear from you soon!

Regards.

chriskhanhtran commented 4 years ago

Hi @bichomartiano,

Thank you for your interest in SpanBERTa. I also observed this when using the model. However, it doesn't cause any real trouble in practice, because when we use the same tokenizer to decode the output tokens, it returns the correct output text.
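For example, something along these lines (again, the model path is just a placeholder for the SpanBERTa checkpoint) shows the round trip coming back clean even though the individual tokens look odd:

```python
# Sketch of the encode/decode round trip: the intermediate tokens look strange
# because of the byte-level BPE alphabet, but decoding restores the original text.
# "path/to/spanberta" is a placeholder for the SpanBERTa checkpoint directory.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/spanberta")  # placeholder path

text = "Nombre: María. País: España."
ids = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(ids))              # tokens like 'ĠMarÃŃa'
print(tokenizer.decode(ids, skip_special_tokens=True))   # back to 'Nombre: María. País: España.'
```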

Here is the Colab notebook showing how I trained the tokenizer and pretrained SpanBERTa, and here is how I applied SpanBERTa to NER. As you can see, even though the tokens' representations don't look right, this doesn't affect the training process and the output text is correct.

For real use cases, I suggest fine-tuning Beto, which is also a BERT for Spanish but was pre-trained longer on a different large corpus. I observed that Beto achieves slightly better scores on downstream tasks.
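If you want to try that route, something like this should get you started (I believe the Hub checkpoint is dccuchile/bert-base-spanish-wwm-cased, and num_labels depends on your tag set, so treat both as assumptions to adapt):

```python
# Hedged sketch for setting up Beto for token classification (NER).
# The checkpoint name and label count are assumptions to adjust for your setup.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "dccuchile/bert-base-spanish-wwm-cased"  # Beto on the HF Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)  # e.g. CoNLL-style NER tags
```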