ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0

Broken encoding of vocab.json #24

Closed: kniazevgeny closed this issue 3 years ago

kniazevgeny commented 3 years ago

I was fine-tuning ruGPT3-Medium for QA, but there were some problems with training. After training for 50 epochs on a small dataset (just to make sure I can fine-tune the model at all), I found that the only answers it produced were in English (screenshot attached). So I looked at what was in vocab.json and found lots of broken (?) symbols with strange encoding (screenshot attached). I tried changing the encoding to Windows-1252, Windows-1251 and ISO-8859-5, but that didn't help. Can you please explain what I did wrong, or fix vocab.json?
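A quick way to rule out actual file corruption is to parse vocab.json as UTF-8 JSON and look at a few entries directly. This is only an illustrative sketch; the local file path is an assumption.

```python
import json

# Path is an assumption: point it at the vocab.json shipped with your checkpoint.
with open("vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

print(len(vocab))                 # vocabulary size
for token, idx in list(vocab.items())[:10]:
    print(repr(token), idx)       # Cyrillic entries will look garbled; see the end of the thread
```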

fen0s commented 3 years ago

That's something that weirded me out, too. If you try to train it, you may also look at the features file, which also contains a lot of broken encoding. I mentioned this in my issue as well. It seems to be somehow readable for the generator, though? Are you sure you set the encoding of your output to UTF-8?

kniazevgeny commented 3 years ago

It could be readable for the generator, but in that case there would have to be some preprocessing of my training data, which would be strange. As for the encoding of the file, yes, it's set to UTF-8 (you can see it in the bottom right corner of the screenshot). By the way, there's a link to vocab.json above. And yes, I could look at the features file, but I'm not really familiar with it, so if you gave me some instructions, I'd try.

kniazevgeny commented 3 years ago

I reviewed the source code and realized that I had been using a custom tokenizer. I then replaced it with the tokenizer from the transformers library, and my human-readable answers got converted into text similar to the one shown above. Now I'm testing whether it works (screenshot attached).
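Roughly what the switch looks like with the transformers tokenizer. This is a sketch rather than the exact code from the thread, and the model id sberbank-ai/rugpt3medium_based_on_gpt2 (or a local checkpoint directory) is an assumption.

```python
from transformers import GPT2Tokenizer

# Model id is an assumption -- pass a local checkpoint directory instead if needed.
tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3medium_based_on_gpt2")

text = "Привет, как дела?"
ids = tokenizer.encode(text)

print(ids)                                   # token ids
print(tokenizer.convert_ids_to_tokens(ids))  # byte-level pieces, look like mojibake
print(tokenizer.decode(ids))                 # decodes back to readable Russian
```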

fen0s commented 3 years ago

Yeah, I have no idea then. You might look at how the data is preprocessed for GPT-3 in the readme file, if you didn't preprocess your dataset. GPT-2 also needs the dataset preprocessed, though that's not listed in the readme. Seems like a really raw model implementation, tbh.

kniazevgeny commented 3 years ago

I've just tested GPT2Tokenizer from transformers, and now I see that the problem was in my custom tokenizer. So vocab.json seems to be fine, and its encoding is all right.
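For future readers: the "strange" symbols are expected with a GPT-2-style byte-level BPE vocabulary. Tokens are stored as raw bytes remapped to printable unicode characters, so Cyrillic entries in vocab.json look like mojibake even though nothing is corrupted, and the tokenizer turns them back into readable text. A minimal round-trip sketch (the model id is again an assumption):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3medium_based_on_gpt2")  # assumed id

text = "Кто написал «Войну и мир»?"
tokens = tokenizer.tokenize(text)                      # byte-level pieces that look garbled
restored = tokenizer.convert_tokens_to_string(tokens)  # map the bytes back to a UTF-8 string

print(tokens)
print(restored)
print(restored == text)   # True: byte-level BPE is lossless, so the round trip is exact
```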