ConnorJL / GPT2

An implementation of training for GPT2, supports TPUs
MIT License

Input Chinese, the prediction is Japanese. #5

Open dpyneo opened 5 years ago

dpyneo commented 5 years ago

Hello Connor Leahy. Thank you very much for your excellent model project; it's very cool, and I'm even happier and more excited that there will be more open-source models in the future. But I'm a Chinese user: when I use the PrettyBig model and input Chinese, the results come out predicted in Japanese. Could a Chinese model be supported?

ConnorJL commented 5 years ago

Hi there. These models were trained primarily on English text, so I have no idea how well or poorly they handle other languages. By default, the model should be able to handle any text that can be encoded by the BPE encoder. You will probably have to retrain the model on Chinese text in order to get better results.
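
To illustrate the "any text that can be encoded" point, a minimal plain-Python sketch (not the repo's encoder.py): byte-level BPE ultimately falls back to raw UTF-8 bytes, so every Unicode string has some encoding, even for scripts the vocabulary learned few merges for.

```python
# Minimal sketch of the byte fallback behind byte-level BPE: any Unicode string
# reduces to UTF-8 bytes, so it is always encodable, just not always compactly.
for text in ["hello", "猫"]:
    raw = text.encode("utf-8")
    print(f"{text!r} -> {len(raw)} byte(s): {list(raw)}")
# 'hello' -> 5 byte(s): [104, 101, 108, 108, 111]
# '猫' -> 3 byte(s): [231, 140, 171]
```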

dpyneo commented 5 years ago

Thank you very much for taking time out of your busy schedule to reply. May I ask whether this needs to be retrained on Chinese from scratch, or whether, like BERT, it only needs fine-tuning? If fine-tuning is enough, how would I do that? I don't know whether there are plans to release models in other languages in the future; if you get to Chinese at some point, that would be a great contribution. Thank you again for sharing, it's great. Here are some of the predicted results that I forgot to post last time.

======================================== SAMPLE 0 ======================================== 我是厉飞雨中団集道占标情乎不外弁的空何的主热者余主似可能性。

丆会者が移要、成努体余礍等素傷名省冔容以丐会の使用を成功する人が可能だ。

为了一接経懑用体的会段階の構玖は、自分の展貌の演由に意员した他の孆地地を感しており、段階の濑えて最近なものが今回。

如你的能性的項分

私のミッケージに答及した言い資数指定

世界中団に沢項決気师ゎ

vochicong commented 5 years ago

The text "SAMPLE 0" generated above is not completely Japanese, but a random mix of Chinese and Japanese.

voidism commented 5 years ago
  1. The original GPT2 model also has this problem; I have tried it.
  2. GPT2 uses byte pair encoding, which means that many Chinese characters are still represented by three bytes rather than a single token. Because it saw little Chinese text when building the vocabulary, it cannot combine those bytes into true word units (see the sketch after this list). So you need to build a new model with a new vocabulary for Chinese from scratch, not just fine-tune this English model.
  3. I think the reason GPT2 can be so successful is that it used a large dataset from the web (40GB), with humans checking the text quality (via Reddit karma). But for Chinese, there are no publicly available datasets as big as the one GPT2 used.
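
To make point 2 concrete, a hedged sketch using the Hugging Face transformers package as a stand-in for this repo's encoder.py (presumably both wrap the same OpenAI byte-level BPE vocabulary): a short Chinese sentence costs noticeably more BPE tokens than it has characters, while an English sentence of similar meaning maps to roughly one token per word.

```python
# Sketch: compare GPT-2 BPE token counts for an English vs. a Chinese sentence.
# Assumes the Hugging Face `transformers` package is installed; it loads the
# OpenAI byte-level BPE vocabulary that GPT-2 implementations use.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

for text in ["The cat chases the mouse", "猫在追老鼠"]:  # same meaning, two languages
    ids = tok.encode(text)
    print(f"{text!r}: {len(text)} characters -> {len(ids)} BPE tokens")
```
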
vochicong commented 5 years ago

I've tried generating Japanese text. To my surprise, the model output is almost entirely valid Japanese characters (no Chinese characters mixed in), though the words and sentences are very strange.

!python3 main.py --model PrettyBig_colab.json --top_k 40 --predict_text "猫はネズミを"

...
======================================== SAMPLE 0 ========================================

猫はネズミをアフリです」とお思います。した「言も市気についてもとうかった」にも大錭際のための姀別の事実に寄りたりなのでいただけるだろう。「言だけ」とはなぜんだけないので、これだけではありません。

楽しめてしさが、実際に対しても合まで統制よう言くなりがば、その中の機胞が可能です。ふっても、どうような未杯件が高いので自分だったものだろうか?いっかくわけようようで、つからありません。ほどの誤くちは良い改持していますが、それに世界中の中にはないことができたけどれば、いけれど、ひい、しかしも、いますは種りだが止れからどく名じくなりという人もようができるか。自由を自己したことだが、どうのかでにそれを徴期しているようにしておりません。

言を說明したとこと

そなし、合わな形生のあった。、いまま、合わな形生のあった。それには、というものが、そこで「ややややっという」というが、どうかった感じたらしか無事な形生のなど、そのもセキュリティ(あらようにどういて)。このあったので、そうなる攻省でしょうとした。そうしても、読者のような提価の他も、それにあれたい読者は、これや件いのなっのもときっていいいけるようになりました。そんなど、それにもなくてがず、というは「うもぉっど」っています。

ろしくんう値だったよね。しかした言明だけでは、いまかは感じた。そうしても、それには「以外の言しには自分ない」を微計すようになら、いままないとこれはやっという本私としてもすれば「感じためらない」むしました。そうしたものかで、しかした言を調べばための前徶にするかもしれば、いままだ態度はそれがどうだけませについても、それにもそうなど、どうまであったことはないか、という感じたけないのです。しかし、これや、感じたけてもどうだ。

それかったものが、しかしょうだないよね。つをい、それにも人があり、いら、ややや぀人がだかし、どうかった�

================================================================================
dpyneo commented 5 years ago

First of all, thanks to voidism and vochicong for their answers.

I guess it's possible that Japanese occupies more of the training corpus than Chinese; the 40GB of web text was probably mixed with some Japanese alongside the English, which leads to predictions coming out in Japanese. The model learns the representational semantics of English, not of Japanese, which leads to the same problem voidism describes. In fact, the model only really learns English and the fixed grammatical patterns of English web pages; for other languages, it can only follow whatever content it happened to see.

ConnorJL commented 5 years ago

Thanks for the interesting comments, everyone. I think what it comes down to is that the model was trained primarily on English text, so it naturally struggles with very different languages such as Chinese and Japanese. To get better performance, one would probably need to collect a proper Chinese dataset and maybe even create a new BPE vocabulary that focuses on non-English languages. I don't have any plans to do so, since I'm not qualified to judge what good Chinese text even looks like, but it would be a cool experiment for others to try.

Cyvadra commented 4 years ago

I've got a 20 GB Chinese text file on hand but no idea how to build the BPE encoder. I did run the script, but its format doesn't seem to fit GPT2... That might not be the real problem, though. I think the key to GPT2 is not its code but the idea of modeling every single word, which captures much more structured information than LSTM-like algorithms that end up as mere language repeaters. The other key is the huge amount of data needed, plus very expensive computation, and since these two will never be available in China (Hail CCF), don't bother with it as long as your time is precious.

ConnorJL commented 4 years ago

Google actually has a system you can use to build a BPE encoder: https://github.com/google/sentencepiece

It's not exactly the same as OpenAI's, so you'd need to adapt encoder.py to use the new model, but in theory it should work just fine. I think the main insight from GPT2 is the scaled-up transformer architecture, but BPE surely adds a lot as well.
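
For reference, a minimal sketch of training such a vocabulary with sentencepiece; the corpus path, output prefix, and vocab size below are placeholders, and encoder.py would still need to be adapted to load the resulting model instead of OpenAI's BPE files.

```python
# Hedged sketch: train a Chinese BPE vocabulary with sentencepiece
# (https://github.com/google/sentencepiece). File names and sizes are
# placeholders; this is not a drop-in replacement for encoder.py.
import sentencepiece as spm

# Train a BPE model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="chinese_corpus.txt",    # hypothetical corpus path
    model_prefix="zh_bpe",         # writes zh_bpe.model / zh_bpe.vocab
    vocab_size=50000,
    model_type="bpe",
    character_coverage=0.9995,     # sentencepiece's suggested setting for CJK
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor()
sp.load("zh_bpe.model")
print(sp.encode_as_pieces("猫在追老鼠"))
print(sp.encode_as_ids("猫在追老鼠"))
```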