allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models

Train perplexity is very low but test perplexity is pretty high for the Thai language #202

Open SuphanutN opened 5 years ago

SuphanutN commented 5 years ago

Hello everyone,

I trained an ELMo model for the Thai language on Wikipedia for around 3 days (200,000 batches).

Since Thai does not use whitespace to separate words, we use newmm tokenization from PyThaiNLP to split each input sentence into words separated by whitespace, the same as English.
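For reference, here is a minimal sketch of that preprocessing step, assuming PyThaiNLP's `word_tokenize` with the `newmm` engine; the actual pipeline and file handling are not shown in the issue, so this function is illustrative only:

```python
from pythainlp.tokenize import word_tokenize

def to_space_separated(sentence: str) -> str:
    # Segment a raw Thai sentence with the newmm engine and re-join the
    # words with spaces, so bilm-tf can consume the text the same way it
    # consumes whitespace-tokenized English.
    tokens = [t for t in word_tokenize(sentence, engine="newmm") if t.strip()]
    return " ".join(tokens)

print(to_space_separated("ฉันรักภาษาไทย"))
# e.g. "ฉัน รัก ภาษาไทย" (the exact segmentation depends on the dictionary)
```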

My problem is the gap between the training and test perplexity on my Wikipedia dataset. The training perplexity is very confusing: it starts at around 250, drops quickly to 10 by batch 3,000, and stabilizes around 3-5 by batch 30,000.

[Screenshots: training perplexity curves]

But after evaluating the model on the test data, the test perplexity is around 200-350, with an average of about 250.

[Screenshot: test perplexity output]
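For reference on how these numbers relate, perplexity is the exponential of the average per-token cross-entropy (assuming natural-log losses, as TensorFlow's cross-entropy ops use), so the reported values map back to losses; a minimal sketch:

```python
import math

def perplexity(avg_loss_nats: float) -> float:
    # Perplexity = exp(average per-token negative log-likelihood).
    return math.exp(avg_loss_nats)

# A training perplexity of ~3-5 corresponds to an average loss of
# about 1.1-1.6 nats per token, while a test perplexity of ~250
# corresponds to about 5.5 nats per token.
print(math.log(3), math.log(5))  # ~1.10, ~1.61
print(math.log(250))             # ~5.52
```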

My options file is:

{"bidirectional": true, "char_cnn": {"activation": "relu", "embedding": {"dim": 16}, "filters": [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]], "max_characters_per_token": 50, "n_characters": 261, "n_highway": 2}, "dropout": 0.5, "lstm": {"cell_clip": 3, "dim": 4096, "n_layers": 2, "proj_clip": 3, "projection_dim": 512, "use_skip_connections": true}, "all_clip_norm_val": 10.0, "n_epochs": 5, "n_train_tokens": 88248455, "batch_size": 16, "n_tokens_vocab": 628280, "unroll_steps": 20, "n_negative_samples_batch": 8192}(elmo)

I have tried changing the dropout and the batch size to see whether they affect the result, but these changes do not solve the problem.

Do you guys have any idea how to fix this problem?