allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models

Train perplexity is very low but test perplexity is pretty high for the Thai language #202

Open SuphanutN opened 5 years ago

SuphanutN commented 5 years ago

Hello everyone,

I trained an ELMo model for the Thai language on Wikipedia for around 3 days (200,000 batches).

Since Thai does not use whitespace to separate words, we use newmm tokenization from PyThaiNLP to split each input sentence into words separated by whitespace, the same as English.
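For reference, here is a minimal sketch of that preprocessing step, assuming PyThaiNLP's `word_tokenize` with the `newmm` engine; the actual pipeline and file handling are not shown in the issue, so this function is illustrative only:

```python
from pythainlp.tokenize import word_tokenize

def to_space_separated(sentence: str) -> str:
    # Segment a raw Thai sentence with the newmm engine and re-join the
    # words with spaces, so bilm-tf can consume the text the same way it
    # consumes whitespace-tokenized English.
    tokens = [t for t in word_tokenize(sentence, engine="newmm") if t.strip()]
    return " ".join(tokens)

print(to_space_separated("ฉันรักภาษาไทย"))
# e.g. "ฉัน รัก ภาษาไทย" (the exact segmentation depends on the dictionary)
```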

My problem is the gap between the training and test perplexity on my Wikipedia dataset. The training perplexity is very confusing: it starts at around 250, drops quickly to 10 by batch 3,000, and stabilizes around 3-5 by batch 30,000.

[Screenshots: training perplexity curves]

But after evaluating the model on the test data, the test perplexity is around 200-350, with an average of about 250.

[Screenshot: test perplexity output]
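For reference on how these numbers relate, perplexity is the exponential of the average per-token cross-entropy (assuming natural-log losses, as TensorFlow's cross-entropy ops use), so the reported values map back to losses; a minimal sketch:

```python
import math

def perplexity(avg_loss_nats: float) -> float:
    # Perplexity = exp(average per-token negative log-likelihood).
    return math.exp(avg_loss_nats)

# A training perplexity of ~3-5 corresponds to an average loss of
# about 1.1-1.6 nats per token, while a test perplexity of ~250
# corresponds to about 5.5 nats per token.
print(math.log(3), math.log(5))  # ~1.10, ~1.61
print(math.log(250))             # ~5.52
```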

My options file is:

{"bidirectional": true, "char_cnn": {"activation": "relu", "embedding": {"dim": 16}, "filters": [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]], "max_characters_per_token": 50, "n_characters": 261, "n_highway": 2}, "dropout": 0.5, "lstm": {"cell_clip": 3, "dim": 4096, "n_layers": 2, "proj_clip": 3, "projection_dim": 512, "use_skip_connections": true}, "all_clip_norm_val": 10.0, "n_epochs": 5, "n_train_tokens": 88248455, "batch_size": 16, "n_tokens_vocab": 628280, "unroll_steps": 20, "n_negative_samples_batch": 8192}(elmo)

I have tried changing the dropout and the batch size to see whether they affect the result, but these changes do not solve the problem.

Do you guys have any idea how to fix this problem?