allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models

Perplexity returns very random values even after warmup #189

Closed · ykiprov closed 5 years ago

ykiprov commented 5 years ago

I'm trying to calculate perplexity using the model checkpoint provided here: https://github.com/allenai/bilm-tf#can-you-provide-the-tensorflow-checkpoint-from-training

But it looks like the returned numbers are too large and random.

This is the code I run (based on the tests in test_training.py):

from bilm.data import BidirectionalLMDataset
from bilm.training import load_vocab, load_options_latest_checkpoint, test

if __name__ == "__main__":
    # Load the training options and the latest checkpoint from the folder.
    data_folder = "/path/to/checkpoint/folder/"
    _options, _ckpt_file = load_options_latest_checkpoint(data_folder)
    # 50 is the maximum number of characters per token.
    _vocab = load_vocab(data_folder + 'vocab-2016-09-10.txt', 50)
    # Test data: one whitespace-tokenized sentence per line.
    prefix = "/path/to/test.txt"
    _data = BidirectionalLMDataset(prefix, _vocab, test=True)
    _perplexity = test(_options, _ckpt_file, _data, batch_size=1)

My test file:

i like cookies .
my hands are dirty .
i like cookies .
how about a nice steak ?
leaves are falling in september .
my name is john snow .
i like chicken soup .
i want to play in the snow .
i like cookies .

And I'm getting batch perplexity values between 2 and 20,000 (on a non-synthetic test set I get values in the millions).

Is this normal? Does it make sense to use the log of that value?

I'm running it on Ubuntu, on CPU, with the latest code from master.
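For reference, the log of a perplexity is the average negative log-likelihood per token (in nats), so it is often easier to compare across runs. A minimal sketch with hypothetical perplexity values (not bilm-tf output):

import math

# log(perplexity) = average negative log-likelihood, in nats per token
for ppl in (2.0, 150.0, 20000.0):
    print(f"perplexity {ppl:>8.1f}  ->  {math.log(ppl):5.2f} nats/token")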

ykiprov commented 5 years ago

Sorry, I just found out these values are fine. It bothered me how much they vary for the same sentence, but the average perplexity over all the words in a sentence doesn't change much.
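The stability of the average follows from how perplexity aggregates: corpus-level perplexity is the exponential of the token-weighted mean negative log-likelihood, so individual batch perplexities can swing widely while the weighted average in log space stays flat. A minimal sketch, assuming hypothetical per-batch perplexities and token counts (not bilm-tf internals):

import math

# Per-batch perplexities and how many tokens each batch contained
# (hypothetical values for illustration).
batch_ppls = [2.0, 350.0, 20000.0]
batch_tokens = [5, 6, 4]

# Sum the total negative log-likelihood over all tokens, then
# exponentiate the per-token mean. Averaging raw batch perplexities
# directly would NOT give the corpus-level perplexity.
total_nll = sum(n * math.log(p) for p, n in zip(batch_ppls, batch_tokens))
total_tokens = sum(batch_tokens)
corpus_ppl = math.exp(total_nll / total_tokens)
print(f"corpus perplexity: {corpus_ppl:.1f}")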