cpllab / lm-zoo

Easy black-box access to state-of-the-art language models
https://cpllab.github.io/lm-zoo/
MIT License

Implausibly high surprisal for </s> in ngram model #56

Open rlevy opened 4 years ago

rlevy commented 4 years ago

This seems wrong: why do we get such a huge surprisal for sentence-end after a period? Input file was:

This is a short sentence.

Command & output:

$ lm-zoo get-surprisals ngram ~/tmp/sentences.txt
reading /opt/srilm/checkpoint/model.lm in binary format
sentence_id token_id    token   surprisal
1   1   this    5.29354
1   2   is  3.1117
1   3   a   2.92768
1   4   short   9.45191
1   5   sentence    12.0459
1   6   .   3.6674900000000004
1   7   </s>    28.1537
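
To put the `</s>` value in perspective, here is a minimal sketch that converts the surprisals above back into probabilities. It assumes surprisal is a negative log probability; the log base (`base=2.0`, i.e. bits) is an assumption, since the output does not say which base lm-zoo uses. `parse_rows` and `surprisal_to_prob` are hypothetical helper names, not part of lm-zoo.

```python
# Rows copied from the ngram output above (header and log line dropped).
NGRAM_OUTPUT = """\
1 1 this 5.29354
1 2 is 3.1117
1 3 a 2.92768
1 4 short 9.45191
1 5 sentence 12.0459
1 6 . 3.6674900000000004
1 7 </s> 28.1537
"""


def parse_rows(text):
    """Parse whitespace-separated rows into (sentence_id, token_id, token, surprisal)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        rows.append((int(parts[0]), int(parts[1]), parts[2], float(parts[3])))
    return rows


def surprisal_to_prob(surprisal, base=2.0):
    """Invert surprisal = -log_base(p). The base is an assumption, not confirmed."""
    return base ** (-surprisal)


rows = parse_rows(NGRAM_OUTPUT)
for _, _, token, s in rows:
    print(f"{token:>10}  surprisal={s:8.4f}  p={surprisal_to_prob(s):.3e}")
```

Under the base-2 assumption, the sentence-end token after a period comes out far less probable than any other token in the sentence, which is the implausible part.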

Doesn't happen for GRNN (the -0.0 is a tiny bit funny but probably not worth worrying about):

$ lm-zoo get-surprisals GRNN ~/tmp/sentences.txt
sentence_id token_id    token   surprisal
1   1   This    0.0
1   2   is  1.7249029999999999
1   3   a   1.4204510000000001
1   4   short   8.294603
1   5   sentence    10.343164
1   6   .   3.59838
1   7   <eos>   -0.0
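
One possible source of the `-0.0` is plain floating-point negation: negating a log probability of exactly `0.0` yields negative zero. The sketch below only illustrates that sign behavior. Whether the GRNN wrapper computes surprisal this way is an assumption, and it does not explain why the value is zero in the first place. `surprisal` is a hypothetical helper, not lm-zoo code.

```python
import math


def surprisal(prob, base=2.0):
    """Surprisal as a negated log probability (assumed definition and base)."""
    return -math.log(prob, base)


s = surprisal(1.0)
print(s)                       # -0.0
print(s == 0.0)                # True: negative zero compares equal to zero
print(math.copysign(1.0, s))   # -1.0: only the sign bit differs

# Adding positive zero is one way to normalize the sign for display.
normalized = s + 0.0
print(normalized)              # 0.0
```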

bnicenboim commented 3 years ago

Isn't it strange that "This" has a surprisal of 0.0 as well? @rlevy, I haven't seen any reaction in the issues or the chat (https://gitter.im/lm-zoo/community). Is this project still alive?