Closed: zplizzi closed this issue 5 years ago
@matt-peters I presume you used the spaCy tokenizer?
The ELMo model was trained on a corpus tokenized with the Moses tokenizer. In practice I have used both the Moses tokenizer and spaCy when tokenizing new text. There might be a small difference when switching tokenizers, although in many cases it will be negligible.
Looking at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl, the Moses tokenizer does the following by default:
```perl
$text =~ s/\&/\&amp;/g;   # escape escape
$text =~ s/\|/\&#124;/g;  # factor separator
$text =~ s/\</\&lt;/g;    # xml
$text =~ s/\>/\&gt;/g;    # xml
$text =~ s/\'/\&apos;/g;  # xml
$text =~ s/\"/\&quot;/g;  # xml
$text =~ s/\[/\&#91;/g;   # syntax non-terminal
$text =~ s/\]/\&#93;/g;   # syntax non-terminal
```
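For illustration, those default substitutions can be mimicked with a small Python sketch (a rough re-implementation of the rules above, not the actual Moses code):

```python
# Rough Python re-implementation of the Moses default escaping above
# (not the Moses code itself; order matters: '&' must be escaped first,
# otherwise the later entities would be double-escaped).
MOSES_ESCAPES = [
    ("&", "&amp;"),   # escape escape
    ("|", "&#124;"),  # factor separator
    ("<", "&lt;"),    # xml
    (">", "&gt;"),    # xml
    ("'", "&apos;"),  # xml
    ('"', "&quot;"),  # xml
    ("[", "&#91;"),   # syntax non-terminal
    ("]", "&#93;"),   # syntax non-terminal
]

def moses_escape(text):
    for char, entity in MOSES_ESCAPES:
        text = text.replace(char, entity)
    return text

print(moses_escape("he 'll & she won't"))
# -> he &apos;ll &amp; she won&apos;t
```

This is exactly the transformation that turns `'ll` into `&apos;ll` and `&` into `&amp;` in the comparisons below.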
Can you post some tokenised example sentences that contain such characters please?
Comparing the cosine similarity of the vectors for `are`, `'re` and `&apos;re`, and for `will`, `'ll` and `&apos;ll`, in simple example sentences, it seems that the provided ELMo model does not know about `&apos;`: the unescaped contractions are about 10 times more similar to the full form than the escaped ones.
This alone might still be explained by the common suffix and the shorter length of the character sequences the tokens do not have in common. To test further, I compared `and`, `&` and `&amp;` in a suitable context. Here too the unescaped token wins, though only with a 4.4 times higher similarity.
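The comparison above boils down to plain cosine similarity between token vectors. A minimal sketch, using small hypothetical stand-in vectors rather than real 1024-dimensional ELMo embeddings, just to show the measurement:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical stand-in vectors for tokens in context
# (real ELMo vectors are contextual and much higher-dimensional).
v_will = [0.9, 0.1, 0.3]
v_ll_unescaped = [0.8, 0.2, 0.35]  # "'ll"
v_ll_escaped = [0.1, 0.9, 0.05]    # "&apos;ll"

print(cosine(v_will, v_ll_unescaped))  # high: a form the model saw in training
print(cosine(v_will, v_ll_escaped))    # low: the escaped form behaves like an unknown token
```

The claim in the thread is that with real ELMo vectors, the first similarity comes out roughly an order of magnitude higher than the second.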
These findings suggest that `-no-escape` was used with the Moses tokeniser. @matt-peters Please confirm.
The model was trained on unescaped text, tokenized with the Moses tokenizer. So if you are using the `MosesTokenizer` from nltk, pass `escape=False`, e.g.

```python
from nltk.tokenize.moses import MosesTokenizer

tokenizer = MosesTokenizer()
tokens = tokenizer.tokenize(text, escape=False)
```
If you'd like more details about the training dataset construction, please see reference information for the Billion Word Benchmark, https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark
Thanks for the pointers. Much appreciated.
In the "Using ELMo interactively" section of the ELMo tutorial, it would be useful to mention which tokenizer should be used. Using a different tokenizer from the one ELMo was trained with will likely result in worse performance, so a user needs to know which tokenizer was used to train the model, which I can't readily find.