allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Tokenizer to use with ELMO #1933

Closed zplizzi closed 5 years ago

zplizzi commented 5 years ago

In the ELMo tutorial's "Using ELMo interactively" section, it would be useful to mention which tokenizer should be used. Using a different tokenizer from the one used to train ELMo will likely hurt performance, so a user needs to know which tokenizer the model was trained with, and I can't readily find that information.

schmmd commented 5 years ago

@matt-peters I presume you used the spaCy tokenizer?

matt-peters commented 5 years ago

The ELMo model was trained on a corpus tokenized with the Moses tokenizer. In practice I have used both the Moses tokenizer and spaCy when tokenizing new text; there may be a difference when switching tokenizers, although in many cases it will be small.
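To see why switching tokenizers can shift results, here is a toy illustration of how the two tokenizers split contractions differently (these regexes are simplified stand-ins, not the real Moses or spaCy implementations, and they ignore punctuation handling entirely):

```python
import re

def moses_like(text):
    # Toy approximation of Moses-style contraction splitting:
    # the apostrophe stays with the suffix ("don 't").
    return re.sub(r"(\w)'(\w)", r"\1 '\2", text).split()

def spacy_like(text):
    # Toy approximation of spaCy-style splitting for "n't":
    # the "n" moves with the suffix ("do n't").
    return re.sub(r"n't\b", " n't", text).split()

print(moses_like("I don't know"))   # ['I', 'don', "'t", 'know']
print(spacy_like("I don't know"))   # ['I', 'do', "n't", 'know']
```

Because ELMo's character-level inputs see `don 't` versus `do n't` as different token sequences, embeddings for the same surface text can diverge between the two tokenizers.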

jowagner commented 5 years ago

Looking at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl, the Moses tokenizer does the following by default:

    $text =~ s/\&/\&amp;/g;   # escape escape
    $text =~ s/\|/\&#124;/g;  # factor separator
    $text =~ s/\</\&lt;/g;    # xml
    $text =~ s/\>/\&gt;/g;    # xml
    $text =~ s/\'/\&apos;/g;  # xml
    $text =~ s/\"/\&quot;/g;  # xml
    $text =~ s/\[/\&#91;/g;   # syntax non-terminal
    $text =~ s/\]/\&#93;/g;   # syntax non-terminal

Can you post some tokenised example sentences that contain such characters please?
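For reference, the Perl substitutions above amount to a fixed replacement table. A minimal Python port (the order matters: `&` must be escaped first so the `&` inside newly introduced entities is not escaped again):

```python
# Sketch of the Moses default escaping rules quoted above,
# ported from tokenizer.perl to Python.
MOSES_ESCAPES = [
    ("&", "&amp;"),    # escape the escape character first
    ("|", "&#124;"),   # factor separator
    ("<", "&lt;"),     # xml
    (">", "&gt;"),     # xml
    ("'", "&apos;"),   # xml
    ('"', "&quot;"),   # xml
    ("[", "&#91;"),    # syntax non-terminal
    ("]", "&#93;"),    # syntax non-terminal
]

def moses_escape(text):
    for char, entity in MOSES_ESCAPES:
        text = text.replace(char, entity)
    return text

print(moses_escape("we 're"))   # we &apos;re
print(moses_escape("A & B"))    # A &amp; B
```

So with default settings, any text containing these characters reaches the model in escaped form unless `-no-escape` is passed.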

jowagner commented 5 years ago

Comparing cosine similarity of vectors for are, 're, &apos;re, will, 'll and &apos;ll in simple example sentences, it seems that the provided ELMo model does not know about &apos;: the unescaped contractions are about 10 times more similar to the full form than the escaped ones.

This alone might still be explained by the common suffix and the shorter character sequences the tokens do not share. To test further, I compared and, & and &amp; in a suitable context. Here too the unescaped token wins, though only with a 4.4 times higher similarity.
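For completeness, the measure used in these comparisons is plain cosine similarity between token vectors. A minimal stdlib sketch (the vectors here are made up for illustration, not real ELMo embeddings):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors:
    # dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings for illustration only.
v_are = [0.9, 0.1, 0.2]
v_apos_re = [0.2, 0.8, 0.5]
print(cosine(v_are, v_are))      # identical vectors give 1.0
print(cosine(v_are, v_apos_re))  # dissimilar vectors score lower
```

A higher score for the unescaped variant against the full form is what suggests the model never saw the escaped tokens during training.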

These findings suggest that -no-escape was used with the Moses tokeniser. @matt-peters Please confirm.

matt-peters commented 5 years ago

The model was trained on unescaped text, tokenized with the Moses tokenizer. So if you are using the MosesTokenizer from nltk, pass escape=False, e.g.

    from nltk.tokenize.moses import MosesTokenizer

    tokenizer = MosesTokenizer()
    tokens = tokenizer.tokenize(text, escape=False)

If you'd like more details about how the training dataset was constructed, please see the reference information for the Billion Word Benchmark: https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark

jowagner commented 5 years ago

Thanks for the pointers. Much appreciated.