carschno closed this issue 8 years ago
The generation is now done by the Vectorizer, applied by the MalletEmbeddingsAnnotator if PARAM_ANNOTATE_UNKNOWN_TOKENS
is set to true.
I think unknown words should by default all be annotated with the same random vector. The embeddings format we have elsethread has support for storing a special unknown vector.
An alternative would be to generate random vectors for unknown words on the fly such that the same word always gets the same vector... not annotating them at all doesn't seem to make much sense... does it?
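To make the "same word always gets the same vector" idea concrete, here is a minimal sketch. It is not code from this project: the class name UnknownWordVectors, the method vectorFor and the value range [-1, 1) are made up for illustration. It seeds java.util.Random with String.hashCode(), both of which are specified by the Java platform, so the result should not vary between runs or JVMs.

```java
import java.util.Random;

/** Hypothetical helper, not part of the project: per-word deterministic vectors. */
final class UnknownWordVectors {
    private UnknownWordVectors() {}

    /**
     * Returns a pseudo-random vector that depends only on the word and the
     * requested dimensionality, so repeated calls give identical results.
     */
    static float[] vectorFor(String word, int dimensions) {
        // Assumption: the word's hash code is good enough as a seed.
        // Words with colliding hash codes will share a vector.
        Random random = new Random(word.hashCode());
        float[] vector = new float[dimensions];
        for (int i = 0; i < dimensions; i++) {
            vector[i] = random.nextFloat() * 2f - 1f; // value in [-1, 1)
        }
        return vector;
    }
}
```

Nothing needs to be stored for unknown words with this approach; the trade-off is that the vectors carry no information beyond the identity of the word.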
The WordEmbeddingsAnnotator currently annotates only those tokens for which embeddings are available with a WordEmbedding annotation. Unknown tokens just don't receive any annotation. Generally, I see three options to handle unknown words.
With respect to the WordEmbeddingsAnnotator, I am thinking about introducing a parameter PARAM_UNKNOWN_WORDS_VECTOR that takes a float[] representing the vector with which unknown words should be annotated. If the parameter is not specified (i.e. null), unknown words are not annotated at all (option 1). If the parameter is an empty array, a random vector is generated once and every unknown word is annotated with that same vector (option 2).
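A rough sketch of how the null / empty array / explicit vector cases could be told apart. This is an illustration only, not the annotator's actual code: the class UnknownWordPolicy, its constructor arguments and the method vectorForUnknownWord are hypothetical names, and treating a non-empty array as a fixed vector of the embedding dimensionality is my assumption.

```java
import java.util.Optional;
import java.util.Random;

/** Hypothetical holder for the value of PARAM_UNKNOWN_WORDS_VECTOR. */
final class UnknownWordPolicy {
    /** null means: do not annotate unknown words (option 1). */
    private final float[] unknownVector;

    UnknownWordPolicy(float[] configured, int dimensions, Random random) {
        if (configured == null) {
            // Parameter not specified: unknown words stay unannotated.
            unknownVector = null;
        } else if (configured.length == 0) {
            // Empty array: draw one random vector and reuse it (option 2).
            unknownVector = new float[dimensions];
            for (int i = 0; i < dimensions; i++) {
                unknownVector[i] = random.nextFloat();
            }
        } else if (configured.length == dimensions) {
            // Assumption: a non-empty array is used as given.
            unknownVector = configured.clone();
        } else {
            throw new IllegalArgumentException(
                    "Expected " + dimensions + " values, got " + configured.length);
        }
    }

    /** Empty if unknown words should not be annotated, else a copy of the shared vector. */
    Optional<float[]> vectorForUnknownWord() {
        return Optional.ofNullable(unknownVector).map(v -> v.clone());
    }
}
```

The annotator would then only create an annotation for an unknown token when vectorForUnknownWord() is present.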