dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

WordEmbeddingsAnnotator: how to deal with unknown words? #960

Closed carschno closed 8 years ago

carschno commented 8 years ago

The WordEmbeddingsAnnotator currently adds a WordEmbedding annotation only to tokens for which embeddings are available. Unknown tokens receive no annotation at all. In general, I see three options for handling unknown words:

  1. Don't annotate unknown words (current status)
  2. Annotate each unknown word with a new random vector
  3. Annotate each unknown word with the same random vector

For the WordEmbeddingsAnnotator, I am thinking about introducing a parameter PARAM_UNKNOWN_WORDS_VECTOR that takes a float[] representing the vector with which unknown words should be annotated. If the parameter is not specified (i.e. null), unknown words are not annotated at all (option 1). If the parameter is an empty array, a random vector is generated once and every unknown word is annotated with that same vector (option 3).
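
To make the proposed parameter semantics concrete, here is a minimal sketch of how the three cases (parameter absent, empty array, explicit vector) could be resolved. It is not the DKPro Core implementation: the class `UnknownWordVectorSketch`, the method `resolveUnknownVector`, and the `dimensions` and `seed` arguments are hypothetical names introduced only for illustration, and the value range of the random vector is an arbitrary assumption.

```java
import java.util.Optional;
import java.util.Random;

// Hypothetical helper, not part of DKPro Core.
class UnknownWordVectorSketch {

    /**
     * Resolves the vector used for unknown words from the configured value.
     *
     * @param configured the value of the proposed parameter; may be null
     * @param dimensions size of the embedding vectors (assumed to be known)
     * @param seed       seed for the random generator (assumption, for reproducibility)
     * @return empty if unknown words should not be annotated, otherwise the
     *         single vector shared by all unknown words
     */
    public static Optional<float[]> resolveUnknownVector(float[] configured,
                                                         int dimensions, long seed) {
        if (configured == null) {
            // Parameter not specified: do not annotate unknown words.
            return Optional.empty();
        }
        if (configured.length == 0) {
            // Empty array: generate one random vector and reuse it for every unknown word.
            Random random = new Random(seed);
            float[] vector = new float[dimensions];
            for (int i = 0; i < dimensions; i++) {
                // Values in [-1, 1); the range is an arbitrary choice for this sketch.
                vector[i] = random.nextFloat() * 2 - 1;
            }
            return Optional.of(vector);
        }
        // Explicit vector: use a copy so later changes by the caller have no effect.
        return Optional.of(configured.clone());
    }
}
```

The annotator would call such a method once during initialization and attach the resulting vector, if present, to every token without an embedding.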

carschno commented 8 years ago

The generation is now done by the Vectorizer, applied by the MalletEmbeddingsAnnotator if PARAM_ANNOTATE_UNKNOWN_TOKENS is set to true.

reckart commented 8 years ago

I think unknown words should by default all be annotated with the same random vector. The embeddings format we have discussed in another thread supports storing a special unknown vector.

An alternative would be to generate random vectors for unknown words on the fly such that the same word always gets the same vector... not annotating them at all doesn't seem to make much sense... does it?
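
One way to get "the same word always gets the same vector" without storing anything is to seed the random generator from the word itself. The following is only a sketch under that assumption: the class `DeterministicUnknownVectors` and the method `vectorFor` are hypothetical, and using `String.hashCode()` as the seed is just one possible choice (distinct words with colliding hash codes would share a vector).

```java
import java.util.Random;

// Hypothetical helper, not part of DKPro Core.
class DeterministicUnknownVectors {

    /**
     * Generates a pseudo-random vector that depends only on the word and the
     * requested size, so repeated calls for the same word give the same result.
     */
    public static float[] vectorFor(String word, int dimensions) {
        // String.hashCode() is specified by the Java API, so the seed is stable
        // across runs and JVMs.
        Random random = new Random(word.hashCode());
        float[] vector = new float[dimensions];
        for (int i = 0; i < dimensions; i++) {
            // Values in [-1, 1); the range is an arbitrary choice for this sketch.
            vector[i] = random.nextFloat() * 2 - 1;
        }
        return vector;
    }
}
```

Compared with a single shared unknown vector, this keeps different unknown words distinguishable while remaining reproducible between runs.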