capreolus-ir / capreolus

A toolkit for end-to-end neural ad hoc retrieval
https://capreolus.ai
Apache License 2.0

Make EmbedText faster #63

Closed kevinmartinjos closed 4 years ago

kevinmartinjos commented 4 years ago

Right now, when creating an embedding matrix, random embeddings are used for OOV words. zerounk would set OOV word embeddings to zero. Both are extremes.

Consequences:

  1. Slow embedding creation
  2. Noisy signals due to the cosine similarity between two randomly initialized OOV word embeddings

Solution: For OOV words,

  1. Set similarity to 1 if there's an exact match (unlike zerounk, which always sets it to 0)
  2. Set similarity as 0 if it's not an exact match
  3. Avoid building stoi - the pymagnitude embedding already has a vocabulary. Simply add OOV terms to this vocab
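
A minimal sketch of rules 1 and 2 above, for illustration only. The `vectors` dict is a hypothetical stand-in for the pymagnitude embedding lookup, and `term_similarity` is a made-up name, not part of capreolus:

```python
import math


def term_similarity(q_term, d_term, vectors):
    """Cosine similarity for in-vocabulary pairs; exact-match rule if either term is OOV.

    `vectors` is a hypothetical dict mapping term -> list of floats,
    standing in for the real embedding lookup.
    """
    q_vec = vectors.get(q_term)
    d_vec = vectors.get(d_term)

    # At least one term is OOV: no random vectors, just an exact-match check.
    if q_vec is None or d_vec is None:
        return 1.0 if q_term == d_term else 0.0

    # Both terms are in the vocabulary: ordinary cosine similarity.
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in d_vec))
    return dot / norm if norm else 0.0
```

With this rule no embedding ever has to be generated for an OOV term, which is where the speedup would come from.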

andrewyates commented 4 years ago

Here's a SimilarityMatrix that handles this.

The idea is that OOV terms get negative indices, padding=0, and in-vocab terms get positive indices.
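
A rough sketch of that index scheme, not the actual SimilarityMatrix code. The names `make_indexer` and `index_similarity` are hypothetical, `in_vocab` stands in for the embedding's existing vocabulary, and treating padding as similarity 0 is an assumption:

```python
PAD = 0  # padding index, as described above


def make_indexer(in_vocab):
    """Return a function mapping a term to an index.

    `in_vocab` is a hypothetical dict of term -> positive index, standing in
    for the vocabulary the embedding already provides.
    """
    oov = {}  # OOV term -> negative index, assigned on first sight

    def index(term):
        if term in in_vocab:
            return in_vocab[term]
        if term not in oov:
            oov[term] = -(len(oov) + 1)  # -1, -2, -3, ...
        return oov[term]

    return index


def index_similarity(q_idx, d_idx, cosine):
    """Similarity between two indexed terms; `cosine` handles in-vocab pairs."""
    if q_idx == PAD or d_idx == PAD:
        return 0.0  # assumption: padding never matches anything
    if q_idx < 0 or d_idx < 0:
        # OOV terms match only themselves (same negative index).
        return 1.0 if q_idx == d_idx else 0.0
    return cosine(q_idx, d_idx)
```

The sign of the index alone tells the similarity code which branch to take, so no embedding rows are needed for OOV terms.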