colonialjelly / name-matching

Try a language model over edits #4

Open DallanQ opened 3 years ago

DallanQ commented 3 years ago

Here's an idea for calculating similarity:

  1. Compute the edit operations that turn a record name into a tree name; e.g., smythe -> smith => [s/s, m/m, y/i, t/t, h/h, e/]
  2. Construct multiple sequences for each edit sequence based on the number of times the record name and tree name co-occur
  3. Build a language model (AWD-LSTM from fastai, or a Transformer) from the sequences
  4. Use the language model to compute the loss of the edit sequence needed to transform every record name into every tree name (performance hack: only score tree names above a minimum cosine similarity)
  5. The loss of the edit sequence is the "similarity" between the two names (lower loss means more similar)
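
To make the steps concrete, here is a rough sketch of steps 1, 4 and 5. It is only an illustration under several assumptions: the edit operations come from a plain Levenshtein alignment, and a tiny add-one-smoothed bigram model stands in for the AWD-LSTM or Transformer so the example runs with the standard library alone. The names `edit_ops` and `BigramEditModel`, and the training pairs, are made up for this example.

```python
import math
from collections import Counter


def edit_ops(source, target):
    """Align source to target (Levenshtein) and return ops such as 'y/i'.

    'a/a' is a match, 'a/b' a substitution, 'a/' a deletion, '/b' an insertion.
    """
    n, m = len(source), len(target)
    # cost[i][j] = edit distance between source[:i] and target[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        diag = i > 0 and j > 0 and (
            cost[i][j] == cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])
        )
        if diag:
            ops.append(f"{source[i - 1]}/{target[j - 1]}")
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(f"{source[i - 1]}/")
            i -= 1
        else:
            ops.append(f"/{target[j - 1]}")
            j -= 1
    return ops[::-1]


class BigramEditModel:
    """Add-one-smoothed bigram model over edit ops (stand-in for a neural LM)."""

    def __init__(self, sequences):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"<eos>"}
        for seq in sequences:
            seq = list(seq)
            self.vocab.update(seq)
            tokens = ["<bos>"] + seq + ["<eos>"]
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def loss(self, seq):
        """Mean negative log-likelihood per token; lower means more plausible."""
        tokens = ["<bos>"] + list(seq) + ["<eos>"]
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen ops
        nll = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v)
            nll -= math.log(p)
        return nll / (len(tokens) - 1)


# Hypothetical (record name, tree name) training pairs, for illustration only.
pairs = [("smythe", "smith"), ("smyth", "smith"), ("jon", "john")]
model = BigramEditModel(edit_ops(r, t) for r, t in pairs)

print(edit_ops("smythe", "smith"))  # ['s/s', 'm/m', 'y/i', 't/t', 'h/h', 'e/']
print(model.loss(edit_ops("smythe", "smith")))
print(model.loss(edit_ops("smythe", "jones")))
```

Step 2 would correspond to repeating each training sequence according to its co-occurrence count before fitting the model. A neural language model would replace `BigramEditModel`, but the scoring idea stays the same: a lower loss on the edit sequence means the two names are more likely to be variants of each other.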