Try a language model over edits

Here's an idea for calculating the similarity between a record name and a tree name:

Compute the edit operations that turn the record name into the tree name; e.g., smythe -> smith => [s/s, m/m, y/i, t/t, h/h, e/], where x/y replaces x with y and e/ deletes e.
Construct multiple training sequences for each edit sequence, with the number of copies based on how often the record name and tree name co-occur.
Build a language model (AWD-LSTM from fastai, or a Transformer) from the sequences.
Use the language model to compute the loss of the edit sequence needed to transform each record name into each tree name (performance hack: only score tree names above a minimum cosine similarity to the record name).
The loss of the edit sequence is the "similarity" between the two names; a lower loss means a closer match.
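
A rough sketch of the steps above, just to make the idea concrete. Everything in it is an assumption rather than the planned implementation: `edit_ops` uses `difflib.SequenceMatcher` as a stand-in aligner (it is not guaranteed to give a minimal edit sequence), `BigramModel` is a tiny add-one-smoothed bigram model standing in for the AWD-LSTM or Transformer, and the `pair_counts` values are made up. The cosine-similarity pre-filter is left out.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def edit_ops(source, target):
    """Return per-character edit tokens 'x/y' that turn source into target."""
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        src, tgt = source[i1:i2], target[j1:j2]
        if tag == "equal":
            ops.extend(f"{c}/{c}" for c in src)
        else:
            # Pair characters by position; leftovers become
            # deletions ('x/') or insertions ('/y').
            for k in range(max(len(src), len(tgt))):
                a = src[k] if k < len(src) else ""
                b = tgt[k] if k < len(tgt) else ""
                ops.append(f"{a}/{b}")
    return ops


class BigramModel:
    """Add-one-smoothed bigram model over edit tokens (stand-in for a neural LM)."""

    def __init__(self, sequences):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"<eos>"}
        for seq in sequences:
            tokens = ["<bos>"] + list(seq) + ["<eos>"]
            self.vocab.update(seq)
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def loss(self, seq):
        """Mean negative log-likelihood per token; lower means more plausible."""
        tokens = ["<bos>"] + list(seq) + ["<eos>"]
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen tokens
        nll = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v)
            nll -= math.log(p)
        return nll / (len(tokens) - 1)


# Hypothetical co-occurrence counts of (record name, tree name) pairs.
pair_counts = {("smythe", "smith"): 3, ("smyth", "smith"): 2, ("jon", "john"): 2}

# One copy of the edit sequence per co-occurrence.
sequences = [edit_ops(r, t) for (r, t), n in pair_counts.items() for _ in range(n)]
model = BigramModel(sequences)

seen_loss = model.loss(edit_ops("smythe", "smith"))
unseen_loss = model.loss(edit_ops("smythe", "jones"))
print(edit_ops("smythe", "smith"))
print(seen_loss < unseen_loss)
```

Repeating a sequence once per co-occurrence is the simplest way to let frequent pairs dominate training; a real language model could instead take the counts as sample weights.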

colonialjelly / name-matching