Auto-tune similarity scores

We compute similarity from several sources: mission statement overlap, Twitter overlap, etc. It may be worth merging these into a single score. Here are two ideas:

The simple way: take a weighted average of the scores, where the weights are either uniform or tuned by hand based on a (totally subjective) look at the results.
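A minimal sketch of the weighted average, assuming every source score is already on a comparable scale (say 0 to 1). The function name `combined_score` and the source keys `mission_statement` and `twitter` are hypothetical, not names from the codebase, and the numbers are made up for illustration.

```python
def combined_score(scores, weights=None):
    """Weighted average of the per-source similarity scores for one pair.

    scores:  dict mapping source name -> similarity score (assumed 0..1)
    weights: dict mapping source name -> non-negative weight;
             defaults to uniform weights when omitted
    """
    if weights is None:
        weights = {source: 1.0 for source in scores}
    total = sum(weights[source] for source in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[source] * value for source, value in scores.items())
    return weighted_sum / total


# Made-up scores for a single pair of nonprofits.
pair_scores = {"mission_statement": 0.8, "twitter": 0.2}

uniform = combined_score(pair_scores)
hand_tuned = combined_score(pair_scores, {"mission_statement": 3.0, "twitter": 1.0})
```

Dividing by the sum of the weights keeps the combined score on the same scale as the inputs, so hand-tuned weights do not need to sum to 1.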
The more complex way: tune the weights automatically using known NTEE codes or causes. This maps to the following classification problem:
- For each pair of nonprofits, create a classification instance where the label is 1 if they have the same cause or NTEE code, 0 otherwise
- Train a classifier on this data, using the scores from each source as features.
- The resulting weights should tell us "how important is this score for reproducing cause/NTEE classifications?"
- Of course, we don't want to reproduce NTEE codes exactly, but this may nudge the weights in the right direction.
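The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the classifier here is a plain logistic regression fit by gradient descent (one possible choice; a library implementation would work just as well), the helper names `build_instances` and `train_logistic` are hypothetical, and the nonprofits, NTEE letters, and scores are invented toy data.

```python
import itertools
import math


def build_instances(ntee_by_org, pair_scores):
    """One instance per pair: features are the per-source scores,
    label is 1 if both nonprofits share an NTEE code, else 0."""
    features, labels = [], []
    for a, b in itertools.combinations(sorted(ntee_by_org), 2):
        features.append(pair_scores[(a, b)])
        labels.append(1 if ntee_by_org[a] == ntee_by_org[b] else 0)
    return features, labels


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit logistic regression with full-batch gradient descent."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted prob minus label
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy data (invented): four nonprofits with hypothetical NTEE letters.
ntee_by_org = {"a": "B", "b": "B", "c": "E", "d": "E"}
# Per-pair features: [mission statement overlap, Twitter overlap].
pair_scores = {
    ("a", "b"): [0.9, 0.5],
    ("a", "c"): [0.2, 0.6],
    ("a", "d"): [0.1, 0.4],
    ("b", "c"): [0.3, 0.5],
    ("b", "d"): [0.2, 0.6],
    ("c", "d"): [0.8, 0.4],
}

features, labels = build_instances(ntee_by_org, pair_scores)
weights, bias = train_logistic(features, labels)
# weights[0] is the learned weight for mission statement overlap,
# weights[1] the learned weight for Twitter overlap.
```

In this toy setup only the mission statement score separates same-code pairs from the rest, so it should end up with the larger weight. The learned weights could then be plugged into the weighted average from the simple approach, or the classifier's predicted probability could be used directly as the merged score. Since most real pairs will not share a code, class imbalance is something to watch for.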
