We compute similarity based on several signals: mission statement overlap, Twitter overlap, etc. It may be worth merging these into a single score. Here are two ideas:
The simple way: take a weighted average of the scores, where the weights are either uniform or tuned by hand based on a totally subjective analysis of the results.
The more complex way: tune the weights automatically using known NTEE codes or causes. This can be mapped to the following classification problem:
For each pair of nonprofits, create a classification instance where the label is 1 if they have the same cause or NTEE code, 0 otherwise.
Train a classifier on this data, using the scores from each source as features.
The resulting weights should tell us how important each score is for reproducing cause/NTEE classifications.
Of course, we don't want to reproduce NTEE codes exactly, but this may nudge the weights in the right direction.
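Both ideas above can be sketched in a few lines of Python. This is only an illustration under assumptions: the source names (`mission`, `twitter`), the toy nonprofits, their codes and their scores are all made up. The classifier is a plain logistic regression trained by gradient descent, standing in for whichever classifier would actually be used.

```python
import math
from itertools import combinations

def combined_score(scores, weights=None):
    """Idea 1: weighted average of per-source similarity scores."""
    if weights is None:
        weights = {name: 1.0 for name in scores}  # uniform weights
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total

def make_instances(nonprofits, pair_scores, sources):
    """Idea 2, step 1: one instance per pair of nonprofits.

    Features are the per-source scores; the label is 1 if both
    nonprofits share the same code, 0 otherwise.
    """
    X, y = [], []
    for a, b in combinations(sorted(nonprofits), 2):
        X.append([pair_scores[(a, b)][s] for s in sources])
        y.append(1 if nonprofits[a] == nonprofits[b] else 0)
    return X, y

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Idea 2, step 2: logistic regression via batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob - label
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Toy data (hypothetical): nonprofit name -> made-up code.
nonprofits = {"A": "E20", "B": "E20", "C": "B40", "D": "B40"}
sources = ["mission", "twitter"]
# Made-up scores: "mission" tracks the codes, "twitter" does not.
pair_scores = {
    ("A", "B"): {"mission": 0.9, "twitter": 0.5},
    ("A", "C"): {"mission": 0.2, "twitter": 0.6},
    ("A", "D"): {"mission": 0.1, "twitter": 0.4},
    ("B", "C"): {"mission": 0.3, "twitter": 0.5},
    ("B", "D"): {"mission": 0.2, "twitter": 0.5},
    ("C", "D"): {"mission": 0.8, "twitter": 0.5},
}

X, y = make_instances(nonprofits, pair_scores, sources)
w, b = train_logistic(X, y)
learned = dict(zip(sources, w))
print("learned weights:", learned)
print("uniform score for (A, B):", combined_score(pair_scores[("A", "B")]))
```

The learned coefficients can be negative and are not on the same scale as hand-picked weights. Clipping them at zero and normalizing before passing them to `combined_score` is one option, but that is a design choice the idea above does not prescribe.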