Create deduplication data for hand-annotating training data for a simple classifier

compute the score of every (string) similarity measure for the names of two subjects
threshold on the mean of the scores and keep only pairs at or above it (to reduce the amount of data)
export duplicate candidates to a file in the following way:
- for every new subject there should be a list of subjects in the knowledge base that can be linked
- the grouped subjects should be ranked by the mean of the scores in descending order
- every subject should contain all the subject data, the similarity measure scores and the mean of the scores
- export format should be JSON
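
A minimal sketch of the scoring, thresholding and export steps above, not a definitive implementation. The two measures (`sequence_ratio`, `token_jaccard`), the `name` field, the default threshold of 0.5 and the JSON key names (`subject`, `candidates`, `scores`, `mean`) are all assumptions made for illustration; the actual set of similarity measures and the subject schema are not specified here.

```python
import json
from difflib import SequenceMatcher
from statistics import mean


# Hypothetical stand-in measures; swap in the real similarity measures.
def sequence_ratio(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a, b):
    """Jaccard overlap of whitespace-separated, lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


MEASURES = {"sequence_ratio": sequence_ratio, "token_jaccard": token_jaccard}


def score_pair(name_a, name_b):
    """Return every measure's score for two names, plus the mean of the scores."""
    scores = {label: fn(name_a, name_b) for label, fn in MEASURES.items()}
    return scores, mean(scores.values())


def build_candidates(new_subjects, kb_subjects, threshold=0.5):
    """For every new subject, list knowledge base subjects whose mean score
    reaches the threshold, ranked by mean score in descending order."""
    groups = []
    for new in new_subjects:
        candidates = []
        for kb in kb_subjects:
            scores, avg = score_pair(new["name"], kb["name"])  # assumes a "name" field
            if avg >= threshold:
                candidates.append({"subject": kb, "scores": scores, "mean": avg})
        candidates.sort(key=lambda c: c["mean"], reverse=True)
        groups.append({"subject": new, "candidates": candidates})
    return groups


def export_candidates(groups, path):
    """Write the grouped duplicate candidates to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(groups, f, indent=2, ensure_ascii=False)
```

Calling `export_candidates(build_candidates(new_subjects, kb_subjects), "candidates.json")` would produce one entry per new subject. Note that this compares every new subject with every knowledge base subject, which may be too slow for a large knowledge base without some form of blocking.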
a Python script should then read the exported data and allow it to be annotated manually
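
One possible shape for such an annotation script, as a sketch only. It assumes an exported JSON layout in which each entry has a `subject` (with a `name` field) and a list of `candidates`, each carrying `subject`, `scores` and `mean`; these key names, the `y/n/q` prompt and the file names `candidates.json` and `annotations.json` are hypothetical.

```python
import json


def annotate(groups, ask=input):
    """Ask for a duplicate / not-duplicate label for every candidate pair.

    `ask` is injectable so the loop can be driven without a terminal.
    Answering "q" stops early and returns the labels collected so far.
    """
    labels = []
    for group in groups:
        new = group["subject"]
        for cand in group["candidates"]:
            prompt = (
                f'{new["name"]!r} vs {cand["subject"]["name"]!r} '
                f'(mean={cand["mean"]:.2f}) duplicate? [y/n/q] '
            )
            answer = ask(prompt).strip().lower()
            if answer == "q":
                return labels
            labels.append(
                {
                    "new": new,
                    "candidate": cand["subject"],
                    "scores": cand["scores"],
                    "is_duplicate": answer == "y",  # anything but "y" counts as "no"
                }
            )
    return labels


def main(in_path="candidates.json", out_path="annotations.json"):
    """Load exported candidates, annotate them interactively, save the labels."""
    with open(in_path, encoding="utf-8") as f:
        groups = json.load(f)
    labels = annotate(groups)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(labels, f, indent=2, ensure_ascii=False)
```

Running `main()` walks through the candidates in their exported order, so the highest-ranked pair for each new subject is shown first. The per-measure scores are kept alongside each label so they can later serve as features for the classifier.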

bpn1 / ingestion