bpn1/ingestion
Ingestion Pipeline
Apache License 2.0
Create deduplication data for hand-annotating training data for a simple classifier
#646
Open
janehmueller opened 6 years ago
janehmueller commented 6 years ago
- Generate the score of every (string) similarity measure for the names of two subjects.
- Threshold the mean of all scores to reduce the amount of data.
- Export the duplicate candidates to a file as follows:
  - For every new subject, there should be a list of subjects in the knowledge base that it can be linked to.
  - The grouped subjects should be ranked by the mean of their scores in descending order.
  - Every subject should contain all of its subject data, the similarity measure scores, and the mean of the scores.
  - The export format should be JSON.
- A Python script should then load the exported data and support manual annotation.
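
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions, not the pipeline's actual implementation: subjects are modelled as plain dicts with hypothetical `id` and `name` keys, the two similarity measures (`sequence_ratio`, `token_jaccard`) are stand-ins built from the Python standard library because the issue does not name the measures to use, and the default threshold of `0.5` as well as all function names are made up for the example.

```python
import json
from difflib import SequenceMatcher
from statistics import mean


def sequence_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace-separated, lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical stand-ins for the similarity measures of the real pipeline.
MEASURES = {"sequence_ratio": sequence_ratio, "token_jaccard": token_jaccard}


def score_pair(new_subject: dict, existing_subject: dict) -> dict:
    """Score two subjects by name with every measure and add the mean."""
    scores = {
        name: measure(new_subject["name"], existing_subject["name"])
        for name, measure in MEASURES.items()
    }
    return {
        "subject": existing_subject,  # all subject data is kept
        "scores": scores,
        "mean": mean(list(scores.values())),
    }


def duplicate_candidates(new_subjects, knowledge_base, threshold=0.5):
    """Group candidates per new subject, thresholded and ranked by mean score."""
    groups = []
    for new_subject in new_subjects:
        candidates = [score_pair(new_subject, s) for s in knowledge_base]
        candidates = [c for c in candidates if c["mean"] >= threshold]
        candidates.sort(key=lambda c: c["mean"], reverse=True)
        if candidates:
            groups.append({"subject": new_subject, "candidates": candidates})
    return groups


def export_candidates(groups, path):
    """Write the grouped candidates to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(groups, f, ensure_ascii=False, indent=2)


def annotate(groups, ask=input):
    """Ask for a y/n label per candidate pair; `ask` is injectable for testing."""
    labels = []
    for group in groups:
        for candidate in group["candidates"]:
            prompt = (
                f'{group["subject"]["name"]!r} vs '
                f'{candidate["subject"]["name"]!r} '
                f'(mean {candidate["mean"]:.2f}) duplicate? [y/n] '
            )
            labels.append({
                "new_id": group["subject"]["id"],
                "candidate_id": candidate["subject"]["id"],
                "is_duplicate": ask(prompt).strip().lower() == "y",
            })
    return labels
```

In this sketch the annotation step would load the exported file with `json.load` and pass the result to `annotate`, which prompts on the console by default. Keeping the per-measure scores next to the mean in the export means the labelled pairs can later serve directly as feature vectors for the classifier.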