NBU-DSCM-2020 / dscm006-semtech-group-project

NBU Data Science 2020 DSCM006 Semantic Technologies Group Project

Fix duplicate rows in media register #8

Open mariana-angelova opened 3 years ago

mariana-angelova commented 3 years ago

There are duplicate rows in the media register (e.g. "БТА, Българска телеграфна агенция, :" vs "БЪЛГАРСКА ТЕЛЕГРАФНА АГЕНЦИЯ"). Try using SAMPLE, which SPARQL treats as an aggregate, so each group of duplicates returns a single value.

ctapnec commented 3 years ago

Lowercasing plus a properly tuned Levenshtein distance threshold could help a lot here. The distance is 8 in this particular case; a threshold of up to 10-12 is probably right for the generic case, since bigger values would generate false positives. It would also need some minimum string length or a dynamic threshold. I'm on it.
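
A rough sketch of that idea in Python, not the project's actual code. The function names `levenshtein` and `is_probable_duplicate`, and the parameters `max_distance`, `min_length` and `ratio` with their default values, are all assumptions made up for illustration and would need tuning against the real register.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def is_probable_duplicate(a: str, b: str,
                          max_distance: int = 10,
                          min_length: int = 12,
                          ratio: float = 0.4) -> bool:
    """Hypothetical duplicate check: lowercase, then compare edit distance
    against a threshold that shrinks for short names."""
    a, b = a.strip().lower(), b.strip().lower()
    shorter = min(len(a), len(b))
    if shorter < min_length:
        # Assumed rule: short names must match exactly, since a few edits
        # can turn one short name into a completely different one.
        return a == b
    # Dynamic threshold: capped at max_distance, and never more than a
    # fraction of the shorter string's length.
    threshold = min(max_distance, int(ratio * shorter))
    return levenshtein(a, b) <= threshold


first = "БТА, Българска телеграфна агенция, :"
second = "БЪЛГАРСКА ТЕЛЕГРАФНА АГЕНЦИЯ"
print(levenshtein(first.lower(), second.lower()))  # 8
print(is_probable_duplicate(first, second))        # True
```

After lowercasing, the second name is contained in the first, so the distance is just the 8 extra characters ("бта, " in front and ", :" at the end). A pairwise comparison like this is quadratic in the number of rows, which should be fine for a small register but may need blocking or indexing for a larger one.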