Open mariana-angelova opened 4 years ago
Lowercasing plus a properly tuned Levenshtein distance threshold could help a lot here: 8 in this particular case, probably up to 10-12 in the generic case (bigger values would generate false positives). I have in mind some minimum string length or a dynamic threshold. I'm on it.
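A minimal sketch of that idea in Python, not the final implementation. The function names, the default `max_distance=10`, and the `min_length=12` cutoff are assumptions for illustration only; the fixed threshold could be swapped for a dynamic one.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def is_probable_duplicate(a: str, b: str,
                          max_distance: int = 10,   # assumed default
                          min_length: int = 12) -> bool:  # assumed cutoff
    """Compare lowercased names; short names must match exactly."""
    a, b = a.lower(), b.lower()
    if min(len(a), len(b)) < min_length:
        # Too short for fuzzy matching: a large threshold would
        # match almost anything, so fall back to exact equality.
        return a == b
    return levenshtein(a, b) <= max_distance


first = "БТА, Българска телеграфна агенция, :"
second = "БЪЛГАРСКА ТЕЛЕГРАФНА АГЕНЦИЯ"
print(levenshtein(first.lower(), second.lower()))  # 8
print(is_probable_duplicate(first, second))        # True
```

The distance is 8 here because, after lowercasing, the first name is the second one plus a 5-character prefix ("бта, ") and a 3-character suffix (", :").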
There are duplicate rows in the media register (e.g. "БТА, Българска телеграфна агенция, :" vs "БЪЛГАРСКА ТЕЛЕГРАФНА АГЕНЦИЯ"). Try using SAMPLE (which is treated as an aggregate).