LiteralGenie opened 2 years ago
I'd like to keep the user info (the series on disk) separate from the MU data. If I store the raw MU data separately (i.e. as JSON, not normalized into tables), I can make this a future problem.
So to link series on disk with the MU metadata, the series (folder) name has to be compared with the titles in the MU data.
This implies some kind of similarity metric. As it turns out, the fuzzyset package is by far the fastest for calculating this (see table below or string_comp_results.md for details).
At ~10ms per query, matching 10k series takes ~100s in a single process (10,000 × 10ms). With multiprocessing, and assuming the work splits roughly evenly across workers, that comes to ~25s with 4 workers and ~8s with 12 workers.
method | queries.json | 5 | 20 | 100 |
---|---|---|---|---|
jellyfish-damerau_levenshtein | 673 | 285 | 676 | 2717 |
jellyfish-hamming | 88 | 83 | 85 | 88 |
jellyfish-jaro | 317 | 133 | 305 | 897 |
jellyfish-jaro_winkler | 315 | 135 | 306 | 894 |
jellyfish-levenshtein | 349 | 141 | 354 | 1739 |
textdistance-damerau_levenshtein | 510 | 833 | 493 | 430 |
textdistance-hamming | 1441 | 453 | 353 | 452 |
textdistance-jaro | 495 | 510 | 382 | 506 |
textdistance-jaro_winkler | 448 | 456 | 451 | 385 |
textdistance-levenshtein | 445 | 429 | 445 | 447 |
fuzzyset | 11 | 2 | 7 | 34 |
sqlite3-editdist3 | 2446 | 894 | 2508 | 10400 |
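To make the linking step concrete, here's a minimal sketch of matching folder names against MU titles. It uses difflib from the standard library as a stand-in so it runs without extra installs. It is not fuzzyset and will not have the timings in the table. The function name link_series, the sample titles and folder names, and the 0.6 cutoff are all made up for illustration.

```python
import difflib


def link_series(folder_names, mu_titles, cutoff=0.6):
    """Map each folder name to its closest MU title, or None if nothing clears the cutoff.

    Stand-in matcher: difflib's ratio, not fuzzyset. The cutoff is an arbitrary assumption.
    """
    # Compare case-insensitively, but remember the original spelling of each title.
    lowered = {title.lower(): title for title in mu_titles}
    candidates = list(lowered)

    links = {}
    for name in folder_names:
        hits = difflib.get_close_matches(name.lower(), candidates, n=1, cutoff=cutoff)
        links[name] = lowered[hits[0]] if hits else None
    return links


# Hypothetical sample data, not taken from the real MU dump.
titles = ["One Piece", "Berserk", "Vinland Saga"]
folders = ["one_piece", "Berserk (2003)", "zzzz"]

links = link_series(folders, titles)
for folder, title in links.items():
    print(f"{folder!r} -> {title!r}")
```

Folders that come back as None would need manual linking or a looser cutoff. Swapping in fuzzyset should only change the body of the loop.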
Scanning disk with Path.glob() is also pretty slow. Turns out Path.iterdir() is much faster because(?) it doesn't fetch metadata like file size.
(E.g. scanning 20k folders takes 15.3s with glob() vs 0.8s with iterdir(), and this is after running each a few times so that whatever disk cache there is has kicked in.)
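For anyone wanting to reproduce the comparison, here's a rough timing harness. It's only a sketch: the glob pattern "*", the 200 throwaway folders, and the helper name time_listing are assumptions, since the exact calls and directory layout behind the numbers above aren't recorded here. Expect the gap to depend heavily on OS, filesystem, and Python version.

```python
import tempfile
import time
from pathlib import Path


def time_listing(list_fn, root, repeats=3):
    """Return (best wall-clock seconds, entry count) for iterating list_fn(root)."""
    best = float("inf")
    count = 0
    for _ in range(repeats):  # repeat so later runs benefit from any disk cache
        start = time.perf_counter()
        count = sum(1 for _ in list_fn(root))
        best = min(best, time.perf_counter() - start)
    return best, count


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: one empty folder per series.
    for i in range(200):
        (root / f"series_{i:04d}").mkdir()

    glob_time, glob_count = time_listing(lambda r: r.glob("*"), root)
    iter_time, iter_count = time_listing(lambda r: r.iterdir(), root)

    print(f"glob('*'):  {glob_count} entries in {glob_time:.4f}s")
    print(f"iterdir(): {iter_count} entries in {iter_time:.4f}s")
```

Point root at the real library folder instead of the temp dir to get numbers comparable to the ones above.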