LiteralGenie / mu_reader_server


Task: Implement endpoints to support series page #1

Open LiteralGenie opened 2 years ago

LiteralGenie commented 2 years ago

I'd like to keep the user info (the series on disk) separate from the MU data. If I store the raw MU data on its own for now (i.e. as JSON, not normalized into tables), I can put off deciding how to do that.

LiteralGenie commented 2 years ago

So to link series on disk with the MU metadata, the series (folder) name has to be compared with the titles in the MU data.

This implies some kind of similarity metric. Of the options I benchmarked, the fuzzyset package is by far the fastest (see table below or string_comp_results.md for details).

At ~10ms per query, matching 10k series would take ~100s. With multiprocessing, and assuming near-linear scaling, that drops to ~25s with 4 workers and ~8s with 12 workers.
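
The arithmetic behind these estimates can be written out as a short sketch. It assumes ideal scaling (the work splits evenly across workers with no overhead), which real multiprocessing will not quite reach; the function name `estimated_seconds` is made up for illustration.

```python
def estimated_seconds(per_query_ms: float, n_series: int, workers: int = 1) -> float:
    # Total query time in seconds, split evenly across workers (ideal scaling).
    return per_query_ms * n_series / 1000 / workers


if __name__ == "__main__":
    # ~10ms per query and 10k series, as in the estimate above.
    for workers in (1, 4, 12):
        print(workers, round(estimated_seconds(10, 10_000, workers), 1))
```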

| | queries.json | 5 | 20 | 100 |
|---|---:|---:|---:|---:|
| jellyfish-damerau_levenshtein | 673 | 285 | 676 | 2717 |
| jellyfish-hamming | 88 | 83 | 85 | 88 |
| jellyfish-jaro | 317 | 133 | 305 | 897 |
| jellyfish-jaro_winkler | 315 | 135 | 306 | 894 |
| jellyfish-levenshtein | 349 | 141 | 354 | 1739 |
| textdistance-damerau_levenshtein | 510 | 833 | 493 | 430 |
| textdistance-hamming | 1441 | 453 | 353 | 452 |
| textdistance-jaro | 495 | 510 | 382 | 506 |
| textdistance-jaro_winkler | 448 | 456 | 451 | 385 |
| textdistance-levenshtein | 445 | 429 | 445 | 447 |
| fuzzyset | 11 | 2 | 7 | 34 |
| sqlite3-editdist3 | 2446 | 894 | 2508 | 10400 |

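
To make the matching step concrete, here is a minimal sketch of comparing a folder name against a list of titles. It uses the standard library's `difflib.SequenceMatcher` purely as a stand-in so the example has no dependencies; it is not fuzzyset and not one of the benchmarked metrics. The helper names and the sample titles are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the lowercased strings are identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(folder_name: str, titles: list[str]) -> tuple[str, float]:
    # Score the folder name against every candidate title and keep the best.
    scored = ((title, similarity(folder_name, title)) for title in titles)
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up titles standing in for the MU data.
    titles = ["One Piece", "Berserk", "Vinland Saga"]
    print(best_match("one_piece", titles))
```

This brute-force loop scores every title for every folder, so per-query cost matters a lot at 10k series, which is why the benchmark above is relevant.
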
LiteralGenie commented 2 years ago

Scanning the disk with Path.glob() is also pretty slow. Path.iterdir() turns out to be much faster, possibly(?) because it doesn't fetch metadata like file size.

(e.g. scanning 20k folders takes 15.3s with Path.glob() vs 0.8s with Path.iterdir(), and this is after a few runs, so any disk cache should already be warm.)
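
A rough way to reproduce this comparison is sketched below. It assumes the glob pattern is `"*"` (the thread does not say which pattern was used) and builds a throwaway directory tree rather than scanning a real library, so absolute timings will differ from the numbers above. All helper names are hypothetical.

```python
import tempfile
import time
from pathlib import Path


def scan_with_glob(root: Path) -> list[str]:
    # Path.glob("*") matches every direct child of root.
    return sorted(p.name for p in root.glob("*"))


def scan_with_iterdir(root: Path) -> list[str]:
    # Path.iterdir() yields the same direct children without pattern matching.
    return sorted(p.name for p in root.iterdir())


def time_scan(scan, root: Path) -> float:
    # Wall-clock seconds for one full scan.
    start = time.perf_counter()
    scan(root)
    return time.perf_counter() - start


def make_sample_tree(n_folders: int) -> tempfile.TemporaryDirectory:
    # Create n empty folders in a temporary directory; caller cleans up.
    tmp = tempfile.TemporaryDirectory()
    for i in range(n_folders):
        (Path(tmp.name) / f"series_{i:05d}").mkdir()
    return tmp


if __name__ == "__main__":
    tmp = make_sample_tree(2_000)
    root = Path(tmp.name)
    print("glob   ", time_scan(scan_with_glob, root))
    print("iterdir", time_scan(scan_with_iterdir, root))
    tmp.cleanup()
```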