TheFeshy / BookSift

A plugin for Calibre to do book identification and duplicate detection using the book's text instead of metadata

Create a tool to "train" the algorithm parameters #21

Open TheFeshy opened 13 years ago

TheFeshy commented 13 years ago

We have a large number of parameters for tuning the algorithm; we should have an automated way to set them iteratively, test how successful each setting is, and report the best choices. We can use the report from the test generator we have now. I think the best criteria are the following:

- No false positives
- No misses
- Greatest distance between "lowest hit" and "highest miss"
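As a rough sketch of how these criteria could be turned into a single measurement: assuming a "hit" is the similarity score of a pair that really is the same book and a "miss" is the score of a pair that is not, then a threshold with no false positives and no misses exists exactly when the lowest hit is above the highest miss, and the gap between the two is the quantity to maximize. The names `Separation` and `evaluate_separation` are hypothetical and not part of BookSift, and the scores in the example are made up.

```python
from typing import NamedTuple, Sequence


class Separation(NamedTuple):
    min_hit: float    # lowest score among true matches ("lowest hit")
    max_miss: float   # highest score among non-matches ("highest miss")
    spread: float     # min_hit - max_miss; larger is better
    separable: bool   # True if some threshold gives no false positives and no misses


def evaluate_separation(hit_scores: Sequence[float],
                        miss_scores: Sequence[float]) -> Separation:
    """Summarize one test run (hypothetical helper, not BookSift's API)."""
    min_hit = min(hit_scores)
    max_miss = max(miss_scores)
    spread = min_hit - max_miss
    # A positive spread means any threshold strictly between max_miss and
    # min_hit classifies every pair in this run correctly.
    return Separation(min_hit, max_miss, spread, spread > 0)


# Made-up scores, for illustration only.
clean = evaluate_separation([70, 85, 90], [10, 25, 30])
overlapping = evaluate_separation([40, 90], [55])
print(clean)
print(overlapping)
```

A negative spread, as in the second call, means at least one miss scored higher than a hit, so no single threshold can satisfy the first two criteria for that run.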

Our current parameters are pretty good: with our sample size of 25, we get a minhit of 62 and a maxmiss of 28; that's a spread of 34%! However, because we use functions to set up the parameters, we could tune those functions further for books of various sizes, etc. Once we get it working correctly, we should try a larger sample size as well.
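One simple way the "training" tool could work is an exhaustive grid search: try every combination of candidate parameter values, run the test for each, and keep the combination with the largest minhit/maxmiss spread. The sketch below assumes a hypothetical `run_test` callable that takes a parameter dictionary and returns the hit scores and miss scores from one test run; the parameter name `window` and the `fake_run_test` stand-in are invented for illustration and do not reflect BookSift's actual parameters or test generator.

```python
import itertools
from typing import Callable, Dict, Optional, Sequence, Tuple

Params = Dict[str, float]
Report = Tuple[Sequence[float], Sequence[float]]  # (hit scores, miss scores)


def spread(report: Report) -> float:
    """Distance between the lowest hit and the highest miss."""
    hits, misses = report
    return min(hits) - max(misses)


def tune(grid: Dict[str, Sequence[float]],
         run_test: Callable[[Params], Report]) -> Tuple[Optional[Params], float]:
    """Try every combination in `grid` and return the one with the best spread."""
    names = sorted(grid)
    best_params: Optional[Params] = None
    best_spread = float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = spread(run_test(params))
        if current > best_spread:
            best_params, best_spread = params, current
    return best_params, best_spread


def fake_run_test(params: Params) -> Report:
    # Invented stand-in for the real test generator: the spread is widest
    # at window == 4 and shrinks as the value moves away from it.
    penalty = abs(params["window"] - 4) * 5
    return [62 - penalty, 90], [28 + penalty, 10]


best, best_gap = tune({"window": [2, 4, 6]}, fake_run_test)
print(best, best_gap)
```

Grid search is only one option; it becomes slow as the number of parameters grows, at which point a random or hill-climbing search over the same `spread` objective might be a better fit.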