The algorithm does a poor job of identifying anthologies; books contained within anthologies show up as only 20-40% matches, while books with no relation at all sometimes show up as high as 15-20% in our test sample, so the two ranges nearly overlap. We need a two-pronged approach:
Adjust "match" threshold dynamically based on the size difference between the two books.
If we get a match that is a low percentage, we might want to double-check it with our D-L algorithm, using a subset of
words from the original text (not the unique words!)
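A rough Python sketch of both steps (assuming "D-L" refers to the Damerau-Levenshtein edit distance, here the restricted/optimal-string-alignment variant applied to word sequences; `dynamic_threshold`, `confirm_match`, and the size-ratio scaling rule are all hypothetical illustrations, not the shipped code):

```python
def dynamic_threshold(size_a, size_b, base=0.5):
    """Hypothetical rule: scale the match threshold by the size ratio.
    A book contained in an anthology can only ever match a fraction of
    the anthology's text, so a large size gap should lower the bar."""
    ratio = min(size_a, size_b) / max(size_a, size_b)
    return base * ratio


def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance
    between two sequences; works on word lists as well as strings."""
    la, lb = len(a), len(b)
    d = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        d[i][0] = i
    for j in range(lb + 1):
        d[0][j] = j
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transposition
    return d[la][lb]


def confirm_match(words_a, words_b, window=200):
    """Double-check a borderline hit: run D-L over a window of consecutive
    words from the original text (not the unique-word sets) and return a
    similarity in [0, 1]."""
    a, b = words_a[:window], words_b[:window]
    dist = damerau_levenshtein(a, b)
    return 1.0 - dist / max(len(a), len(b), 1)
```

On a borderline hit, something like `confirm_match(words_a, words_b) >= dynamic_threshold(len(words_a), len(words_b))` would accept or reject the match before trusting the fast algorithm's low percentage.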
The cons of this approach are also twofold:
1. Getting a new set of text from a file is potentially very slow.
2. The D-L compare itself is very slow.
However, if these lookups are rare (and anthologies don't usually make up a large portion of a collection), the overall impact should be minimal. Testing will confirm.
The current fast algorithm beats every slower algorithm I have tried on accuracy. We may re-investigate with other algorithms, such as shingling (sketched below), but for now I'll close this.
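For reference, a shingling comparison could look something like the sketch below (w-shingling over word n-grams with Jaccard similarity; the function names and the window size `w=4` are illustrative, not a tested implementation):

```python
def shingles(words, w=4):
    """The set of overlapping w-word shingles in a word list."""
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def shingle_similarity(words_a, words_b, w=4):
    """Jaccard similarity between the two shingle sets."""
    sa, sb = shingles(words_a, w), shingles(words_b, w)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

For the anthology case specifically, a containment score (`len(sa & sb) / len(sa)`, with `sa` from the smaller book) might serve better than Jaccard, since the contained book's shingles should be a near-subset of the anthology's.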