Open ajenhl opened 10 years ago
The approach I'm going to take with this is to modify the existing diff functionality in the following ways:
Further, the report function would gain an extra option, --threshold, which filters the results passed to it in the same way as --threshold above.
Note that the new column would be added to intersect results, and the new report function could also filter those results.
So, by default, nothing would change for the end user (a diff would give exactly the same results). But one could get a fuzzy diff by supplying a numeric value for --threshold, so that each n-gram in the results occurs predominantly, rather than exclusively, in one subcorpus.
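A minimal sketch of how such a threshold might behave, not the actual implementation. It assumes the threshold is the minimum share of an n-gram's total occurrences that must fall within one subcorpus; the function name `fuzzy_diff` and the `(ngram, label, count)` row shape are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def fuzzy_diff(rows, threshold=None):
    """Return {ngram: (label, share)} for n-grams dominated by one label.

    rows: iterable of (ngram, label, count) tuples, one per text (assumed shape).
    threshold: assumed to be a share between 0 and 1; None means a strict
    diff, where the n-gram must occur in exactly one label.
    """
    # Sum the counts of each n-gram per label (subcorpus).
    totals = defaultdict(lambda: defaultdict(int))
    for ngram, label, count in rows:
        totals[ngram][label] += count

    cutoff = 1.0 if threshold is None else threshold
    results = {}
    for ngram, by_label in totals.items():
        total = sum(by_label.values())
        if total == 0:
            continue
        # The label holding the largest share of this n-gram's occurrences
        # (on a tie, the first label encountered wins).
        label, top = max(by_label.items(), key=lambda item: item[1])
        share = top / total
        if share >= cutoff:
            results[ngram] = (label, share)
    return results


rows = [
    ("ab", "A", 99), ("ab", "B", 1),  # almost entirely in A
    ("cd", "A", 5),                   # only in A
    ("ef", "A", 3), ("ef", "B", 3),   # evenly split
]
strict = fuzzy_diff(rows)
fuzzy = fuzzy_diff(rows, threshold=0.9)
```

With no threshold only "cd" is returned, matching the current diff behaviour; with a threshold of 0.9, "ab" is returned as well, because 99 of its 100 occurrences are in A.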
It might be that the threshold needs to apply not to the ratios between two individual texts, but to the ratio across all texts within each subcorpus.
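To make the difference between the two options concrete, here is a rough sketch for a single n-gram. It assumes the "ratio" is the share of occurrences falling on one side; the helper names `pairwise_shares` and `aggregate_share` and the text names are hypothetical.

```python
from itertools import product


def pairwise_shares(counts_a, counts_b):
    """Share of occurrences in the A text, for every (A text, B text) pair."""
    shares = {}
    for (text_a, a), (text_b, b) in product(counts_a.items(), counts_b.items()):
        if a + b:  # skip pairs where neither text contains the n-gram
            shares[(text_a, text_b)] = a / (a + b)
    return shares


def aggregate_share(counts_a, counts_b):
    """Share of occurrences in subcorpus A, summed over all of its texts."""
    total_a = sum(counts_a.values())
    total = total_a + sum(counts_b.values())
    return total_a / total if total else None


# Counts of one n-gram per text, keyed by (made-up) text name.
counts_a = {"T1": 9, "T2": 0}
counts_b = {"T3": 1}

per_pair = pairwise_shares(counts_a, counts_b)
overall = aggregate_share(counts_a, counts_b)
```

Here the pair (T2, T3) has a share of 0 for A, so a per-pair threshold of 0.9 would reject the n-gram, while the aggregated share of 0.9 would accept it. Which of the two is wanted is exactly the open question.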
Rather than a binary (not-)unique-to-subcorpus result, it would be useful to provide a graded one, based on, perhaps, frequency of occurrence. This would largely avoid the problem whereby a single instance of an n-gram in a single witness, among potentially thousands of other texts in a subcorpus, keeps that n-gram out of the results even though it otherwise appears solely in another subcorpus.