SeguinBe opened this issue 4 years ago
Hi, Benoit --
Thanks! Yes, this limitation is an intentional optimization. By assuming that the anchor points, which by construction appear once in a given document, are aligned in ascending order, we can find the longest alignment in O(n log n) time. If anchor points could repeat, we would need to fall back on standard Levenshtein-style alignments with O(n^2) time complexity, which is non-trivial for longer documents. Even then, we would not be able to find arbitrary swaps as in your example. If we didn't allow arbitrary permutations of sets of 4 paragraphs, we could model a restricted class of permutations in O(n^6) time (inversion transduction grammars). Aligning arbitrary permutations would require exponential time.
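For concreteness, here is a rough Python sketch of that monotonic step (just an illustration of the general technique, not passim's actual Scala code): treat the shared anchor n-grams as (position in document 1, position in document 2) pairs and pull out a longest chain that is increasing in both coordinates with patience sorting.

```python
from bisect import bisect_left

def longest_increasing_matches(matches):
    """Largest set of (pos1, pos2) anchor pairs that is strictly increasing
    in both coordinates, found in O(n log n) with patience sorting."""
    # Sort by pos1; break ties by descending pos2 so two anchors at the
    # same pos1 can never both end up in the chain.
    matches = sorted(matches, key=lambda m: (m[0], -m[1]))
    tails = []  # tails[k]: smallest pos2 ending an increasing chain of length k+1
    ends = []   # ends[k]: index of the match ending that chain
    links = []  # back-pointers for reconstructing one optimal chain
    for i, (_, p2) in enumerate(matches):
        k = bisect_left(tails, p2)
        if k == len(tails):
            tails.append(p2)
            ends.append(i)
        else:
            tails[k] = p2
            ends[k] = i
        links.append(ends[k - 1] if k > 0 else -1)
    chain, i = [], ends[-1] if ends else -1
    while i >= 0:
        chain.append(matches[i])
        i = links[i]
    return chain[::-1]

# The swapped-paragraph example from this issue: doc1 = p1 + p2, doc2 = p2 + p1.
# Any increasing chain can cover only one of the two blocks.
anchors = [(0, 10), (1, 11), (2, 12), (3, 13),  # p1's anchor n-grams
           (10, 0), (11, 1), (12, 2)]           # p2's anchor n-grams
print(longest_increasing_matches(anchors))
# -> [(0, 10), (1, 11), (2, 12), (3, 13)]  (p2's matches are dropped)
```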
For the moment, our primary practical work-around for these issues is to chunk the documents "finely enough" that the monotonic alignment constraint is not a problem.
It is interesting to think about what other assumptions we could make to get better practical alignments that realistically capture text reuse. Having fewer uniqueness constraints on which passages in one document are covered by another (something with the complexity of an HMM) could get us back to O(n^2). Pruning the candidate alignments we need to check with a locality-sensitive hash in dense embedding space would be an interesting approach to get back under quadratic time.
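For that last idea, a rough sketch of what the pruning could look like (purely illustrative, nothing like this exists in passim, and the passage embeddings are assumed to come from some model left unspecified here): hash each embedding with random hyperplanes and only compare passages that share a bucket.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(embeddings, n_bits=16, n_tables=4, seed=0):
    """Random-hyperplane (SimHash-style) bucketing of dense passage
    embeddings. Passages become candidate pairs only if they share a
    bucket in some table, which prunes most of the quadratic number
    of comparisons."""
    rng = np.random.default_rng(seed)
    n, dim = embeddings.shape
    candidates = set()
    for _ in range(n_tables):
        planes = rng.standard_normal((dim, n_bits))
        signs = embeddings @ planes > 0          # (n, n_bits) sign pattern per passage
        buckets = defaultdict(list)
        for i, key in enumerate(map(tuple, signs)):
            buckets[key].append(i)
        for members in buckets.values():
            # A real version would keep only cross-document pairs and
            # verify them with an actual alignment.
            candidates.update(combinations(members, 2))
    return candidates

# Toy usage with random vectors standing in for passage embeddings.
vecs = np.random.default_rng(1).standard_normal((1000, 128))
pairs = lsh_candidate_pairs(vecs)
print(f"{len(pairs)} candidate pairs instead of {1000 * 999 // 2}")
```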
Do you have example tasks that would benefit from these sorts of finer-grained alignments?
Hi David,
Thanks for the detailed answer; it really shows the work you have put into this project.
Yeah, I was thinking it through quickly and realising it was most probably a necessary optimization. I was wondering if an iterative approach could work: extract the longest increasing subsequence, remove the found pairs, extract the l.i.s. on the remaining ones, and so on. Do you think that could do the trick? It would stay roughly O(n log n) and would lift the limitation I was highlighting. A potential issue would be duplicate passages (especially because of the gap filling), but I am not knowledgeable enough about your code base to foresee the limitations.
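Roughly what I have in mind, as a quick Python sketch (not tied to your actual data structures; I am assuming the matches come as (position in doc 1, position in doc 2) pairs, and `one_lis_pass` is just the same patience-sorting routine as in the sketch above):

```python
from bisect import bisect_left

def one_lis_pass(matches):
    """One longest chain of (pos1, pos2) pairs that is strictly increasing
    in both coordinates (patience sorting, O(n log n))."""
    matches = sorted(matches, key=lambda m: (m[0], -m[1]))
    tails, ends, links = [], [], []
    for i, (_, p2) in enumerate(matches):
        k = bisect_left(tails, p2)
        if k == len(tails):
            tails.append(p2)
            ends.append(i)
        else:
            tails[k] = p2
            ends[k] = i
        links.append(ends[k - 1] if k > 0 else -1)
    chain, i = [], ends[-1] if ends else -1
    while i >= 0:
        chain.append(matches[i])
        i = links[i]
    return chain[::-1]

def iterative_alignment(matches, min_chain=3):
    """Repeatedly extract the longest increasing chain, drop its pairs,
    and rerun on the remainder, so each reordered block gets its own
    chain instead of being discarded."""
    remaining, chains = set(matches), []
    while remaining:
        chain = one_lis_pass(remaining)
        if len(chain) < min_chain:  # stop once only scattered noise is left
            break
        chains.append(chain)
        remaining -= set(chain)
    return chains

# The swapped-paragraph case: the second pass recovers the block that a
# single monotone alignment would drop.
anchors = [(0, 10), (1, 11), (2, 12), (3, 13), (10, 0), (11, 1), (12, 2)]
print(iterative_alignment(anchors))
# -> [[(0, 10), (1, 11), (2, 12), (3, 13)], [(10, 0), (11, 1), (12, 2)]]
```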
In practice, I work with whole books directly and not short articles. Matching can be complex, especially if both books heavily quote the same texts (see the following image: if each book is used as a separate document, all the green reordered parts are missed, and those are most likely the most interesting parts).
In the end, I am considering each page separately but as part of a common 'series' (that is how I could generate the previous image). That assumes we do not have reordered parts within the same page, and I am afraid I might also miss short passages which overlap two pages and would get split up.
A way to alleviate the last problem would be to add before/after context to the text of each page and then re-merge the results a posteriori, but it feels a bit like kicking the can down the road.
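For the record, what I mean concretely (a rough sketch; I am assuming the JSON-lines input with `id`, `series` and `text` fields, please correct me if I have the field names wrong; the amount of context and the extra book-keeping fields are made up):

```python
import json

def pages_with_context(pages, book_id, context_chars=500):
    """One record per page, padded with the tail of the previous page and
    the head of the next one, so a passage straddling a page break still
    falls inside a single record. `context_chars` should exceed the
    longest reuse passage one cares about."""
    records = []
    for i, text in enumerate(pages):
        before = pages[i - 1][-context_chars:] if i > 0 else ""
        after = pages[i + 1][:context_chars] if i + 1 < len(pages) else ""
        records.append({
            "id": f"{book_id}_p{i:04d}",
            "series": book_id,          # all pages of a book share one series
            "text": before + text + after,
            # book-keeping for shifting hit offsets back to true page
            # coordinates when re-merging the results afterwards
            "context_before": len(before),
            "page_length": len(text),
        })
    return records

# Write JSON lines to feed to passim; the a posteriori merge of
# overlapping hits is the part that still feels like can-kicking.
pages = ["... text of page 1 ...", "... text of page 2 ...", "... text of page 3 ..."]
with open("book1.json", "w") as out:
    for rec in pages_with_context(pages, "book1"):
        out.write(json.dumps(rec) + "\n")
```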
Hello,
First of all, what a wonderful tool.
I encountered a limitation though: matching passages in a pair of documents have to appear in the same order.
For instance, if one takes two arbitrary paragraphs (p1, p2) and sets document 1 to be p1+p2 but document 2 to be p2+p1, passim will only find the match for the longer paragraph, ignoring the other one (which is definitely matching).
As far as I understand, this comes from the increasingMatches function, which seems to extract the longest increasing subsequence of matching n-grams.
How much of this is by design? As far as I understand, even replacing increasingMatches would not solve the issue, as the increasing property seems to be assumed by subsequent functions.