Notably, none of the duplication detection functions in Analyzer check for contamination across references and summaries, i.e., cases where the reference text of one data instance appears as the summary of another.
It would be interesting to see whether this actually occurs in the wild; either way, such a check should not be difficult to implement.
The main downside is computational cost, which can be substantial, especially for comparison methods other than exact.
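
As a rough illustration of what such a check could look like for exact matching, here is a minimal sketch. The function name `find_cross_contamination` and its signature are hypothetical and not part of Analyzer; the sketch assumes references and summaries are given as two parallel lists of strings.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_cross_contamination(
    references: List[str], summaries: List[str]
) -> List[Tuple[int, int]]:
    """Return (reference_index, summary_index) pairs with identical text.

    Hypothetical helper, not part of Analyzer. Pairs from the same instance
    (i == j) are skipped, because the case of interest is a reference that
    reappears as the summary of a *different* instance.
    """
    # Index every summary text once so each reference needs only one lookup.
    summary_positions: Dict[str, List[int]] = defaultdict(list)
    for j, summary in enumerate(summaries):
        summary_positions[summary].append(j)

    matches: List[Tuple[int, int]] = []
    for i, reference in enumerate(references):
        for j in summary_positions.get(reference, []):
            if i != j:
                matches.append((i, j))
    return matches


references = ["A long article.", "Short text.", "Another article."]
summaries = ["Short text.", "A summary.", "Another article."]
print(find_cross_contamination(references, summaries))  # [(1, 0)]
```

With exact matching, a hash-based index keeps the cost roughly linear in the number of instances. Fuzzy comparison methods generally cannot use such a lookup and fall back to comparing every reference against every summary, which is where the computational cost mentioned above comes from.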