Open dennlinger opened 1 year ago
For `Cleaner` (#41), the duplicate handling can be chosen from different strategies. The default (and currently the only supported one) records every encountered reference and summary, and does not add any later sample that contains a previously seen text.
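To make the default behavior concrete, here is a minimal sketch of one possible reading of it. The function name `filter_first_seen`, the `(reference, summary)` tuple layout, and the use of a single shared set for both text types are assumptions for illustration, not the actual `Cleaner` code. It also assumes that texts of dropped samples still count as "seen".

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical sample layout: (reference_text, summary_text).
Sample = Tuple[str, str]


def filter_first_seen(samples: Iterable[Sample]) -> List[Sample]:
    """Keep a sample only if neither of its texts was encountered before."""
    seen: Set[str] = set()  # assumption: one shared set for references and summaries
    kept: List[Sample] = []
    for reference, summary in samples:
        is_duplicate = reference in seen or summary in seen
        # Every encountered text is recorded, even if the sample is dropped.
        seen.add(reference)
        seen.add(summary)
        if not is_duplicate:
            kept.append((reference, summary))
    return kept


if __name__ == "__main__":
    data = [("r1", "s1"), ("r1", "s2"), ("r3", "s2"), ("r4", "s4")]
    print(filter_first_seen(data))
```

Under this reading the result depends on iteration order: the first occurrence always wins, and a dropped sample can still cause later samples to be dropped.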
For the `summaries.analysis.Analyzer`, I'm working on a tool that should detect duplicates in the training data. This seemingly trivial problem does pose some challenges in dealing with the resulting duplicates, however:
- `A` appears in the same form in different splits (training and validation). Does this count as a single duplicate, or two?
- `A` has the same reference text as sample `B`. In addition, `A` also has the same summary text as `C`. Again, the question of how to count duplicates and which samples to delete remains, but it is different this time, since `B` and `C` might be fully distinct, making this an even less obvious choice (or rather, less arbitrary).

A preliminary approach might be to simply count a sample multiple times if it has multiple problems, especially considering that this might not be a very frequent situation.
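The preliminary counting idea could look roughly like the sketch below. The function name `count_duplicate_problems` and the `(reference, summary)` tuple layout are hypothetical, and the sketch assumes samples from all splits are pooled into one list; it is not the `Analyzer` implementation.

```python
from collections import defaultdict
from typing import DefaultDict, List, Sequence, Tuple

# Hypothetical sample layout: (reference_text, summary_text).
Sample = Tuple[str, str]


def count_duplicate_problems(samples: Sequence[Sample]) -> List[int]:
    """For each sample, count how many duplicate groups it belongs to.

    A sample gets one count for a shared reference and one for a shared
    summary, so it can be counted multiple times for multiple problems.
    """
    by_reference: DefaultDict[str, List[int]] = defaultdict(list)
    by_summary: DefaultDict[str, List[int]] = defaultdict(list)
    for index, (reference, summary) in enumerate(samples):
        by_reference[reference].append(index)
        by_summary[summary].append(index)

    problems = [0] * len(samples)
    for groups in (by_reference, by_summary):
        for indices in groups.values():
            if len(indices) > 1:  # text occurs in more than one sample
                for index in indices:
                    problems[index] += 1
    return problems


if __name__ == "__main__":
    sample_a = ("ref 1", "sum 1")
    sample_b = ("ref 1", "sum 2")  # shares the reference with A
    sample_c = ("ref 3", "sum 1")  # shares the summary with A
    print(count_duplicate_problems([sample_a, sample_b, sample_c]))
```

In the second scenario above, `A` is counted twice while `B` and `C` are each counted once. A sample that appears unchanged in two splits is also counted twice per copy, once for the reference and once for the summary, which may or may not be the desired answer to the first question.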