dennlinger / summaries

A toolkit for summarization analysis and aspect-based summarizers
MIT License

How to correctly count duplicates? #34

Open dennlinger opened 1 year ago

dennlinger commented 1 year ago

For the summaries.analysis.Analyzer, I'm working on a tool that should detect duplicates in the training data. This rather trivial-looking problem does pose some challenges in dealing with the resulting duplicates, however:

  1. Duplicate data has to be held as a copy in memory, since we are not guaranteed that each sample has a unique identifier (especially not if it is an actual, accidental duplicate); adding our own UUIDs, however, would mean additional passes through the data. Hash functions can have collisions, and are generally not much different from storing the texts in a dictionary anyway.
  2. When encountering a duplicate, it might happen that more complex scenarios appear that prohibit easy clean-up:
    • Sample A appears in the same form in different splits (training and validation). Does this count as a single duplicate, or two?
    • As a follow-up, it is also unclear which one of the samples should be kept if we de-duplicate datasets. Should it be the one in the training set (meaning discarding the validation sample), or vice versa? Or should both be dropped?
    • Alternatively, as an example, assume that sample A has the same reference text as sample B, and in addition the same summary text as sample C. Again, the question of how to count duplicates and which samples to delete remains, but it is different this time, since B and C might be fully distinct, which makes the choice even less obvious (or rather, less arbitrary).

A preliminary approach might be to simply count a sample multiple times when it is involved in multiple duplication problems, especially considering that this might not be a very frequent issue.
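As an illustration of that counting scheme, here is a minimal sketch (not the actual Analyzer API; the `"reference"`/`"summary"` field names and the dict-of-splits layout are assumptions) that counts a sample once per field whose text was already seen, regardless of which split the earlier occurrence came from:

```python
from collections import defaultdict


def count_duplicates(splits):
    """
    Count duplicated samples across all splits of a dataset.

    `splits` maps a split name (e.g., "train", "validation") to a list of
    samples, where each sample is a dict with "reference" and "summary"
    text fields. A sample is counted once per field whose text was
    previously seen, so a single sample can contribute up to two counts.
    """
    seen = defaultdict(set)  # field name -> set of previously seen texts
    duplicate_count = 0

    for split_name, samples in splits.items():
        for sample in samples:
            for field in ("reference", "summary"):
                text = sample[field]
                if text in seen[field]:
                    duplicate_count += 1
                else:
                    seen[field].add(text)

    return duplicate_count
```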

dennlinger commented 1 year ago

For the Cleaner (#41), this could be made selectable through different strategies. The default (and currently the only supported one) is to record every encountered reference and summary text, and to skip any later sample that contains a previously seen text.
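A minimal sketch of that default behavior (again assuming dict-based samples with `"reference"` and `"summary"` fields; this is not the actual Cleaner implementation): the first occurrence of a text wins, in iteration order over the splits, and any later sample matching on either field is dropped.

```python
def clean_splits(splits):
    """
    De-duplicate dataset splits by keeping only the first occurrence.

    A sample is dropped if either its reference or its summary text has
    already been seen in any previously processed sample, so the surviving
    sample is simply the first one in iteration order.
    """
    seen_references, seen_summaries = set(), set()
    cleaned = {}

    for split_name, samples in splits.items():
        kept = []
        for sample in samples:
            if sample["reference"] in seen_references or sample["summary"] in seen_summaries:
                continue  # duplicate of a previously seen text; drop it
            seen_references.add(sample["reference"])
            seen_summaries.add(sample["summary"])
            kept.append(sample)
        cleaned[split_name] = kept

    return cleaned
```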