dennlinger / summaries

A toolkit for summarization analysis and aspect-based summarizers
MIT License

How to correctly count duplicates? #34

Open dennlinger opened 1 year ago

dennlinger commented 1 year ago

For the summaries.analysis.Analyzer, I'm working on a tool that should detect duplicates in the training data. This rather trivial-looking problem does pose some challenges in dealing with the resulting duplicates, however:

  1. Duplicate data has to be held as a copy in memory, since we are not guaranteed that each sample has a unique identifier (especially not if it is an actual, accidental duplicate); adding our own UUIDs, however, would mean additional passes through the data. Hash functions can have collisions, and are generally not much different from storing the texts in a dictionary anyway.
  2. When encountering a duplicate, it might happen that more complex scenarios appear that prohibit easy clean-up:
    • Sample A appears in the same form in different splits (training and validation). Does this count as a single duplicate, or two?
    • As a follow-up, it is also unclear which one of the samples should be kept if we de-duplicate datasets. Should it be the one in the training set (meaning discarding the validation sample), or vice versa? Or should both be dropped?
    • Alternatively, as an example, assume that sample A has the same reference text as sample B, and in addition the same summary text as sample C. Again, the question of how to count duplicates and which samples to delete remains, but it is different this time, since B and C might be fully distinct, which makes the choice even less obvious (or rather, less arbitrary).

A preliminary approach might be to simply count a sample multiple times when it is involved in multiple duplication problems, especially considering that this might not be a very frequent issue.
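As an illustration of that counting scheme, here is a minimal sketch (not the actual Analyzer API; the `"reference"`/`"summary"` field names and the dict-of-splits layout are assumptions) that counts a sample once per field whose text was already seen, regardless of which split the earlier occurrence came from:

```python
from collections import defaultdict


def count_duplicates(splits):
    """
    Count duplicated samples across all splits of a dataset.

    `splits` maps a split name (e.g., "train", "validation") to a list of
    samples, where each sample is a dict with "reference" and "summary"
    text fields. A sample is counted once per field whose text was
    previously seen, so a single sample can contribute up to two counts.
    """
    seen = defaultdict(set)  # field name -> set of previously seen texts
    duplicate_count = 0

    for split_name, samples in splits.items():
        for sample in samples:
            for field in ("reference", "summary"):
                text = sample[field]
                if text in seen[field]:
                    duplicate_count += 1
                else:
                    seen[field].add(text)

    return duplicate_count
```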

dennlinger commented 1 year ago

For the Cleaner (#41), this could be made selectable through different strategies. The default (and currently the only supported one) is to record every encountered reference and summary text, and to skip any later sample that contains a previously seen text.
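A minimal sketch of that default behavior (again assuming dict-based samples with `"reference"` and `"summary"` fields; this is not the actual Cleaner implementation): the first occurrence of a text wins, in iteration order over the splits, and any later sample matching on either field is dropped.

```python
def clean_splits(splits):
    """
    De-duplicate dataset splits by keeping only the first occurrence.

    A sample is dropped if either its reference or its summary text has
    already been seen in any previously processed sample, so the surviving
    sample is simply the first one in iteration order.
    """
    seen_references, seen_summaries = set(), set()
    cleaned = {}

    for split_name, samples in splits.items():
        kept = []
        for sample in samples:
            if sample["reference"] in seen_references or sample["summary"] in seen_summaries:
                continue  # duplicate of a previously seen text; drop it
            seen_references.add(sample["reference"])
            seen_summaries.add(sample["summary"])
            kept.append(sample)
        cleaned[split_name] = kept

    return cleaned
```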