As outlined in #37, this moves some functions towards a more sample-centric design. At the moment, duplicate detection is the only part of `Analyzer` that works on the full dataset, because `Cleaner` takes a different filtering approach.
Speaking of which, `Cleaner` is another utility that uses the primary functions from `Analyzer` to actually filter a dataset (or rather, several splits of the same dataset).
By default, it applies some light length-based filtering (e.g., removing samples whose summary is longer than the reference) and also looks for duplicates, although in a slightly different fashion than `Analyzer`, since it also has to handle removing them correctly.
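
To make the default behaviour more concrete, here is a rough sketch of the kind of filtering described above. It is not the actual `Cleaner` code: `is_valid_length` and `clean_splits` are hypothetical names, length is measured in characters, and duplicates are detected by exact match on the reference text with the first occurrence kept, all of which are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical sample layout for this sketch: (reference, summary)
Sample = Tuple[str, str]


def is_valid_length(reference: str, summary: str) -> bool:
    """Sample-level check: keep a sample only if its summary is not
    longer than its reference (character length is an assumption)."""
    return len(summary) <= len(reference)


def clean_splits(splits: Dict[str, List[Sample]]) -> Dict[str, List[Sample]]:
    """Filter several splits of one dataset.

    Duplicates are tracked across all splits, so a reference that already
    appeared in an earlier split is dropped from later ones (assumption:
    exact-match comparison, first occurrence wins).
    """
    seen_references: Set[str] = set()
    cleaned: Dict[str, List[Sample]] = {}
    for name, samples in splits.items():
        kept: List[Sample] = []
        for reference, summary in samples:
            if not is_valid_length(reference, summary):
                continue  # summary longer than reference
            if reference in seen_references:
                continue  # duplicate within or across splits
            seen_references.add(reference)
            kept.append((reference, summary))
        cleaned[name] = kept
    return cleaned
```

The length check only needs a single sample, which fits the sample-centric direction, while duplicate removal needs state shared across samples and splits. That is one plausible reason for it to be handled differently in the two utilities.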