Open raraz15 opened 3 weeks ago
Thank you for this duplicate analysis!
We could provide deduplicated alternatives for autotagging.tsv (--> autotagging_dedup.tsv) and all derivative tagging subsets (autotagging_top50tags.tsv, autotagging_genre.tsv, autotagging_instrument.tsv, and autotagging_moodtheme.tsv --> autotagging_top50tags_dedup.tsv, autotagging_genre_dedup.tsv, autotagging_instrument_dedup.tsv, and autotagging_moodtheme_dedup.tsv) and their splits.
However, for tagging this is not very critical, so we could start by creating autotagging_dedup.tsv first.
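As a rough illustration of how autotagging_dedup.tsv could be produced, here is a minimal sketch that copies a TSV file while skipping rows whose path is in a set of known duplicates. The function name `write_dedup_tsv`, the `path_column=3` default (assuming the relative audio path is the fourth tab-separated field), and the choice of which copy in each duplicate group to drop are all assumptions for illustration, not the maintainers' method.

```python
def write_dedup_tsv(src, dst, duplicate_paths, path_column=3):
    """Copy src to dst, dropping rows whose path field is in duplicate_paths.

    Assumption: the first line is a header and the relative audio path
    sits in column `path_column` (0-based). Rows are handled as raw lines,
    so a variable number of trailing tag columns is preserved untouched.
    Returns the number of data rows kept.
    """
    drop = set(duplicate_paths)
    kept = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        header = next(fin, None)
        if header is not None:
            fout.write(header)  # keep the header line as-is
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > path_column and fields[path_column] in drop:
                continue  # skip a row that points at a redundant copy
            fout.write(line)
            kept += 1
    return kept
```

The same function could be applied to each subset and split file in turn, passing the same set of paths to drop so that all derived files stay consistent.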
Description
The mtg-jamendo dataset contains multiple instances of duplicate audio files, which are bitwise exact copies but have different filenames. These duplicates might cause issues in applications that rely on data uniqueness, such as audio fingerprinting.
Steps to Reproduce
raw_30s directory.
Expected Behavior
Each audio file should be unique without any bitwise duplicates.
Actual Behavior
Out of 55,701 MP3 files in the raw_30s directory, a small percentage are found to be exact duplicates.
Examples of Duplicates
mtg-jamendo/raw_30s/audio/34/1056334.mp3 and mtg-jamendo/raw_30s/audio/41/1077641.mp3
mtg-jamendo/raw_30s/audio/34/1399334.mp3 and mtg-jamendo/raw_30s/audio/19/1389919.mp3
Additional Context
This issue may not affect all use cases, but it could be critical for applications that require distinct audio samples, such as training machine learning models or audio fingerprinting.
Suggested Fix
A thorough audit and removal of duplicate files, or at least documentation in the dataset metadata indicating the presence of duplicates.
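One way such an audit could be carried out is to hash every file and group files that share a digest. The sketch below assumes bitwise-identical copies (as described above) and uses SHA-256 from the Python standard library; the function names `file_digest` and `find_duplicate_groups` are hypothetical, and this is not necessarily how the duplicates in this report were found.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_groups(root, pattern="*.mp3"):
    """Group files under root by content digest.

    Returns a list of groups (lists of Path objects), keeping only
    digests shared by more than one file. Paths are visited in sorted
    order, so each group is sorted as well.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob(pattern)):
        by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicate_groups("mtg-jamendo/raw_30s/audio")` would return one list per set of identical files. This catches only byte-for-byte copies; re-encoded versions of the same recording would need an audio-level comparison instead.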