beetbox / beets

music library manager and MusicBrainz tagger
http://beets.io/
MIT License

duplicates: Default settings detect spurious duplicates when IDs are missing #586

Closed sampsyo closed 10 years ago

sampsyo commented 10 years ago

The duplicates plugin's grouping does not currently attempt to detect null fields. This results in duplicates being reported when several tracks (or albums) are missing metadata. For the default key set, for example, items that don't have album or track IDs set all appear to be duplicates. This is explainable but it's confusing output for the uninitiated.

Perhaps the plugin should not print out objects in the "null" grouping—those for which all the key fields are empty. Would that make sense to everybody?

varunagrawal commented 10 years ago

Hey, can I take this up? I'm a newbie to the project, so this seems like something I may be able to tackle.

pedros commented 10 years ago

@sampsyo: yeah, it makes complete sense. Regarding what constitutes a null field: I know there was some discussion about it viz. None vs. empty strings, etc. (well, that's certainly a whole lot of Latin abbreviations!) Anyway, the point is, how best to check for these?

@varunagrawal: feel free to tackle it if you want. The relevant functions would be _group and _duplicates.

sampsyo commented 10 years ago

Cool. I'm guessing the most predictable way to classify null values is to special-case only None and the empty string—other falsey values (including False) should probably not be considered nulls.
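A minimal sketch of that classification, assuming the grouping is held in a dict keyed by tuples of field values. The names `is_null` and `drop_null_group` are hypothetical and are not taken from the plugin:

```python
def is_null(value):
    """Treat only None and the empty string as null.

    Other falsey values such as 0 and False count as real values.
    """
    return value is None or value == ""


def drop_null_group(groups):
    """Drop groups whose key tuple consists entirely of null values.

    `groups` is assumed to map a tuple of key-field values to the list
    of objects sharing those values.
    """
    return {
        key: objs
        for key, objs in groups.items()
        if not all(is_null(v) for v in key)
    }


# Illustrative data only: three groups of two objects each.
groups = {
    ("", None): ["a", "b"],      # every key field is null: dropped
    ("id-1", ""): ["c", "d"],    # one field is set: kept
    (False, 0): ["e", "f"],      # falsey but not null: kept
}
kept = drop_null_group(groups)
```

With this rule, only the group where every key field is empty is suppressed; a group with at least one populated field is still reported.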

@varunagrawal Absolutely! Please ask questions if you need help getting familiar with beets internals. Or just open a pull request when you're ready.

ian-kelling commented 10 years ago

Yes, in my library I have 434 of the same duplicate with the default settings :(

pedros commented 10 years ago

@ian-kelling: did you delete a comment or something? I got an email with more information:

I have 6554 tracks, duplicates plugin says 6553 are duplicates. I do have a few hundred with no musicbrainz info, but that is a bit ridiculous.

But you now say you have 434 dupes. Which is it? Anyway, I'll need a few things:

  1. output of beet config
  2. hash of latest commit (or latest version if you're running from a release)
  3. output of beet info on two of the tracks you get reported as duplicates.

If I can see that information, I can see what's going wrong and fix it or help you fix it.

pedros commented 10 years ago

Any chance those few hundred tracks with no musicbrainz metadata are the duplicated ones? If so, then we should have a fix very shortly.

ian-kelling commented 10 years ago

Sorry for the delay. I will be responsive if you want me to test anything. The issue is still happening with the latest sources (Mar 11: d091c7e5b4ddc2cccd311796b0e676e2b2187193).

When I initially wrote 6553 duplicates, I was doing something wrong, which I quickly realized and changed.

$ beet config
directory: /i/music

import:
    log: /a/dt/beetlog.log
    move: yes
    quiet_fallback: skip

match:
    strong_rec_thresh: 0.07
library: /a/bin/data/musiclibrary.blb

plugins: discogs duplicates web
discogs:
    source_weight: 0.5
duplicates:
    album: no
    full: no
    format: ''
    keys: [mb_trackid, mb_albumid]
    move: no
    tag: no
    path: no
    copy: no
    count: no
    checksum:
    delete: no
web:
    host: ''
    port: 8337

ian-kelling commented 10 years ago

edit: grabbing the info plugin and output

ian-kelling commented 10 years ago

$ beet info 01\ Spring\ -\ Concerto\ #1\ in\ E\ major\ -\ 1\ -\ Allegro.flac
/i/music/Vivaldi/Four Seasons (1960 EMI 1988)/01 Spring - Concerto #1 in E major - 1 - Allegro.flac
title: Spring - Concerto #1 in E major - 1 - Allegro
artist: Vivaldi
artist_sort:
artist_credit:
album: Four Seasons (1960 EMI 1988)
albumartist:
albumartist_sort:
albumartist_credit:
genre: Classical
composer:
grouping:
year: 1988
month: 0
day: 0
track: 1
tracktotal: 0
disc: 0
disctotal: 0
lyrics:
comments:
bpm: 107
comp: False
mb_trackid:
mb_albumid:
mb_artistid:
mb_albumartistid:
albumtype:
label:
acoustid_fingerprint:
acoustid_id:
mb_releasegroupid:
asin:
catalognum:
script:
language:
country:
albumstatus:
media:
albumdisambig:
disctitle:
encoder:
rg_track_gain: -1.76
rg_track_peak: 0.791321
rg_album_gain: -0.59
rg_album_peak: 0.871582
original_year: 0
original_month: 0
original_day: 0
length: 197.466666667
bitrate: 716341
format: FLAC
samplerate: 44100
bitdepth: 16
channels: 2
album art: False

$ beet info 02\ Spring\ -\ Concerto\ #1\ in\ E\ major\ -\ 2\ -\ Largo.flac
/i/music/Vivaldi/Four Seasons (1960 EMI 1988)/02 Spring - Concerto #1 in E major - 2 - Largo.flac
title: Spring - Concerto #1 in E major - 2 - Largo
artist: Vivaldi
artist_sort:
artist_credit:
album: Four Seasons (1960 EMI 1988)
albumartist:
albumartist_sort:
albumartist_credit:
genre: Classical
composer:
grouping:
year: 1988
month: 0
day: 0
track: 2
tracktotal: 0
disc: 0
disctotal: 0
lyrics:
comments:
bpm: 1
comp: False
mb_trackid:
mb_albumid:
mb_artistid:
mb_albumartistid:
albumtype:
label:
acoustid_fingerprint:
acoustid_id:
mb_releasegroupid:
asin:
catalognum:
script:
language:
country:
albumstatus:
media:
albumdisambig:
disctitle:
encoder:
rg_track_gain: 8.78
rg_track_peak: 0.217041
rg_album_gain: -0.59
rg_album_peak: 0.871582
original_year: 0
original_month: 0
original_day: 0
length: 165.066666667
bitrate: 598113
format: FLAC
samplerate: 44100
bitdepth: 16
channels: 2
album art: False

pedros commented 10 years ago

@ian-kelling: right, that makes sense: we still haven't pushed the empty field fix. @varunagrawal wanted to tackle it, so we've been on hold for the moment. I'll wait another day, and then push a fix myself if necessary. In the meantime, you can either (1) grab mb_trackids for those 400-something tracks you have or (2) change the configuration option keys (or pass -k on the command line) to a list of other keys that should be unique. You could do, for example, -k artist album title, or for a more accurate metric, -C 'ffmpeg -i {file} -f crc -', which will decode the audio portion from each track and checksum it.
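To illustrate why switching keys helps, here is a rough sketch that groups the two Vivaldi tracks from the `beet info` output above, first by the default ID keys and then by artist, album and title. It is not the plugin's code: `duplicate_groups` is a made-up helper, and the tracks are reduced to plain dicts holding only the fields involved.

```python
from collections import defaultdict


def duplicate_groups(items, keys):
    """Group items by their values for `keys`; return groups of size > 1.

    Hypothetical helper for illustration; it performs no null check,
    which is the behaviour reported in this issue.
    """
    groups = defaultdict(list)
    for item in items:
        groups[tuple(item[k] for k in keys)].append(item["title"])
    return [titles for titles in groups.values() if len(titles) > 1]


# Field values copied from the two `beet info` dumps in this thread.
tracks = [
    {
        "title": "Spring - Concerto #1 in E major - 1 - Allegro",
        "artist": "Vivaldi",
        "album": "Four Seasons (1960 EMI 1988)",
        "mb_trackid": "",
        "mb_albumid": "",
    },
    {
        "title": "Spring - Concerto #1 in E major - 2 - Largo",
        "artist": "Vivaldi",
        "album": "Four Seasons (1960 EMI 1988)",
        "mb_trackid": "",
        "mb_albumid": "",
    },
]

# Default keys: both tracks share the key ('', ''), so they group together.
by_ids = duplicate_groups(tracks, ["mb_trackid", "mb_albumid"])

# Tag-based keys: the titles differ, so no group has more than one track.
by_tags = duplicate_groups(tracks, ["artist", "album", "title"])
```

Here `by_ids` holds one group containing both titles, while `by_tags` is empty, which is why passing other keys works around the problem until the empty-field fix lands.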

ian-kelling commented 10 years ago

Thanks. No rush on my account, I've gotten the duplicates in my collection figured out :)

pedros commented 10 years ago

Fixed with e8f6781.