Generating checksums with duplicates plugin fails

lhupitr commented 6 years ago

Problem

I can't get the duplicates plugin to generate checksums (neither CRC nor md5sum) when following the examples suggested in the documentation: https://beets.readthedocs.io/en/stable/plugins/duplicates.html

Running this command in verbose (-vv) mode:

$ beet -vv duplicates -C 'ffmpeg -i {file} -f crc -' title:"Blank Generation"

Led to this problem:

Sending event: pluginload
artresizer: method is (2, (7, 0, 7))
thumbnails: using IM to write metadata
thumbnails: using GIO to compute URIs
inline: adding item field multidisc
library database: /net/nfs4/server/mediacentre/Music/Catalogs/beets_seeding.blb
library directory: /net/nfs4/server/mediacentre/Music/Library
Sending event: library_opened
duplicates: key ffmpeg on item /net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac not cached:computing checksum
duplicates: failed to checksum /net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac: Command 'ffmpeg -i b'/net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac' -f crc -' returned non-zero exit status 1.
duplicates: all keys ['ffmpeg'] on item /net/nfs4/server/mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac are null or empty: skipping
Sending event: cli_exit

However running directly from ffmpeg produces a checksum and exits with 0:

$ ffmpeg -i '/media/mediacentre/Mediacentre/Music/Library/Richard Hell & The Voidoids/(1977) Blank Generation/07 Blank Generation.flac' -f crc -

...
    Metadata:
      encoder         : Lavc57.107.100 pcm_s16le
    Side data:
      replaygain: track gain - -4.800000, track peak - 0.000023, album gain - -3.150000, album peak - 0.000023, 
Invalid return value 0 for stream protocol
CRC=0xfd05c9a4
size=       0kB time=00:02:45.20 bitrate=   0.0kbits/s speed= 354x    
video:0kB audio:28458kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown

Setup

OS: Arch linux
Python version: 3.6.4
beets version: 1.4.6
Turning off plugins made problem go away (yes/no): N/A

My configuration (output of beet config) is:

importfeeds:
    relative_to: !!binary |
        L25ldC9uZnM0L3NlcnZlci9tZWRpYWNlbnRyZS9NdXNpYy9MaWJyYXJ5
    formats: m3u
    dir: /media/mediacentre/Mediacentre/Music/Playlists/
    m3u_name: imported.m3u
    absolute_path: no
lyrics:
    bing_lang_from: []
    auto: no
    google_API_key: REDACTED
    sources: google
    bing_client_secret: REDACTED
    bing_lang_to:
    google_engine_ID: REDACTED
    genius_api_key: REDACTED
    fallback:
    force: no
    local: no
directory: /net/nfs4/server/mediacentre/Music/Library
library: /net/nfs4/server/mediacentre/Music/Catalogs/beets_seeding.blb

plugins: inline edit fromfilename zero rewrite fuzzy info missing random types replaygain smartplaylist duplicates badfiles play chroma absubmit acousticbrainz mbsync mbsubmit lastimport embedart fetchart thumbnails importfeeds lyrics wlg popularity
pluginpath: ~/projects/whatlastgenre/plugin/beets/beetsplug
per_disc_numbering: yes
threaded: no

import:
    write: yes
    copy: yes
    hardlink: no
    autotag: yes
    incremental: no
    resume: yes
    quiet_fallback: skip
    none_rec_action: ask
    timid: yes
    log: ~/.config/beets/beetslog.txt
    detail: yes
    bell: yes
    from_scratch: yes

replace:
    '[\\/]': _
    ^\.: _
    '[\x00-\x1f]': _
    \.$: _
    \s+$: ''
    ^\s+: ''
    ^-: _

ui:
    color: yes

paths:
    default: $albumartist_credit/($original_year) $album%aunique{albumartist album year,albumtype label catalognum albumdisambig}/%if{$multidisc,Disc${disc}%if{$disctitle, - $disctitle}/}$track $title
    singleton: $artist/-Misc-/$title
    comp: Various Artists/$album%aunique{albumartist album year,albumtype label catalognum albumdisambig} ($year)/%if{$multidisc,Disc$disc%if{$disctitle, - $disctitle}/}$track $title
    ext:cue: $albumpath/$artist - $album
    ext:log: $albumpath/$artist - $album
    ext:jpg: $albumpath/cover
    ext:jpeg: $albumpath/cover
    ext:png: $albumpath/cover
item_fields:
    multidisc: 1 if disctotal > 1 else 0

match:
    strong_rec_thresh: 0.04
    medium_rec_thresh: 0.25
    rec_gap_thresh: 0.25
    max_rec:
        missing_tracks: medium
        unmatched_tracks: medium

musicbrainz:
    searchlimit: 5
art_filename: cover
fetchart:
    auto: no
    cautious: yes
    minwidth: 450
    maxwidth: 1010
    enforce_ratio: 1.5%
    sources: filesystem albumart amazon wikipedia coverart fanarttv google
    google_key: REDACTED
    store_source: yes
    cover_names:
    - cover
    - front
    - art
    - album
    - folder
    google_engine: 001442825323518660753:hrh5ch1gjzm
    fanarttv_key: REDACTED
embedart:
    auto: no
    maxwidth: 300
    compare_threshold: 0
    ifempty: no
    remove_art_file: no
thumbnails:
    auto: yes
    force: no
    dolphin: no
replaygain:
    auto: yes
    backend: bs1770gain
    overwrite: yes
    targetlevel: 89
    r128: [Opus]
    chunk_at: 5000
    method: replaygain
wlg:
    auto: no
    force: yes
    count: 4
    whitelist: /home/user/.whatlastgenre/genres.txt
    separator: ', '
mpdstats:
    rating_mix: 0.75
zero:
    auto: yes
    fields: images
    update_database: yes
    keep_fields: []
smartplaylist:
    relative_to: /net/nfs4/server/mediacentre/Music/Library/
    playlist_dir: /net/nfs4/server/mediacentre/Music/Playlists/
    playlists: [{name: Loved.m3u, query: 'loved:1'}, {name: Genre_acoustic-country-folk-americana.m3u, album_query: ['genre:Folk', 'genre:Country', 'genre:Acoustic', 'genre:Americana']}]
    auto: yes
types:
    loved: int
    play_count: int
lastfm:
    user: user
    api_key: REDACTED
play:
    command: mpv
    use_folders: no
    relative_to:
    raw: no
    warning_threshold: 100
copyartifacts:
    extensions: .cue .log .jpg .jpeg .png
    print_ignored: yes
acousticbrainz:
    auto: no
    force: no
    tags: []
absubmit:
    auto: no
    extractor: /usr/bin/streaming_extractor_music
chroma:
    auto: yes
acoustid:
    apikey: REDACTED
mbsubmit:
    format: $track. $title - $artist ($length)
    threshold: medium
lastimport:
    per_page: 500
    retry_limit: 3
duplicates:
    album: no
    checksum: ''
    copy: ''
    count: no
    delete: no
    format: ''
    full: no
    keys: []
    merge: no
    move: ''
    path: no
    tiebreak: {}
    strict: no
    tag: ''
fuzzy:
    prefix: '~'
    threshold: 0.7
missing:
    count: no
    total: no
    album: no
pathfields: {}
album_fields: {}
rewrite: {}
edit:
    albumfields: album albumartist
    itemfields: track title artist album
    ignore_fields: id path

sampsyo commented 6 years ago

Interesting! I don't have an obvious explanation for what could be going wrong—maybe someone (perhaps you) can do some digging to find out what ffmpeg's output is and why it's exiting with an error when we invoke it.

lhupitr commented 6 years ago

As mentioned running the ffmpeg command directly produces a crc checksum without a non zero exit status, also the problem seems to be beyond ffmpeg as the same error occurs with the suggested md5sum command (which also functions fine when executed directly on the command line).

sampsyo commented 6 years ago

Right—so the question is what it’s doing (and what its output is) when it’s invoked by beets instead of manually.

ghost commented 5 years ago

So, maybe this is a Python thing, but I find this suspicious:

duplicates: failed to checksum /home/ser/Music/Incoming/Electric Light Orchestra/All Over The World_ The Very Best Of ELO [13]/18 Strange Magic.mp3: Command 'md5sum b'/home/ser/Music/Incoming/Electric Light Orchestra/All Over The World_ The Very Best Of ELO [13]/18 Strange Magic.mp3'' returned non-zero exit status 1.

Same as in OP -- what's with the spurious "b" in front of the file name? No matter which command I pass in (sha512sum, md5sum, ffmpeg), they all exit status 1 -- and if duplicates really is putting a "b" in there, no wonder, because that's not the {file}name.

Edit: copy/paste included some newlines; removed those for accuracy.

schmod commented 5 years ago

Indeed, that seems to be the problem. In Python 3, because the file path is a bytes object instead of a string, it's being passed to the shell as `"b'/path/to/file'".

Changing the relevant line from:

args = [p.format(file=item.path) for p in shlex.split(prog)]

to

args = [p.format(file=item.path.decode('utf-8')) for p in shlex.split(prog)]

seems to fix things.

A few problems remain, however. Both md5sum and sha512sum include the file's path in their output:

$ md5sum ~/my/file.mp3
ea187811890ede95aa618ecba4f27f57  ./my/file.mp3

Because beets uses this output to determine duplicates, it's never going to mark anything as a duplicate.

Additionally, because beets caches the checksums (using the first argument of the command), if you somehow mistype your checksum command, once you've cached bad fingerprints from md5sum, you're stuck with them forever.

sampsyo commented 5 years ago

Thanks for investigating! It seems like you're on the right track. However, not all filenames are encoded with UTF-8, so just using a hard-coded .decode('utf-8') will throw exceptions and produce incorrect output sometimes. The right way to do this will probably be to turn the template into bytes and interpolate on that—because the final command will need to be bytes, not Unicode strings.

That problem with including filenames in the output does seem bad! Maybe we should change the advice to instead recommend that people somehow pipe data into md5sum's standard input rather than passing the filename on its command line?

stale[bot] commented 4 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

lhupitr commented 4 years ago

This is still a problem for me.

stale[bot] commented 4 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

lhupitr commented 4 years ago

I'm not sure what needs to be done here, just keeping it open.

wisp3rwind commented 4 years ago

Thanks for investigating! It seems like you're on the right track. However, not all filenames are encoded with UTF-8, so just using a hard-coded .decode('utf-8') will throw exceptions and produce incorrect output sometimes. The right way to do this will probably be to turn the template into bytes and interpolate on that—because the final command will need to be bytes, not Unicode strings.

I think this issue is solved in the convert plugin: https://github.com/beetbox/beets/blob/b659ad6b0c7e7be35f6d39df09b740b4ed69f5f5/beetsplug/convert.py#L207-L233 So the solution is probably to move this to a separate function in beets.util and apply that in the duplicates plugin.

That problem with including filenames in the output does seem bad! Maybe we should change the advice to instead recommend that people somehow pipe data into md5sum's standard input rather than passing the filename on its command line?

I think that's not easily doable without writing a separate script and passing it as the argument to -C. Piping effectively requires running the command through a shell with all the implications about proper escaping. For example, the following (untested!) incantations might work, but I guess they're not nice advice for the docs due to the amount of escaping required (which means it's non-trivial to modify these commands):

beet dup -C 'sh -c "md5sum < \"$1\"" {file}' ...
beet dup -C 'sh -c "md5sum \"$1\" | awk '"'"'{print $1}'"'"'" {file}' ...

Note that if the plugin would run the -Cargument through command_output(..., shell=True), the {file} itself would need to be quoted properly, which doesn't exactly simplify things.

So maybe the advice should be to create a script

#! /usr/bin/env sh

md5sum < "$1"

or

#! /usr/bin/env sh

md5sum "$1" | awk '{printf $1}'

and use it with

beet dup -C 'myscript {file}' ...

I'll remove the needinfo label since I think that the above implies that there's a clear path forward. I won't implement any of this myself, though, I'm not familiar with the duplicates plugin at all and don't use it myself.

sampsyo commented 4 years ago

Thanks, @wisp3rwind! I think you have the right fix there.

zwiebelspaetzle commented 3 years ago

This is an issue for me as well.

arogl commented 3 years ago

I have replaced line 200 with the following block and it is now computing checksums, currently testing on Ubuntu 20.04, Python 3.8

        if not six.PY2:
            if platform.system() == 'Windows':
                args = [p.format(file=item.path.decode(util._fsencoding()))
                        for p in shlex.split(prog)]
            else:
                args = [p.format(file=item.path.decode(util.arg_encoding(),
                        'surrogateescape')) for p in shlex.split(prog)]

I tried to add a

prog = prog.decode(util.arg_encoding(), 'surrogateescape'))

but I got an error:

AttributeError: 'str' object has no attribute 'decode'

I am not sure if the prog needs the decoding?

Thoughts?

beetbox / beets

Generating checksums with duplicates plugin fails #2873

Problem

Setup