Closed lhupitr closed 3 years ago
Interesting! I don't have an obvious explanation for what could be going wrong—maybe someone (perhaps you) can do some digging to find out what ffmpeg's output is and why it's exiting with an error when we invoke it.
As mentioned, running the ffmpeg command directly produces a CRC checksum and exits with status 0. The problem also seems to go beyond ffmpeg, since the same error occurs with the suggested md5sum command (which also works fine when executed directly on the command line).
Right—so the question is what it’s doing (and what its output is) when it’s invoked by beets instead of manually.
So, maybe this is a Python thing, but I find this suspicious:
duplicates: failed to checksum /home/ser/Music/Incoming/Electric Light Orchestra/All Over The World_ The Very Best Of ELO [13]/18 Strange Magic.mp3: Command 'md5sum b'/home/ser/Music/Incoming/Electric Light Orchestra/All Over The World_ The Very Best Of ELO [13]/18 Strange Magic.mp3'' returned non-zero exit status 1.
Same as in OP -- what's with the spurious "b" in front of the file name? No matter which command I pass in (sha512sum, md5sum, ffmpeg), they all exit with status 1 -- and if duplicates really is putting a "b" in there, no wonder, because that's not the {file} name.
Edit: copy/paste included some newlines; removed those for accuracy.
Indeed, that seems to be the problem. In Python 3, because the file path is a bytes object instead of a string, it's being passed to the shell as `"b'/path/to/file'"`.
Changing the relevant line from:
args = [p.format(file=item.path) for p in shlex.split(prog)]
to
args = [p.format(file=item.path.decode('utf-8')) for p in shlex.split(prog)]
seems to fix things.
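For anyone who wants to see the stray "b" in isolation, here is a minimal reproduction. It assumes a made-up path and command template rather than the plugin's real data, and it assumes the path happens to be valid UTF-8:

```python
import shlex

prog = "md5sum {file}"                 # example checksum command template
path = b"/music/18 Strange Magic.mp3"  # hypothetical path, stored as bytes

# Formatting bytes into a str template embeds the bytes repr,
# including the b prefix and the quotes.
broken = [p.format(file=path) for p in shlex.split(prog)]

# Decoding first yields the real file name (UTF-8 is an assumption here).
fixed = [p.format(file=path.decode("utf-8")) for p in shlex.split(prog)]

print(broken)
print(fixed)
```

The second argument in `broken` is the literal text `b'/music/18 Strange Magic.mp3'`, which names no existing file, so any checksum tool exits with an error.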
A few problems remain, however. Both md5sum and sha512sum include the file's path in their output:
$ md5sum ~/my/file.mp3
ea187811890ede95aa618ecba4f27f57 ./my/file.mp3
Because beets uses this output to determine duplicates, it's never going to mark anything as a duplicate.
Additionally, because beets caches the checksums (using the first argument of the command), if you somehow mistype your checksum command, once you've cached bad fingerprints from md5sum, you're stuck with them forever.
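To illustrate the path-in-output problem, here is a rough sketch. It uses hashlib as a stand-in for the real sha512sum tool and two throwaway files with identical content; the helper name is made up for this example:

```python
import hashlib
import os
import tempfile

def sha512sum_style_output(path):
    """Mimic the '<digest>  <path>' line printed by sha512sum (hashlib stands in)."""
    with open(path, "rb") as f:
        digest = hashlib.sha512(f.read()).hexdigest()
    return f"{digest}  {path}"

with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.mp3")
    b = os.path.join(tmp, "b.mp3")
    for p in (a, b):
        with open(p, "wb") as f:
            f.write(b"same audio bytes")
    out_a = sha512sum_style_output(a)
    out_b = sha512sum_style_output(b)

# Comparing whole lines fails because the paths differ...
whole_lines_match = out_a == out_b
# ...while comparing only the first field (the digest) succeeds.
digests_match = out_a.split()[0] == out_b.split()[0]
print(whole_lines_match, digests_match)
```

The two files are byte-for-byte identical, yet the full output lines differ. Only the digest field is usable as a duplicate key.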
Thanks for investigating! It seems like you're on the right track. However, not all filenames are encoded with UTF-8, so just using a hard-coded .decode('utf-8') will throw exceptions and produce incorrect output sometimes. The right way to do this will probably be to turn the template into bytes and interpolate on that, because the final command will need to be bytes, not Unicode strings.
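One way the bytes-level interpolation could look is sketched below. `build_args` is a hypothetical helper, not the plugin's code. It assumes a POSIX system, where commands may be passed to subprocess as bytes, and it does a plain placeholder replacement, so str.format features such as `{{` escapes are not handled:

```python
import os
import shlex

def build_args(prog, path):
    """Split a str command template and substitute {file} at the bytes level.

    Hypothetical helper for illustration only.
    """
    placeholder = b"{file}"
    return [os.fsencode(part).replace(placeholder, path)
            for part in shlex.split(prog)]

# A Latin-1 encoded name that is not valid UTF-8.
path = b"/music/caf\xe9.mp3"
args = build_args("md5sum {file}", path)
print(args)

# The hard-coded decode fails on the same path.
try:
    path.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```

Because the path is never decoded, its original bytes reach the command unchanged whatever encoding the file name uses.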
That problem with including filenames in the output does seem bad! Maybe we should change the advice to instead recommend that people somehow pipe data into md5sum's standard input rather than passing the filename on its command line?
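On the plugin side, feeding a file to a command's standard input does not strictly need a shell. The sketch below is an assumption about how that could be done, not how beets does it. `checksum_via_stdin` is a hypothetical name, and a Python one-liner stands in for md5sum so the example runs anywhere:

```python
import hashlib
import os
import subprocess
import sys
import tempfile

def checksum_via_stdin(args, path):
    """Run args with the file connected to stdin; return stripped stdout as bytes."""
    with open(path, "rb") as f:
        result = subprocess.run(args, stdin=f, stdout=subprocess.PIPE, check=True)
    return result.stdout.strip()

# Portable stand-in for a checksum tool: hash whatever arrives on stdin.
hasher = [
    sys.executable, "-c",
    "import hashlib, sys; "
    "print(hashlib.sha256(sys.stdin.buffer.read()).hexdigest())",
]

data = b"same audio bytes"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "track.mp3")
    with open(path, "wb") as f:
        f.write(data)
    digest = checksum_via_stdin(hasher, path)

expected = hashlib.sha256(data).hexdigest().encode()
print(digest == expected)
```

Since the command never sees the file name, its output cannot depend on the path.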
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is still a problem for me.
I'm not sure what needs to be done here, just keeping it open.
Thanks for investigating! It seems like you're on the right track. However, not all filenames are encoded with UTF-8, so just using a hard-coded .decode('utf-8') will throw exceptions and produce incorrect output sometimes. The right way to do this will probably be to turn the template into bytes and interpolate on that, because the final command will need to be bytes, not Unicode strings.
I think this issue is solved in the convert plugin:
https://github.com/beetbox/beets/blob/b659ad6b0c7e7be35f6d39df09b740b4ed69f5f5/beetsplug/convert.py#L207-L233
So the solution is probably to move this to a separate function in beets.util and apply that in the duplicates plugin.
That problem with including filenames in the output does seem bad! Maybe we should change the advice to instead recommend that people somehow pipe data into md5sum's standard input rather than passing the filename on its command line?
I think that's not easily doable without writing a separate script and passing it as the argument to -C. Piping effectively requires running the command through a shell with all the implications about proper escaping.
For example, the following (untested!) incantations might work, but I guess they're not nice advice for the docs due to the amount of escaping required (which means it's non-trivial to modify these commands):
beet dup -C 'sh -c "md5sum < \"$1\"" sh {file}' ...
beet dup -C 'sh -c "md5sum \"$1\" | awk '"'"'{{print $1}}'"'"'" sh {file}' ...
Note that if the plugin ran the -C argument through command_output(..., shell=True), the {file} itself would need to be quoted properly, which doesn't exactly simplify things.
So maybe the advice should be to create a script
#! /usr/bin/env sh
md5sum < "$1"
or
#! /usr/bin/env sh
md5sum "$1" | awk '{printf $1}'
and use it with
beet dup -C 'myscript {file}' ...
I'll remove the needinfo label since I think that the above implies that there's a clear path forward. I won't implement any of this myself, though: I'm not familiar with the duplicates plugin at all and don't use it.
Thanks, @wisp3rwind! I think you have the right fix there.
This is an issue for me as well.
I have replaced line 200 with the following block, and it is now computing checksums. I'm currently testing on Ubuntu 20.04 with Python 3.8.
if not six.PY2:
    if platform.system() == 'Windows':
        args = [p.format(file=item.path.decode(util._fsencoding()))
                for p in shlex.split(prog)]
    else:
        args = [p.format(file=item.path.decode(util.arg_encoding(),
                                               'surrogateescape'))
                for p in shlex.split(prog)]
I tried to add a
prog = prog.decode(util.arg_encoding(), 'surrogateescape')
but I got an error:
AttributeError: 'str' object has no attribute 'decode'
I am not sure if the prog needs the decoding?
Thoughts?
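As the AttributeError suggests, prog is already a str in Python 3, so only the bytes path needs decoding. A small sketch of both points, using a made-up non-UTF-8 path and assuming UTF-8 as the argument encoding:

```python
prog = "md5sum {file}"            # already text, so there is nothing to decode
has_decode = hasattr(prog, "decode")

raw = b"/music/caf\xe9.mp3"       # hypothetical path that is not valid UTF-8

# surrogateescape maps undecodable bytes to lone surrogates instead of raising...
text = raw.decode("utf-8", "surrogateescape")
# ...and encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", "surrogateescape")

print(has_decode, round_trip == raw)
```

This round trip is why the 'surrogateescape' handler is a safer choice than a bare .decode('utf-8') for paths of unknown encoding.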
Problem
I can't get the duplicates plugin to generate checksums (neither CRC nor md5sum) when following the examples suggested in the documentation: https://beets.readthedocs.io/en/stable/plugins/duplicates.html
Running this command in verbose (-vv) mode:
Led to this problem:
However, running directly from ffmpeg produces a checksum and exits with 0:
Setup
My configuration (output of beet config) is: