Closed bmcfee closed 11 months ago
Confirmed that this is due to this line:
https://github.com/fatiando/pooch/blob/a965902d26015453ac00269597a23b83d85db644/pooch/core.py#L660
For example, the 1.6.0 behavior would have been:
In [2]: line = "Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"
In [3]: line.split()
Out[3]: ["Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"]
In 1.7, it becomes:
In [11]: shlex.split(line)
Out[11]: ['Karissa_Hobbs_-_Lets_Go_Fishin.ogg']
(note the missing single-quotes)
A quick fix here is to use shlex.quote
:
In [12]: shlex.split(shlex.quote(line))
Out[12]: ["Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"]
I don't know if this will be safe across platforms, but it would at least fix the bug I'm seeing.
For the reference: The original bug that let to using shlex
is https://github.com/fatiando/pooch/issues/313.
I think the use of shlex.quote
would still fix #313:
In [2]: shlex.split(shlex.quote("foo' bar baz'.txt"))
Out[2]: ["foo' bar baz'.txt"]
Ok, I started prepping a PR for this and realized that it's a bit more subtle than just sticking shlex.quote in there, since it will apply to the entire line. If a filename ends with a quote, it will muck up the space parsing, and the whole parser falls apart.
The problem as I see it boils down to some ambiguity in the format of the registry file, in particular due to the option of custom urls as a third token. We don't know in advance how many spaces there will be in a line, and that makes splitting tricky.
If we didn't have custom urls, it would be enough to do:
filename, hash = line.rsplit(' ', maxsplit=2)
where rsplit will group all but the last space separator into filename
.
If we always have a custom URL, it would be equally simple:
filename, hash, url = line.rsplit(' ', maxsplit=3)
but this will be incorrect if the url is absent and the filename contains a space.
What we really need is a way to first check if the final token is a hash or a url, and then act accordingly:
# Assuming URLs in registry files are properly URL-encoded, i.e. ' ' → '%20'
prefix, last = line.rsplit(' ', maxsplit=2)
if is_url(last):
url = last
filename, hash = prefix.split(' ', maxsplit=2)
else:
filename, hash, url = prefix, last, None
Now, there's a problem that URLs (per RFC 3986) if not fully formed can look an awful lot like hashes. We can't rely on detecting just a :
because this is also valid in hashes (md5:abcdef...
).
We also can't get away with detecting :/
, or even ://
because this would rule out DOI urls.
I suspect the simplest solution would be to explicitly match URLs against a subset of supported schemes (http:, https:, ftp:, sftp:, doi:) and treat anything else as if it's a hash. If it's not actually a hash, this will eventually fail later on.
Does this sound reasonable? If so, I'd be happy to prep a PR implementing the above strategy.
Hi @bmcfee, thanks for opening this issue.
I'm sorry that the last release broke backward compatibility. This is something we always try to avoid through exhaustive testing, but as you could tell we weren't considering the case where the filenames contain apostrophes.
I think the issue you are facing is part of a larger discussion: what type of standard do we want to use for filenames within Pooch. After we merged #315 I think we inadvertently established that we would follow the POSIX standard (see Parsing rules in shlex when POSIX is True).
Therefore, if a filename contains a special character, like a space or an apostrophe, we can always use the backslash to escape it. In your particular case, you can modify your registry file to:
Karissa_Hobbs_-_Let\'s_Go_Fishin\'.hq.ogg ...
Karissa_Hobbs_-_Let\'s_Go_Fishin\'.ogg ...
Karissa_Hobbs_-_Let\'s_Go_Fishin\'.txt ...
and it should work. We use the same strategy for spaces, as was introduced in #315.
If we try to modify the load_registry()
method after every time we encounter a corner case that isn't supported, I think we'll be modifying its code after every release, which isn't sustainable and it's prone to introduce more backward compatibility breaks.
So, in my opinion, special characters like spaces, double quotes or apostrophes in filenames should be handled by inserting a backslash before them.
What do you think? Does it sounds like a good compromise?
Thanks @santisoler - I agree with that assessment. I'm totally happy if pooch wants to explicitly adopt POSIX-style syntax for filenames, and I think the confusion only arose because the syntax was implicitly defined (1.7) or under-specified (<1.6). I opened the PR because it wasn't clear if this was intended behavior or a bug to be fixed, but if it's intended behavior we can close it out.
I think what this means for my project in particular is that we need to put out a post-release to upper-bound the supported pooch versions, and then revise our registry file with an updated lower bound.
One question though: how does this formatting work when creating registry files from a pooch object? I don't see any logic for escaping filenames on write (make_registry
) in #315 , and from a glance at https://github.com/fatiando/pooch/blob/a965902d26015453ac00269597a23b83d85db644/pooch/hashes.py#L218-L222 I expect it will not work as expected currently.
@bmcfee thanks for being very understanding with this. I think it's best to assume the POSIX syntax is intended (though it was originally an oversight). make_registry
wasn't updated to work with the new syntax, though, as you pointed out. That still needs to be done.
No worries - sounds like a good plan to me!
Alright, closing this one since we have #374 and #369 to handle the issues identified here. Thanks again for reporting @bmcfee!
Description of the problem:
This just popped up in the librosa test suite after the environments upgraded from pooch 1.6 to 1.7.
We have a registry of data files that are fetched by pooch, as listed here: https://github.com/librosa/librosa/blob/c800e74f6a6ec5c27e0fa978d7355943cce04359/librosa/util/example_data/registry.txt Note that three of the files (lines 22-24) contain an apostrophe:
In pooch 1.6, this works as expected. However, after upgrading to 1.7, reading these files fails (see below). I believe this is a regression caused by #315 .
Full code that generated the error Minimal example via librosa:
This will call out to our example loader: https://github.com/librosa/librosa/blob/c800e74f6a6ec5c27e0fa978d7355943cce04359/librosa/util/files.py#L43-L98
Full error message
System information
conda list
below:Output of conda list (it's bloated, but you asked for it :))