LibraryOfCongress / bagit-python

Work with BagIt packages from Python.
http://libraryofcongress.github.io/bagit-python
216 stars 85 forks source link

Filename character sequence with tilde sometimes causes false error on Linux #152

Open akubot opened 3 years ago

akubot commented 3 years ago

Observed on Linux, a failure at line 931 of bagit.py in git commit hash code bff20d2

RHEL 7.5 (Maipo), Linux 3.10.0-1160.15.2.el7.x86_64 GNU/Linux

Python 3.7.4 using miniconda3

File names containing this 4-character sequence cause an "is unsafe" false error due to expansion of the tilde:

~$_-

To reproduce, create a file name such as

data/blah/~$_-expect_fail.doc

In Python v3, do this

import os, sys
f = "data/blah/~$_-expect_fail.doc"

# Note that Linux home directory is inserted in expansion
os.path.expandvars(f) 
'data/blah/~/home/users/jdoe-expect_fail.doc'

A similar test with filename that does not have the hyphen after the underscore

data/blah/~$_expect_success.doc

works OK, i.e. without unwanted expansion of the tilde.

acdha commented 3 years ago

That's certainly quite odd — does it reproduce if you simply run python -c 'import os.path; os.path.expanduser("data/blah/~$_-expect_fail.doc")'? Looking at the stdlib implementation, my first reaction is surprise that this isn't failing the getpwnam() lookup:

https://github.com/python/cpython/blob/3.8/Lib/posixpath.py#L228-L271

akubot commented 3 years ago

I modified your test slightly, it confirms the problem.

I changed it to expandvars, which is the one failing. and also wrapped it with a print()

The original command you posted does not throw an error but also doesn't print anything.

$ python3 -c 'import os.path; print(os.path.expandvars("data/blah/~$_-expect_fail.doc"))'
data/blah/~/home/users/jdoe-expect_fail.doc

I don't have an explanation for the issue, but somehow these 4 characters together expand on RHEL anyway.

acdha commented 3 years ago

Ah, I missed that and it explains the problem: $_ is a shell variable (last argument of the previous command, if memory serves), and that would explain why you're seeing this depending on the next character since it has to be something which stops parsing at the next character.

I'm wondering whether we want to remove that part of #88 on the theory that this is being overzealous. I do not want to be in the business of having to try to maintain a separate implementation of expandvars.

akubot commented 3 years ago

It looks like it's slightly more subtle -- the sequence "$_-" also causes the problem

$ python3 -c 'import os.path; print(os.path.expandvars("data/blah/$_-expect_fail.doc"))'
data/blah/~/home/users/jdoe-expect_fail.doc

while just "$_" does not

$ python3 -c 'import os.path; print(os.path.expandvars("data/blah/$_expect_success.doc"))'
data/blah/$_expect_success.doc

and similarly for just "~$_"

python3 -c 'import os.path; print(os.path.expandvars("data/blah/~$_expect_success.doc"))'
data/blah/~$_expect_success.doc

So it appears the hyphen helps trigger this. But that makes me think there are other cases lurking.

Regarding the bagit.py source code, I'm not sure what the correct choice would be. I suppose you could put some guard around the expandvars code, to check for the character sequences that mess it up on Linux?

But I'm not sure specifically what you would do, because you are currently matching a possibly expanded filename against the real one.

Probably better to discuss with regular contributors, and ask whether to back out the expandvars snippet completely? Then it should always pass on Linux.

BTW, a Windows user verified this issue does not occur there, which makes sense because this is a Linux expansion.

acdha commented 3 years ago

I was thinking that it might be good to discuss this with @runderwood to see whether we have a compelling reason to keep expandvars in here.

This variable is set by at least the popular Bourne shells and ones like Fish which try to be compatible so I suppose we could purge $_ at startup but that seems like it's just asking for some unexpected dependency to show up later.

runderwood commented 3 years ago

I don't really recall why the expandvars call is there, to be honest.

While I can see the reasoning that it's safer to disallow potentially dangerous variable expansion, I believe it's system-dependent so you don't really enforce a meaningful universal policy doing it this way.

I don't have a compelling argument for keeping it there if it's causing problems.