Closed WGH- closed 4 years ago
Could you run git ls-files -z > /tmp/git.output
and attach the /tmp/git.output file?
b'test.\xc2\x00'
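For anyone reproducing this, here is a minimal sketch of reading that output as raw bytes and checking each path for strict UTF-8 validity. The helper names `split_nul_terminated` and `is_valid_utf8` are made up for illustration and are not part of git-archive-all.

```python
def split_nul_terminated(raw):
    """Split NUL-terminated output (as produced by `git ls-files -z`) into raw byte paths."""
    # Every entry ends with b"\0", so the final split chunk is empty and is dropped.
    return [chunk for chunk in raw.split(b"\0") if chunk]


def is_valid_utf8(path):
    """Return True if the raw path decodes as strict UTF-8."""
    try:
        path.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


raw_output = b"test.\xc2\x00"  # the bytes reported above
for path in split_nul_terminated(raw_output):
    print(path, "valid UTF-8" if is_valid_utf8(path) else "not valid UTF-8")
```

The trailing `\x00` is only the `-z` terminator; the file name itself is `test.` followed by the lone byte `\xc2`, which is an incomplete UTF-8 sequence.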
Interesting. Could you also try archiving it with plain git archive
into tar, gz, zip and bz2 formats and attach those files?
git-archive --format accepts only zip and tar. Since bz2 and gz are simply compressed tar, I guess there's little point attaching all the combinations.
Interestingly, zip prints a warning:
$ git archive --format zip HEAD > issue71.zip
warning: path is not valid UTF-8: test.�
https://docs.python.org/3/library/os.path.html
Unfortunately, some file names may not be representable as strings on Unix, so applications that need to support arbitrary file names on Unix should use bytes objects to represent path names. Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files.
This sounds like some major pain, lol.
Just for information, what's your OS and filesystem that can represent such paths?
By the way, if you unarchive those archives, will the output result in the identical filename?
> Just for information, what's your OS and filesystem that can represent such paths?
Linux.
On Unix, filenames are byte strings terminated by a null byte, with components separated by forward slashes, and no encoding is enforced at all. Usually UTF-8 is used, though.
> By the way, if you unarchive those archives, will the output result in the identical filename?
In case of tar, yes.
In zip, though, Info-ZIP replaces the weird byte with -, and 7-zip replaces it with Â (b'\xc3\x82').
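One plausible explanation for the 7-zip result, sketched below: if the stray byte is interpreted as a single Latin-1 character and then re-encoded as UTF-8, you get exactly those two bytes. That 7-zip actually does this is an assumption on my part; I have not checked its source.

```python
original = b"\xc2"  # the stray byte from the file name

# Assumption: the byte is read as one Latin-1 character (U+00C2)...
as_text = original.decode("latin-1")
# ...and then written back out as UTF-8.
reencoded = as_text.encode("utf-8")

print(ascii(as_text), reencoded)
```

Either way, the name that comes out of the zip is no longer byte-identical to the one that went in.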
I see the following solutions at the moment
Could you elaborate on your use case?
My use case is a web server serving files with weird non-UTF-8 filenames to expose bugs in HTTP client code.
Git is already incompatible with certain filesystems and OSes by design: for example, Windows doesn't have Unix file permissions, but these permissions are always saved in Git repositories.
Do you use tar as your target format?
Yes.
What Python version do you use?
I tested on Python 3.6 and Python 2.7.
For Python 3.2+ there is a simpler solution thanks to fsencode / fsdecode. For Python 2, a dependency like chardet might be needed.
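A rough sketch of the Python 3 round trip, assuming a POSIX system where the filesystem error handler is surrogateescape (the default on Linux). This only illustrates the idea and is not the code that went into git-archive-all.

```python
import os

raw_name = b"test.\xc2"  # file name as raw bytes, not valid UTF-8

# fsdecode maps each undecodable byte to a lone surrogate instead of failing.
text_name = os.fsdecode(raw_name)

# fsencode reverses the mapping, so the original bytes come back unchanged.
round_tripped = os.fsencode(text_name)

print(ascii(text_name), round_tripped == raw_name)
```

The decoded string can be passed around like any other str, but encoding it with strict UTF-8 (for example when printing or logging) will still raise, so it has to go back through fsencode before it touches the filesystem or an archive.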
Should be fixed in 1.20.0
https://github.com/WGH-/git-archive-all-bug1
This is where it happens: https://github.com/Kentzo/git-archive-all/blob/fed1f48f1287c84220be08d63181a2816bde7a64/git_archive_all.py#L416-L425