"'utf-8' codec can't decode byte" error when filename is not valid UTF-8

Kentzo / git-archive-all

A python script wrapper for git-archive that archives a git superproject and its submodules, if it has any. Takes into account .gitattributes

MIT License


"'utf-8' codec can't decode byte" error when filename is not valid UTF-8 #71

Closed WGH- closed 4 years ago

WGH- commented 5 years ago

https://github.com/WGH-/git-archive-all-bug1

$ git init
$ touch $'test.\xC2'
$ git add -A
$ git commit -m "initial commit"
$ ~/.local/bin/git-archive-all test.zip
'utf-8' codec can't decode byte 0xc2 in position 5: invalid continuation byte

This is where it happens: https://github.com/Kentzo/git-archive-all/blob/fed1f48f1287c84220be08d63181a2816bde7a64/git_archive_all.py#L416-L425

Kentzo commented 5 years ago

Could you run git ls-files -z > /tmp/git.output and attach the /tmp/git.output file?

WGH- commented 5 years ago

b'test.\xc2\x00'
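
For illustration, decoding that output as strict UTF-8 reproduces the reported message. This is only a minimal sketch: the names `ls_files_output` and `try_strict_decode` are made up here, and it is not the code from git_archive_all.py.

```python
# Raw bytes as printed above for `git ls-files -z` in the test repository.
ls_files_output = b"test.\xc2\x00"


def try_strict_decode(data):
    """Return the decoded text, or the UnicodeDecodeError if decoding fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


result = try_strict_decode(ls_files_output)
print(result)
```

The byte 0xc2 starts a two-byte UTF-8 sequence, but the next byte is the NUL separator rather than a continuation byte, so the strict decoder gives up at position 5.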

Kentzo commented 5 years ago

Interesting. Could you also try archiving it with plain git archive into tar, gz, zip and bz2 formats and attach the resulting files?

WGH- commented 5 years ago

git-archive --format accepts only zip and tar. Since bz2 and gz are simply compressed tar, I guess there's little point attaching all the combinations.

Interestingly, zip prints a warning:

$ git archive --format zip HEAD > issue71.zip
warning: path is not valid UTF-8: test.�

issue71.zip issue71.tar.gz

WGH- commented 5 years ago

https://docs.python.org/3/library/os.path.html

Unfortunately, some file names may not be representable as strings on Unix, so applications that need to support arbitrary file names on Unix should use bytes objects to represent path names. Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files.

This sounds like some major pain, lol.

Kentzo commented 5 years ago

Just for information, what's your OS and filesystem that can represent such paths?

Kentzo commented 5 years ago

By the way, if you unarchive those archives, will the output result in the identical filename?

WGH- commented 5 years ago

Just for information, what's your OS and filesystem that can represent such paths?

Linux.

On Unix, filenames are null-terminated byte strings whose components are separated by forward slashes, and no encoding is enforced at all. UTF-8 is the usual choice, though.

WGH- commented 5 years ago

By the way, if you unarchive those archives, will the output result in the identical filename?

In case of tar, yes.

In zip, though, Info-ZIP replaces the weird byte with -, and 7-zip replaces it with Â (b'\xc3\x82').
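
The 7-zip result is consistent with the stray byte being read as Latin-1 and then written back out as UTF-8. That interpretation is an assumption about what 7-zip does internally, not something confirmed in this thread; the sketch below only shows that the bytes line up.

```python
# The single invalid byte from the filename.
raw = b"\xc2"

# Assumption: the extractor treats the byte as Latin-1, where 0xC2 is 'Â'.
as_latin1 = raw.decode("latin-1")

# Writing that character back out as UTF-8 yields the two bytes seen on disk.
reencoded = as_latin1.encode("utf-8")

print(repr(as_latin1), reencoded)
```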

Kentzo commented 5 years ago

I see the following solutions at the moment:

1. Maintain the original name, but make an archive that is incompatible with certain formats (zip) and (file)systems
2. Fix the name by escaping invalid UTF-8 sequences
3. A mix of [1] and [2] at the cost of inconsistency
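
As a rough sketch of what the second option could look like, Python's built-in `backslashreplace` error handler (usable for decoding since Python 3.5) turns each undecodable byte into a visible escape. The helper name `escape_invalid_utf8` is hypothetical, and nothing here is claimed to be what git-archive-all ended up doing.

```python
def escape_invalid_utf8(raw_name):
    """Decode a filename, replacing undecodable bytes with \\xNN escapes."""
    return raw_name.decode("utf-8", errors="backslashreplace")


escaped = escape_invalid_utf8(b"test.\xc2")
valid = escape_invalid_utf8("caf\u00e9.txt".encode("utf-8"))

print(escaped)
print(valid)
```

The trade-off is that the escaped name is no longer the original name: a file that is literally called `test.\xc2` (with a real backslash) would map to the same string.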

Could you elaborate on your use case?

WGH- commented 5 years ago

My use case is a web server serving files with weird non-UTF-8 filenames to expose bugs in HTTP client code.

Git is already incompatible with certain filesystems and OSes by design: for example, Windows doesn't have Unix file permissions, but these permissions are always saved in Git repositories.

Kentzo commented 5 years ago

Do you use tar as your target format?

WGH- commented 5 years ago

Yes.

Kentzo commented 5 years ago

What Python version do you use?

WGH- commented 5 years ago

I tested on Python 3.6 and Python 2.7.

Kentzo commented 5 years ago

For Python 3.2+ there is a simpler solution thanks to os.fsencode / os.fsdecode. For Python 2, a dependency like chardet might be needed.
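
A minimal sketch of the round trip this relies on, assuming a POSIX system where the filesystem encoding uses the `surrogateescape` error handler. The variable names are illustrative only, and the explicit `surrogateescape` calls are shown alongside `os.fsdecode` / `os.fsencode` because the latter depend on the platform's filesystem encoding.

```python
import os

raw_name = b"test.\xc2"

# Explicit form: the invalid byte 0xC2 becomes the lone surrogate U+DCC2,
# which encodes back to the original byte with the same error handler.
text_name = raw_name.decode("utf-8", errors="surrogateescape")
restored = text_name.encode("utf-8", errors="surrogateescape")

# Platform-dependent form: only attempted on POSIX, where os.fsdecode
# is expected to use surrogateescape as well.
fs_round_trip = None
if os.name == "posix":
    fs_round_trip = os.fsencode(os.fsdecode(raw_name))

print(repr(text_name), restored, fs_round_trip)
```

The resulting string cannot be printed or encoded as strict UTF-8, so it is only useful as an opaque handle that is converted back to bytes before touching the filesystem or the archive.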

Kentzo commented 4 years ago

Should be fixed in 1.20.0