Extracting & packing files with UTF8 names doesn't work

ValvePython / vpk

📦 Open, Search, Extract and Create VPKs in python

MIT License


Extracting & packing files with UTF8 names doesn't work #9

Closed by JCRPaquin 5 years ago

JCRPaquin commented 5 years ago

The fix for this previously was to switch to latin-1 encoding, but the correct encoding seems to be ansi.

I noticed this when I was repacking the latest version of Dota Auto Chess, which contains a file with a UTF-8 name. The packer crashed with latin-1 encoding (it can't encode the name), and mangled the name when I used utf-8 encoding.

ansi is the only encoding that seems to work correctly.

rossengeorgiev commented 5 years ago

Any stack trace? Steps to reproduce? Version of Windows?

JCRPaquin commented 5 years ago

Python 3.6.3, Windows 10

To reproduce:

1. Make a root folder
2. Add a file to the root folder with UTF-8 encoded characters in the name
3. Run vpk -c <root folder name> <vpk name>

The script crashes because the latin-1 codec can't encode characters outside the Latin-1 range (U+0000 to U+00FF). No stack trace is output.
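The crash described above can be reproduced without vpk. This is a minimal sketch, assuming the packer encodes each path with latin-1 as stated earlier in the thread; `try_encode` is a hypothetical helper for illustration and not part of vpk.

```python
def try_encode(name, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the name.

    Hypothetical helper for illustration; not part of the vpk package.
    """
    try:
        return name.encode(encoding)
    except UnicodeEncodeError:
        return None


name = "不用了的.vjs_c"

# latin-1 only covers U+0000..U+00FF, so the CJK characters cannot be encoded.
latin1_result = try_encode(name, "latin-1")

# UTF-8 can represent any Unicode string.
utf8_result = try_encode(name, "utf-8")

print(latin1_result)
print(utf8_result)
```

A pure ASCII name such as "abc.txt" encodes identically under both codecs, which is probably why the problem only shows up with names like this one.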

JCRPaquin commented 5 years ago

Extracting files with UTF-8 encoded names alters the names.

For a vpk containing a file with the following name: ä¸ç”¨äº†çš„.vjs_c (encoded in ansi as b'\xe4\xb8\x8d\xe7\x94\xa8\xe4\xba\x86\xe7\x9a\x84')

Extraction results in the name: ä¸ç¨äºç.vjs_c (actual string can't be copied/pasted properly; also doesn't encode to ansi and doesn't decode to string below)

The name stored in the vpk is what you get when the UTF-8 bytes of the string "不用了的" are decoded as ansi.
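The relationship between the byte string and the two readings of it can be checked directly. This sketch uses latin-1 as a stand-in for "ansi", which is an assumption: on Windows the ANSI code page depends on the system locale, and Python's own cp1252 codec rejects the byte 0x8d that appears in this name.

```python
# Bytes quoted earlier in the thread (without the .vjs_c extension).
raw = b"\xe4\xb8\x8d\xe7\x94\xa8\xe4\xba\x86\xe7\x9a\x84"

# Read as UTF-8, the 12 bytes form 4 characters: the intended name.
as_utf8 = raw.decode("utf-8")

# Read as latin-1, every byte becomes its own character: mojibake.
as_latin1 = raw.decode("latin-1")

# latin-1 maps each byte to exactly one code point, so the damage is
# reversible: re-encode with latin-1, then decode as UTF-8.
recovered = as_latin1.encode("latin-1").decode("utf-8")

# Python's cp1252 codec has no mapping for byte 0x8d, so strict decoding fails.
try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

print(as_utf8, len(as_utf8))
print(len(as_latin1))
print(recovered == as_utf8)
print(cp1252_ok)
```

The round trip only works because no bytes were dropped. The extracted name shown above has lost its unprintable characters, which would explain why it no longer decodes back to the original string.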

rossengeorgiev commented 5 years ago

The problem is that whatever packed the VPK could have used any encoding. There is nothing in the format that indicates which encoding was used, and the package has no business guessing the source encoding. I think the following changes will address the problem, and allow for guessing if needed.

[ ] CLI: add an encoding argument, which will default to utf-8 (or a platform-specific default, need to test that)
[ ] VPK classes: add an encoding argument that defaults to utf-8 (covers most cases). When encoding is set to None, paths are returned as bytes, so the caller can guess the encoding itself
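A rough sketch of how the second item could behave, under the assumption that paths are stored as raw bytes in the archive. Both `decode_path` and `guess_path` are hypothetical names made up for this illustration; they are not vpk's actual API, and the candidate list is an arbitrary choice.

```python
def decode_path(raw, encoding="utf-8"):
    """Decode a stored path, or hand back the raw bytes when encoding is None.

    Hypothetical helper illustrating the proposal; not part of vpk.
    """
    if encoding is None:
        return raw
    return raw.decode(encoding)


def guess_path(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Caller-side guessing: return (text, encoding) for the first codec that fits.

    latin-1 accepts every byte sequence, so with the default candidates the
    final fallback below is never reached.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw, None


stored = b"\xe4\xb8\x8d\xe7\x94\xa8\xe4\xba\x86\xe7\x9a\x84.vjs_c"

print(decode_path(stored))           # decoded with the utf-8 default
print(decode_path(stored, None))     # raw bytes, left for the caller to interpret
print(guess_path(b"caf\xe9.txt"))    # not valid UTF-8, so the next candidate is tried
```

Keeping the guessing in a separate caller-side function matches the point made above: the library stays explicit about the encoding, and anyone who wants heuristics opts in by asking for bytes.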

JCRPaquin commented 5 years ago

If the tool used relative paths and avoided strings for all paths, there'd be no need to encode or decode.

Would that work?

Limiting encoding to just the target file/directory name should make things easier.
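One way to read this suggestion is to walk the root folder with bytes paths, so that file names never pass through `str`. The sketch below is only an illustration of that idea; `relative_byte_paths` is a hypothetical helper, not something vpk provides. Note that the bytes returned by the operating system are platform dependent (as far as I know, Python 3.6+ on Windows presents bytes paths as UTF-8), so this sidesteps decoding in the tool but doesn't by itself decide which encoding ends up in the archive.

```python
import os
import tempfile


def relative_byte_paths(root):
    """Yield every file under root as a bytes path relative to root.

    Hypothetical helper for illustration; not part of vpk.
    Separators are normalised to b"/" so results look the same on all platforms.
    """
    root = os.fsencode(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            rel = os.path.relpath(full, root)
            yield rel.replace(os.fsencode(os.sep), b"/")


# Demo: create one file whose name is given directly as bytes.
name = b"\xe4\xb8\x8d\xe7\x94\xa8\xe4\xba\x86\xe7\x9a\x84.vjs_c"

with tempfile.TemporaryDirectory() as tmp:
    subdir = os.path.join(os.fsencode(tmp), b"scripts")
    os.mkdir(subdir)
    with open(os.path.join(subdir, name), "wb") as handle:
        handle.write(b"")
    found = list(relative_byte_paths(tmp))

print(found)
```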