ValvePython / vpk

📦 Open, Search, Extract and Create VPKs in python
MIT License
167 stars, 13 forks

Extracting & packing files with UTF8 names doesn't work #9

Closed JCRPaquin closed 5 years ago

JCRPaquin commented 5 years ago

The previous fix for this was to switch to latin-1 encoding, but the correct encoding seems to be ansi.

I noticed this when I was repacking the latest version of Dota Auto Chess, which contains a file with a UTF8 name; the packer crashed with latin-1 encoding (can't encode the name), and mangled the name when I encoded with utf-8 encoding.

ansi is the only encoding that seems to work correctly.
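The encode failure described above can be reproduced without the vpk tool. This is a minimal sketch in plain Python; the file name is taken from later in this thread, and the variable names are illustrative only. It avoids the ansi codec because that alias is only available on Windows.

```python
# Illustrative only: shows why latin-1 cannot encode this file name.
name = "不用了的.vjs_c"

# latin-1 covers code points U+0000..U+00FF only, so CJK characters fail.
try:
    name.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can encode any Unicode string; each of these CJK characters
# becomes three bytes.
utf8_bytes = name.encode("utf-8")

print(latin1_ok)
print(utf8_bytes)
```

Encoding with UTF-8 succeeds here, so any mangling would have to come from a later step that decodes those bytes with a different codec.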

rossengeorgiev commented 5 years ago

Any stack trace? Steps to reproduce? Version of Windows?

JCRPaquin commented 5 years ago

Python 3.6.3, Windows 10

To reproduce:

  1. Make a root folder
  2. Add a file to the root folder with non-Latin-1 characters (for example, CJK) in the name
  3. Run vpk -c <root folder name> <vpk name>

The script should crash because the latin-1 codec can't encode characters outside the Latin-1 range. No stack trace will be output.

JCRPaquin commented 5 years ago

Extracting UTF-8 encoded file names causes the file names to be altered during extraction.

For a vpk containing a file with the following name: 不用了的.vjs_c (encoded in ansi as b'\xe4\xb8\x8d\xe7\x94\xa8\xe4\xba\x86\xe7\x9a\x84')

Extraction results in the name: 不用了的.vjs_c (actual string can't be copied/pasted properly; also doesn't encode to ansi and doesn't decode to string below)

The name above is the direct ansi encoding of the utf-8 string: "不用了的"
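The kind of mangling described above is consistent with bytes written in one encoding being decoded with another. The sketch below assumes the stored bytes are the ones quoted in this comment plus the `.vjs_c` suffix, and that the reader decodes them as latin-1. That is an assumption about the cause, not something confirmed in this thread.

```python
# Bytes quoted in the comment above, plus the ".vjs_c" suffix (assumed).
raw = b'\xe4\xb8\x8d\xe7\x94\xa8\xe4\xba\x86\xe7\x9a\x84.vjs_c'

# latin-1 maps every byte to exactly one code point, so decoding never
# fails, but multi-byte sequences turn into unrelated characters.
mangled = raw.decode("latin-1")

# Decoding with the encoding that produced the bytes gives the real name.
intended = raw.decode("utf-8")

# The latin-1 round trip is lossless, so a mangled name can be repaired
# as long as nothing else altered it in between.
recovered = mangled.encode("latin-1").decode("utf-8")

print(intended == recovered)
```

Because the latin-1 round trip preserves every byte, the original name is still recoverable from the mangled string in this scenario.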

rossengeorgiev commented 5 years ago

The problem is that whatever packed the VPK could have used any encoding. Nothing in the format indicates which encoding was used. The package has no place trying to guess the source encoding. I think the following changes will address the problem, and allow for guessing if needed.
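One way to read this proposal is that the caller names the encoding explicitly, and any guessing happens outside the package. The sketch below is hypothetical: `decode_path` and `guess_decode` are made-up helper names for illustration and are not part of the vpk API.

```python
from typing import Sequence, Tuple


def decode_path(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode a stored path with a caller-chosen encoding."""
    return raw.decode(encoding)


def guess_decode(raw: bytes,
                 candidates: Sequence[str] = ("utf-8", "latin-1")) -> Tuple[str, str]:
    """Hypothetical helper: try each candidate in order, return (text, encoding).

    Order matters: latin-1 accepts any byte sequence, so it only makes
    sense as the last resort.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the path")


# Valid UTF-8 is decoded by the first candidate.
cjk = guess_decode(b'\xe4\xb8\x8d\xe7\x94\xa8\xe4\xba\x86\xe7\x9a\x84.vjs_c')

# 0xE9 followed by '.' is not valid UTF-8, so the fallback is used.
legacy = guess_decode(b'caf\xe9.txt')

print(cjk[1], legacy[1])
```

A guess like this can still be wrong, since many byte sequences are valid in more than one encoding, which is why the sketch keeps it separate from the explicit `decode_path` call.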

JCRPaquin commented 5 years ago

If the tool used relative paths and avoided using strings for any paths, then there'd be no need to encode/decode.

Would that work?

Limiting encoding to just the target file/directory name should make things easier.
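The idea of avoiding strings can be sketched with Python's bytes path support: most `os` functions accept bytes paths and return bytes names, so a stored name can reach the filesystem without an explicit decode step in the tool. This is an illustration only, not how vpk works. On Windows, Python 3.6 and later converts bytes paths using UTF-8 internally, so an encoding is still involved there.

```python
import os
import tempfile

# Raw name as it might be stored in an archive (bytes from this thread,
# plus the ".vjs_c" suffix).
raw_name = b'\xe4\xb8\x8d\xe7\x94\xa8\xe4\xba\x86\xe7\x9a\x84.vjs_c'

with tempfile.TemporaryDirectory() as root:
    # Convert only the directory we created ourselves to bytes.
    root_b = os.fsencode(root)

    # Joining bytes with bytes keeps the whole path as bytes.
    target = os.path.join(root_b, raw_name)
    with open(target, "wb") as f:
        f.write(b"data")

    # Listing with a bytes path returns bytes names.
    listed = os.listdir(root_b)

print(listed == [raw_name])
```

The trade-off is that anything shown to the user, such as a file listing, still needs a decode step at display time.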