Closed JCRPaquin closed 5 years ago
Any stack trace? Steps to reproduce? Version of Windows?
Python 3.6.3, Windows 10
To reproduce:
vpk -c <root folder name> <vpk name>
The script should crash because the latin-1 codec can't encode UTF-8 characters. No stack trace will be output.
Extracting UTF-8 encoded file names causes the file names to be altered during extraction.
For a vpk containing a file with the following name:
ä¸ç”¨äº†çš„.vjs_c
(encoded in ansi as b'\xe4\xb8\x8d\xe7\x94\xa8\xe4\xba\x86\xe7\x9a\x84')
Extraction results in the name:
ä¸ç¨äºç.vjs_c
(The actual string can't be copied/pasted properly; it also doesn't encode to ansi and doesn't decode to the string below.)
The name above is the direct ansi encoding of the utf-8 string:
"不用了的"
The problem is that whatever packed the VPK could have used any encoding. Nothing in the format indicates which encoding was used, and the package has no place trying to guess the source encoding. I think the following changes will address the problem and allow for guessing if needed.
encoding argument, which will default to utf-8 (or platform specific, need to test that)
encoding argument, default to utf-8 (covers most of the cases). When encoding is set to None, the path is returned in bytes, so the encoding can be guessed, etc.
If the tool used relative paths and avoided using strings for any paths then there'd be no need to encode/decode.
Would that work?
Limiting encoding to just the target file/directory name should make things easier.
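The proposed encoding argument, including the None case that hands back raw bytes, could look roughly like the sketch below. The helper name decode_path and its signature are hypothetical and only illustrate the idea; they are not taken from the vpk package.

```python
from typing import Optional, Union


def decode_path(raw: bytes, encoding: Optional[str] = "utf-8") -> Union[str, bytes]:
    """Hypothetical helper (not part of vpk): decode a stored path.

    With encoding=None the bytes are returned untouched, so the caller
    can guess or detect the encoding on their own.
    """
    if encoding is None:
        return raw
    return raw.decode(encoding)


# Bytes of the file name quoted earlier in this thread, plus its extension.
raw = b"\xe4\xb8\x8d\xe7\x94\xa8\xe4\xba\x86\xe7\x9a\x84.vjs_c"

as_text = decode_path(raw)           # decoded with the utf-8 default
as_bytes = decode_path(raw, None)    # left as bytes for the caller to handle
```

Defaulting to utf-8 keeps the common case simple, while None leaves the decision to the caller instead of having the package guess.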
The fix for this previously was to switch to latin-1 encoding, but the correct encoding seems to be ansi.
I noticed this when I was repacking the latest version of Dota Auto Chess, which contains a file with a UTF-8 name; the packer crashed with latin-1 encoding (can't encode the name), and mangled the name when I encoded with utf-8 encoding. ansi is the only encoding that seems to work correctly.
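The latin-1 behaviour described in this thread can be reproduced in isolation with the bytes quoted above. This is only an illustration of the codec mismatch, not the packer's actual code path; it uses latin-1 rather than the Windows-only ansi codec so that it runs on any platform.

```python
# UTF-8 bytes of the file name stem quoted in this thread.
raw = b"\xe4\xb8\x8d\xe7\x94\xa8\xe4\xba\x86\xe7\x9a\x84"

# Decoding with the encoding that produced the bytes recovers the name.
name = raw.decode("utf-8")

# latin-1 maps every byte to one code point, so decoding never fails,
# but each three-byte character turns into three unrelated characters.
mangled = raw.decode("latin-1")

# The reverse direction fails: these characters have no latin-1 encoding.
try:
    name.encode("latin-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(len(name), len(mangled), encode_failed)  # 4 12 True
```

Because latin-1 decoding is lossless at the byte level, encoding the mangled string back with latin-1 returns the original bytes, which is why a mismatch can go unnoticed until a name is displayed or written to disk.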