elliotnunn / machfs

Library for reading and writing Macintosh HFS volumes
https://pypi.org/project/machfs/
MIT License

Japanese #2

Open rvanlaar opened 3 years ago

rvanlaar commented 3 years ago

Hey hello,

Over at ScummVM we're working on adding support for Mac Japanese games. A first step is being able to read them. Thanks to machfs we're able to read the disks with ease. Our challenge starts with the filename encoding: the Japanese filenames are encoded in shift_jis. I've added an example of a filename as machfs puts it out. Encoding it back to mac_roman and then decoding it as shift_jis gives the correct result.

"≤›¿∞»Øƒ ¥∏ΩÃfl€∞◊ 3.0 ôŸ¿fi".encode("mac_roman").decode("shift_jis")
'インターネット エクスプローラ 3.0 フォルダ'

What do you think is the best way to achieve this directly with machfs?
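
The round trip described above can be wrapped in a small helper. This is a minimal sketch and not part of machfs: `recover_name` is a hypothetical name, and it assumes the string was produced by decoding the on-disk bytes as mac_roman. The sample name is built inside the example itself rather than read from a disk image.

```python
def recover_name(name: str, encoding: str = "shift_jis") -> str:
    """Undo a mac_roman decode and re-decode the bytes with another codec."""
    raw = name.encode("mac_roman")  # recovers the original on-disk bytes
    return raw.decode(encoding)


# Simulate what a Mac Roman-only reader would hand back for a Shift JIS name.
original = "フォルダ"
mangled = original.encode("shift_jis").decode("mac_roman")

print(repr(mangled))           # looks like Mac Roman noise
print(recover_name(mangled))   # the Japanese name again
```

This only works because the intermediate mac_roman step maps every byte value to some character, so no information is dropped along the way.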

elliotnunn commented 3 years ago

I am happy to hear that you find machfs useful. ScummVM is a great project!

It was a mistake to hardcode machfs to assume Mac Roman filenames. It would have been better to treat filenames as raw bytes, and perform conversions only when needed to sync with the host filesystem.

But for a quick fix to access your data, commit 6ae4baa on the shift-jis-hack branch should work.

rvanlaar commented 3 years ago

Thank you for your quick response and thanks for the kind words on ScummVM.

Raw bytes would make it easier to do the encoding outside of machfs. The filenames we need in the end are encoded in a punycode variant. That is because ScummVM also runs on exotic platforms that lack UTF-8 support for filenames.
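
For illustration, Python ships a standard punycode codec that turns a Unicode name into ASCII-only bytes and back. The sketch below shows plain RFC 3492 punycode, not the ScummVM variant mentioned above, whose details are not given in this thread. The helper names are hypothetical.

```python
def to_ascii_name(name: str) -> str:
    """Encode a Unicode filename as an ASCII-only punycode string."""
    return name.encode("punycode").decode("ascii")


def from_ascii_name(ascii_name: str) -> str:
    """Reverse to_ascii_name and return the original Unicode filename."""
    return ascii_name.encode("ascii").decode("punycode")


name = "フォルダ 3.0"
encoded = to_ascii_name(name)
print(encoded)                    # ASCII only, safe on hosts without UTF-8 names
print(from_ascii_name(encoded))   # the original name
```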

Regarding shift-jis: I see your change is a nice one-liner. Maybe we could specify the encoding when initializing the volume. For us it's all about reading HFS, not writing.

As a prototype I changed mac_roman to shift_jis in machfs. That worked to some extent. The problem I've hit now is that mac_japanese is not exactly the same as shift_jis. For example, in mac_japanese \xfe is the ™ symbol, which the shift_jis codec cannot decode. Working on that now.

b'QuickTime\xfe \x89\xb9\x90F\x91\xce\x89\x9e\x95\\'.decode("shift_jis")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'shift_jis' codec can't decode byte 0xfe in position 9: illegal multibyte sequence
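
One way to cope with single bytes that shift_jis rejects is a custom codec error handler. The sketch below is not a complete mac_japanese decoder: it covers only the one difference reported in this thread (0xFE as ™). The handler name `mac_japanese_fallback` and the table `MAC_JAPANESE_EXTRAS` are hypothetical, and any further entries would have to be filled in from Apple's published mapping table.

```python
import codecs

# Bytes that plain shift_jis rejects but Mac Japanese defines.
# Only 0xFE is taken from this thread; extend as needed.
MAC_JAPANESE_EXTRAS = {0xFE: "\u2122"}  # TRADE MARK SIGN


def _mac_japanese_fallback(exc):
    """Substitute known Mac-only bytes; re-raise anything else."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    byte = exc.object[exc.start]
    if byte in MAC_JAPANESE_EXTRAS:
        # Return the replacement text and resume right after the bad byte.
        return MAC_JAPANESE_EXTRAS[byte], exc.start + 1
    raise exc


codecs.register_error("mac_japanese_fallback", _mac_japanese_fallback)


def decode_mac_japanese(raw: bytes) -> str:
    """Decode as shift_jis, patching in the extra Mac Japanese bytes."""
    return raw.decode("shift_jis", errors="mac_japanese_fallback")


raw_name = b'QuickTime\xfe \x89\xb9\x90F\x91\xce\x89\x9e\x95\\'
print(decode_mac_japanese(raw_name))
```

Bytes that are neither valid shift_jis nor listed in the table still raise `UnicodeDecodeError`, so unknown differences are not silently hidden.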
rvanlaar commented 3 years ago

An update: the encoding problem was solved on our end. Luckily, mac_roman is a single-byte encoding, so nothing is lost when decoding: the original bytes can be recovered by encoding the name back to mac_roman.

Since mac_japanese is different from shift_jis, we ended up writing our own mac_japanese decoder: https://github.com/scummvm/scummvm/commit/86f6c137f5f7e85c558752d43336b78c16b832a4

It would be great if machfs could output filenames as bytestrings in the future.