clalancette / pycdlib

Python library to read and write ISOs
GNU Lesser General Public License v2.1

os.walk fails on Shift-JIS encoded ISO-9660 filesystem #101

Closed einstein95 closed 1 year ago

einstein95 commented 2 years ago

Apparently, at least in 2003 and with Toast ISO 9660 Builder ("HAVE A NICE DAY"), it was possible to create an ISO 9660 filesystem whose names are not ASCII. This Japanese clipart disc (https://archive.org/download/GorippaPetit19/Gorippa%20Petit%2019.iso) is what causes the problem.

The current code fails with:

  File "pycdlib/pycdlib.py", line 5932, in walk
    encoded = name.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8c in position 0: invalid start byte
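
The same error can be reproduced without the disc image. This is a minimal sketch, assuming the directory record holds a Shift-JIS name; the sample bytes and the `try_decode` helper are made up for illustration and are not taken from the disc or from pycdlib. The byte 0x8c is a valid Shift-JIS lead byte but can never start a UTF-8 sequence.

```python
# Hypothetical raw name: two bytes forming one Shift-JIS character, followed
# by an ASCII extension. Not taken from the disc.
raw_name = b"\x8c\xa9.BMP"


def try_decode(raw, encoding):
    """Return the decoded name, or None if the bytes are invalid for `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# UTF-8 rejects the name at position 0, just like the traceback above.
utf8_result = try_decode(raw_name, "utf-8")

# Shift-JIS accepts it: one kanji followed by ".BMP".
sjis_result = try_decode(raw_name, "shift_jis")
```

Here `utf8_result` is `None`, while `sjis_result` is a five-character string ending in `.BMP`.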

My suggestion is to add an "encoding" keyword that can override the existing ones in https://github.com/clalancette/pycdlib/blob/e67d63512281b2966e2f8d5e2fa4f2a5f3579544/pycdlib/pycdlib.py#L5952-L5974

clalancette commented 2 years ago

Yeah, I see what you mean.

Unfortunately, due to the way things are implemented, this is not going to be as easy as adding an encoding parameter. We use and store the strings internally to do all sorts of things, like looking up directory records. Probably the right fix here is to store bytestrings internally and only convert to/from the encoding in the user-facing APIs, but that is a big internal change. I'll have to think about this further.
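
To make the "bytes internally, text at the edges" idea concrete, here is a rough sketch. The `NameIndex` class and its methods are hypothetical and do not reflect pycdlib's actual internals; they only illustrate keeping raw bytes as the lookup key and converting at the API boundary.

```python
class NameIndex:
    """Toy directory index keyed by raw bytes (hypothetical, not pycdlib code)."""

    def __init__(self, encoding="utf-8"):
        self.encoding = encoding
        self._records = {}  # raw name bytes -> record payload

    def add_raw(self, raw_name, record):
        # Internal path: names arrive as bytes straight from the disc.
        self._records[raw_name] = record

    def lookup(self, name):
        # User-facing path: encode the text name once, then match on bytes.
        return self._records.get(name.encode(self.encoding))

    def names(self):
        # User-facing path: decode only when handing names back to the caller.
        return [raw.decode(self.encoding) for raw in self._records]


index = NameIndex(encoding="shift_jis")
index.add_raw(b"\x8c\xa9.BMP", "record-1")  # made-up Shift-JIS name
listed = index.names()
found = index.lookup(listed[0])
```

With this layout, an undecodable name never breaks internal lookups; decoding errors can only surface in `names()` or `lookup()`, where the caller controls the encoding.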

clalancette commented 1 year ago

Actually, I was totally wrong about this. You were right, we just needed an encoding argument in the walk API. I've added that (and a test) in 04812daf69c2453db06b5fefbb9cdf1f1fbb62d0. So this should be fixed now. Thanks for the report!
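
For anyone landing here later, usage might look roughly like the sketch below. It assumes the new keyword is named `encoding`, as proposed in the original report, and that pycdlib is installed; the `list_iso_files` helper and its default value are illustrative only. Check the `walk` docstring in your installed version before relying on it.

```python
def list_iso_files(iso_file, encoding="shift_jis"):
    """Return the full ISO 9660 paths of all files, decoding names with `encoding`."""
    import pycdlib  # third-party; imported lazily so this module loads without it

    iso = pycdlib.PyCdlib()
    iso.open(iso_file)
    try:
        paths = []
        # Assumption: `encoding` is the keyword name added by the fix.
        for dirname, _dirlist, filelist in iso.walk(iso_path="/", encoding=encoding):
            for name in filelist:
                # Avoid a double slash when the directory is the root.
                paths.append(dirname.rstrip("/") + "/" + name)
        return paths
    finally:
        iso.close()
```

Calling `list_iso_files("Gorippa Petit 19.iso")` on a local copy of the disc should then return decoded names instead of raising `UnicodeDecodeError`, provided Shift-JIS is indeed the encoding used on that disc.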