erincandescent / lib9660

A simple ISO9660 file system implementation.

Joliet file name support and not working cat :( #1

Open vadimkantorov opened 4 weeks ago

vadimkantorov commented 4 weeks ago

Hi!

First, thanks for sharing your ISO format library. I'm a big fan of this approach and of single-file projects, e.g. https://github.com/richgel999/miniz/.

I'm trying to use tb9660 to work with the large TeX Live distribution ISO (being able to mmap/open files directly from these large ISOs would be great, and would save the time and space needed to extract them when one does not have mount permissions): https://tug.ctan.org/systems/texlive/Images/texlive2024-20240312.iso

Calling ./tb9660 ../texlive2024-20240312.iso ls . gives:

ARCHIVE
AUTORUN.INF;1
DOC.HTM;1
INDEX.HTM;1
INSTALL_.;1
INSTALL_.BAT;1
LICENSE.CTA;1
LICENSE.TL;1
README.;1
README.USE;1
README_H.DIR
README_T.DIR
RELEASE_.TXT;1
SOURCE
TEXLIVE_
TLPKG
TL_TRAY_.EXE;1
_MKISOFS.;1

It seems that tb9660 does not auto-detect Joliet/RockRidge? A directory listing of the same ISO shows the long file names:

02/11/2024  12:46 AM                91 .mkisofsrc
09/28/2006  06:31 PM             2,098 LICENSE.CTAN
11/20/2019  04:36 AM             5,267 LICENSE.TL
05/08/2016  04:35 PM               182 README
08/09/2008  03:39 PM               250 README.usergroups
03/12/2024  03:22 AM    <DIR>          archive
05/29/2014  10:22 AM                40 autorun.inf
03/11/2024  02:44 AM         1,719,204 doc.html
04/20/2022  12:51 AM             1,852 index.html
02/05/2024  07:23 PM           125,030 install-tl
05/13/2023  09:26 PM             5,083 install-tl-windows.bat
05/05/2023  05:38 PM    <DIR>          readme-html.dir
05/05/2023  05:38 PM    <DIR>          readme-txt.dir
03/12/2024  03:21 AM               368 release-texlive.txt
03/12/2024  12:03 AM    <DIR>          source
03/12/2024  03:21 AM    <DIR>          texlive-doc
03/07/2023  10:43 PM            49,664 tl-tray-menu.exe
03/12/2024  03:22 AM    <DIR>          tlpkg
              12 File(s)      1,909,129 bytes
               6 Dir(s)               0 bytes free

I've also tried printing the README file with ./tb9660 ../texlive2024-20240312.iso cat 'README.;1', which prints gibberish:

�*4␦|���o�A��C�0�{�kM�NM`p��KX�ՅuG}�ǡ
T�9m�$&��k�9㐔��+����T�������`�R��l�k\�뉘 ��� 56C�

while in fact the README file contains the following:

For the introductory information to TeX Live, see the directories
readme-txt.dir (plain text files) or readme-html.dir/ (HTML files).
The material is available in several languages.

Calling it as L9660_DEBUG=1 ./tb9660 ../texlive2024-20240312.iso cat 'README.;1' gives:

| ---- dirent
| length        132
| xattr_length  0
| sector        19
| size          4096
| name          ""
| ---- end dirent
| ---- dirent
| length        96
| xattr_length  0
| sector        19
| size          4096
| name          ""
| ---- end dirent
| ---- dirent
| length        114
| xattr_length  0
| sector        22
| size          2074624
| name          "ARCHIVE"
| ---- end dirent
| ---- dirent
| length        124
| xattr_length  0
| sector        2721805
| size          40
| name          "AUTORUN.INF;1"
| ---- end dirent
| ---- dirent
| length        118
| xattr_length  0
| sector        2721806
| size          1719204
| name          "DOC.HTM;1"
| ---- end dirent
| ---- dirent
| length        122
| xattr_length  0
| sector        2722646
| size          1852
| name          "INDEX.HTM;1"
| ---- end dirent
| ---- dirent
| length        122
| xattr_length  0
| sector        2722647
| size          125030
| name          "INSTALL_.;1"
| ---- end dirent
| ---- dirent
| length        138
| xattr_length  0
| sector        2722709
| size          5083
| name          "INSTALL_.BAT;1"
| ---- end dirent
| ---- dirent
| length        126
| xattr_length  0
| sector        2722712
| size          2098
| name          "LICENSE.CTA;1"
| ---- end dirent
| ---- dirent
| length        124
| xattr_length  0
| sector        2722714
| size          5267
| name          "LICENSE.TL;1"
| ---- end dirent
| ---- dirent
| length        116
| xattr_length  0
| sector        2722717
| size          182
| name          "README.;1"
| ---- end dirent
�*4␦|���o�A��C�0�{�kM�NM`p��KX�ՅuG}�ǡ
T�9m�$&��k�9㐔��+����T�������`�R��l�k\�뉘 ��� 56C�

So tb9660 correctly computes the README file size as 182, but it probably (?) calculates the offset incorrectly and prints the contents of some other file. (As noted in the comment below, the sector index is indeed 2722717, so with a block size of 2048 it gives the correct byte offset 5576124416, which makes it unclear why tb9660 prints gibberish :( )
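The offset arithmetic can be checked with a short sketch. One thing it makes visible is that this byte offset no longer fits in an unsigned 32-bit integer. Whether tb9660 actually truncates the offset this way is an assumption that has not been checked against its source; the helper names below are hypothetical.

```python
BLOCK_SIZE = 2048  # logical block size assumed in this thread


def byte_offset(sector: int, block_size: int = BLOCK_SIZE) -> int:
    """Byte offset of a sector, computed with arbitrary-precision integers."""
    return sector * block_size


def wrapped_32bit(offset: int) -> int:
    """The value an offset would take if truncated to an unsigned 32-bit integer."""
    return offset & 0xFFFFFFFF


offset = byte_offset(2722717)      # sector of README.;1 from the debug output
print(offset)                      # 5576124416
print(offset > 0xFFFFFFFF)         # True: does not fit in 32 bits
print(wrapped_32bit(offset))       # 1281157120, i.e. sector 625565
```

If a 32-bit truncation did happen somewhere along the seek path, the read would land in an unrelated sector, which would be consistent with the correct size but wrong contents. This is only a hypothesis.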

Can one use the nice Joliet file names as input to cat, or at least have them decoded by the lib9660 structures? (Essentially, I need to get file offsets by a proper UTF-8 Joliet name/path.) Or is Joliet not supported at all?
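For reference, Joliet is signalled by a Supplementary Volume Descriptor (type 2) whose escape-sequence field holds one of `%/@`, `%/C` or `%/E`, and its directory records store names as big-endian UCS-2. Below is a minimal sketch of locating the Joliet root directory by hand, assuming 2048-byte sectors. It does not use lib9660, and the function names are hypothetical.

```python
import io
import struct

SECTOR = 2048
JOLIET_ESCAPES = (b"%/@", b"%/C", b"%/E")  # UCS-2 levels 1, 2 and 3


def find_joliet_root(f):
    """Return (extent_sector, data_length) of the Joliet root directory, or None."""
    sector = 16  # volume descriptors start at sector 16
    while True:
        f.seek(sector * SECTOR)
        vd = f.read(SECTOR)
        if len(vd) < SECTOR or vd[1:6] != b"CD001":
            return None  # not a valid volume descriptor
        if vd[0] == 255:
            return None  # volume descriptor set terminator: no Joliet found
        if vd[0] == 2 and vd[88:91] in JOLIET_ESCAPES:
            root = vd[156:190]  # 34-byte root directory record
            (extent,) = struct.unpack_from("<I", root, 2)   # little-endian extent
            (length,) = struct.unpack_from("<I", root, 10)  # little-endian size
            return extent, length
        sector += 1


def decode_joliet_name(raw: bytes) -> str:
    """Decode a Joliet file identifier (big-endian UCS-2) into a Python str."""
    return raw.decode("utf-16-be")
```

With a real image this would be called as `find_joliet_root(open(path, "rb"))`; walking the directory records under that root then yields the long names.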

Thank you!

vadimkantorov commented 4 weeks ago

Trying another library, https://clalancette.github.io/pycdlib/, I get the following (README is printed correctly!):

For the introductory information to TeX Live, see the directories
readme-txt.dir (plain text files) or readme-html.dir/ (HTML files).
The material is available in several languages.

The script I used:

import io
import pycdlib

iso = pycdlib.PyCdlib()
iso.open('texlive2024-20240312.iso')
# List the root directory via the Joliet tree: name, byte offset, size.
for child in iso.list_children(joliet_path='/'):
    print(child.file_identifier(), child.orig_extent_loc * iso.logical_block_size, child.data_length)
# Extract README into memory and print it.
extracted = io.BytesIO()
iso.get_file_from_iso_fp(extracted, joliet_path='/README')
print(extracted.getvalue().decode('utf-8'))
iso.close()

I compared the block index from orig_extent_loc with the one from tb9660, and they appear to be the same (assuming a block size of 2048), so it is quite unclear why tb9660 prints gibberish :(
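A direct way to cross-check the sector and size is to read the extent straight from the image, bypassing both libraries. This is a minimal sketch assuming a 2048-byte block size; `read_extent` is a hypothetical helper, not part of lib9660 or pycdlib.

```python
import io


def read_extent(f, sector: int, size: int, block_size: int = 2048) -> bytes:
    """Read `size` bytes starting at `sector` from a binary file-like object."""
    f.seek(sector * block_size)  # Python integers do not overflow at 2**32
    data = f.read(size)
    if len(data) != size:
        raise EOFError(f"expected {size} bytes, got {len(data)}")
    return data
```

For the README in this thread, that would be `read_extent(open('texlive2024-20240312.iso', 'rb'), 2722717, 182)`. If this returns the expected text, the directory record is fine and the problem lies in how the data is read.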

southpawfishel commented 1 week ago

I just tried out this library and ran into the same issue. It turns out that, per the ISO 9660 spec, file names can end in a semicolon followed by a version number, so README.TXT;1 is to be expected. If you try tb9660 <iso_name> cat README.TXT, I believe you should get the expected result.

When using this library, you would probably want to truncate the file name at the semicolon, dropping the semicolon and everything after it.
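The truncation suggested above could look like the sketch below. `strip_version` is a hypothetical helper, and removing the trailing dot left behind by names such as README.;1 is an extra assumption beyond what the comment proposes.

```python
def strip_version(name: str, drop_trailing_dot: bool = True) -> str:
    """Remove an ISO 9660 ';<version>' suffix from a file identifier."""
    base, sep, version = name.rpartition(";")
    if sep and version.isdigit():
        name = base  # drop the semicolon and the version number
    if drop_trailing_dot and len(name) > 1 and name.endswith("."):
        name = name[:-1]  # "README." -> "README" (empty extension)
    return name


for raw in ("README.;1", "AUTORUN.INF;1", "ARCHIVE"):
    print(raw, "->", strip_version(raw))
```

Directory identifiers such as ARCHIVE carry no version suffix and pass through unchanged.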

vadimkantorov commented 1 week ago

Well, for filenames it's the story of Joliet/RockRidge extensions and so forth.

L9660_DEBUG=1 ./tb9660 ../texlive2024-20240312.iso cat 'README.TXT' does not print anything useful.

L9660_DEBUG=1 ./tb9660 ../texlive2024-20240312.iso cat 'README.;1' does something non-trivial and prints the offset/size correctly, but it does not print the contents correctly; instead it prints something scrambled :(