clalancette / pycdlib

Python library to read and write ISOs
GNU Lesser General Public License v2.1

[docs] Update the doc page on file extraction to mention `joliet_path` + [feature request] Explicit API for getting file offsets + [kudo] Nice lib! #127

Open vadimkantorov opened 2 months ago

vadimkantorov commented 2 months ago

UPD: I guess in 2024 pure ISO files are not very common, so for getting nice file names out it would be awesome to showcase auto-detection or use of joliet_path in:

Thanks!

Hi!

I'm trying out pycdlib for working with TexLive ISO distribution files: https://tug.ctan.org/systems/texlive/Images/texlive2024-20240312.iso. The end goal is to compute the file offsets and use mmap(...) to read the individual files in the ISO (as we can then virtualize the files using fmemopen(...)).

import sys, io
import pycdlib
iso = pycdlib.PyCdlib()
iso.open('texlive2024-20240312.iso')
for child in iso.list_children(iso_path='/'):
    print(child.file_identifier())
extracted = io.BytesIO()
iso.get_file_from_iso_fp(extracted, iso_path='/README.;1')
print(extracted.getvalue().decode('utf-8'))
iso.close()

prints

b'.'
b'..'
b'ARCHIVE'
b'AUTORUN.INF;1'
b'DOC.HTM;1'
b'INDEX.HTM;1'
b'INSTALL_.;1'
b'INSTALL_.BAT;1'
b'LICENSE.CTA;1'
b'LICENSE.TL;1'
b'README.;1'
b'README.USE;1'
b'README_H.DIR'
b'README_T.DIR'
b'RELEASE_.TXT;1'
b'SOURCE'
b'TEXLIVE_'
b'TLPKG'
b'TL_TRAY_.EXE;1'
b'_MKISOFS.;1'
For the introductory information to TeX Live, see the directories
readme-txt.dir (plain text files) or readme-html.dir/ (HTML files).
The material is available in several languages.

If I mount the ISO on my Windows, I get the following dir listing:

D:\>dir
 Volume in drive D is TeXLive2024
 Volume Serial Number is 6532-0811

 Directory of D:\

02/11/2024  12:46 AM                91 .mkisofsrc
09/28/2006  06:31 PM             2,098 LICENSE.CTAN
11/20/2019  04:36 AM             5,267 LICENSE.TL
05/08/2016  04:35 PM               182 README
08/09/2008  03:39 PM               250 README.usergroups
03/12/2024  03:22 AM    <DIR>          archive
05/29/2014  10:22 AM                40 autorun.inf
03/11/2024  02:44 AM         1,719,204 doc.html
04/20/2022  12:51 AM             1,852 index.html
02/05/2024  07:23 PM           125,030 install-tl
05/13/2023  09:26 PM             5,083 install-tl-windows.bat
05/05/2023  05:38 PM    <DIR>          readme-html.dir
05/05/2023  05:38 PM    <DIR>          readme-txt.dir
03/12/2024  03:21 AM               368 release-texlive.txt
03/12/2024  12:03 AM    <DIR>          source
03/12/2024  03:21 AM    <DIR>          texlive-doc
03/07/2023  10:43 PM            49,664 tl-tray-menu.exe
03/12/2024  03:22 AM    <DIR>          tlpkg
              12 File(s)      1,909,129 bytes
               6 Dir(s)               0 bytes free

Why do the filenames from pycdlib always come out in uppercase and truncated (note install-tl-windows.bat coming out as INSTALL_.BAT;1, or README as README.;1), while they come out nicely in the Windows dir listing?

Is it because the TexLive ISO file is using some tricky extension of the ISO standard? How does one get a nice filename listing with pycdlib?

Thanks a lot!

vadimkantorov commented 2 months ago

Okay, I figured out how to get nice names for TexLive ISO files. I adapted the auto mode from https://github.com/clalancette/pycdlib/blob/master/tools/pycdlib-extract-files. Maybe it would be nice if such an auto mode were supported directly in the list_children(...) API (and others), e.g. by introducing an argument auto_path="" (and, if back-compat is not a concern, maybe it could become the default when a path is passed as a positional arg rather than as a kwarg).

For this TexLive ISO file, what is strange is that the auto mode detects Rock Ridge, but using rr_path in place of joliet_path again gives the mangled file names.

Even if a new arg is not introduced, it would be nice to have a note on joliet_path in the Examples doc: https://clalancette.github.io/pycdlib/example-opening-existing-iso.html and https://clalancette.github.io/pycdlib/example-extracting-data-from-iso.html

So the remaining question on file offsets seems worked-around in

so I can hope for a more explicit API/example for getting file offsets / file sizes (basically multiplying child.orig_extent_loc by iso.logical_block_size, and maybe exposing the result as child.data_offset).
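The offset arithmetic described above can be sketched with the standard library only. This is a rough illustration, not pycdlib's API: `file_byte_range` and `read_via_mmap` are hypothetical helper names, the 2048-byte default block size is an assumption that should be replaced by the value read from the image, and the sketch assumes the file's data sits in one contiguous extent. With pycdlib, the extent and length would presumably come from the directory record (e.g. `child.orig_extent_loc` as mentioned above, plus the record's data length).

```python
import mmap
import os
import tempfile


def file_byte_range(extent_loc, data_length, block_size=2048):
    """Return (start, end) byte offsets of a file's data inside the image.

    Assumes the data occupies one contiguous extent starting at
    logical block `extent_loc`.
    """
    start = extent_loc * block_size
    return start, start + data_length


def read_via_mmap(image_path, extent_loc, data_length, block_size=2048):
    """Read one file's bytes straight out of the image using mmap."""
    start, end = file_byte_range(extent_loc, data_length, block_size)
    with open(image_path, 'rb') as fp:
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[start:end])


def demo_synthetic_image():
    """Build a fake 4-block image with a payload at block 3 and read it back."""
    block_size = 2048
    payload = b'hello from block 3'
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(b'\0' * (3 * block_size))                    # blocks 0..2
        tmp.write(payload.ljust(block_size, b'\0'))            # block 3, padded
        tmp.close()
        return read_via_mmap(tmp.name, 3, len(payload), block_size)
    finally:
        tmp.close()
        os.remove(tmp.name)


if __name__ == '__main__':
    print(demo_synthetic_image())
```

The synthetic image only stands in for a real ISO; the point is that once the extent location and length are known, no extraction step is needed to get at the bytes.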

import io
import pycdlib
iso = pycdlib.PyCdlib()
iso.open('../texlive2024-20240312.iso')
# Auto-detection adapted from tools/pycdlib-extract-files
if iso.has_udf():
    pathname = 'udf_path'
elif iso.has_rock_ridge():
    pathname = 'rr_path'
elif iso.has_joliet():
    pathname = 'joliet_path'
else:
    pathname = 'iso_path'
# For this ISO the detected mode is rr_path ...
print(pathname)
# ... but joliet_path is hard-coded below, since it is what gives the nice names
for child in iso.list_children(joliet_path='/'):
    print(child.file_identifier().decode('utf-8'))
extracted = io.BytesIO()
iso.get_file_from_iso_fp(extracted, joliet_path='/README')
print(extracted.getvalue().decode('utf-8'))
iso.close()
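A possible shape for the auto mode suggested above, as a minimal sketch. `pick_path_kwarg` and `list_children_auto` are hypothetical names, not part of pycdlib; they rely only on the `has_udf()`, `has_rock_ridge()`, `has_joliet()` and `list_children(...)` calls already used in this issue. The default preference order puts Joliet first purely because that is what gave nice names for this ISO, which is an assumption and differs from the order in pycdlib-extract-files.

```python
def pick_path_kwarg(iso, preference=('joliet_path', 'rr_path', 'udf_path')):
    """Return the name of the path kwarg to use for this ISO.

    `iso` is expected to be an opened pycdlib.PyCdlib object (or anything
    with the same has_*() methods). Falls back to plain 'iso_path'.
    """
    checks = {
        'udf_path': iso.has_udf,
        'rr_path': iso.has_rock_ridge,
        'joliet_path': iso.has_joliet,
    }
    for name in preference:
        if checks[name]():
            return name
    return 'iso_path'


def list_children_auto(iso, path='/'):
    """Call iso.list_children() with whichever path kwarg was picked."""
    kwarg = pick_path_kwarg(iso)
    return iso.list_children(**{kwarg: path})
```

Passing `preference=('udf_path', 'rr_path', 'joliet_path')` would reproduce the detection order from the snippet above; the same kwarg name can be reused for `get_file_from_iso_fp(...)`.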