cta-observatory / eventio_kaitai

Prototyping a Kaitai Struct implementation of eventio data structures
BSD 3-Clause "New" or "Revised" License

Compilation of kaitai python module #5

Open zonca opened 1 year ago

zonca commented 1 year ago

@jpivarski how do you recommend we handle the compilation of the kaitai python module?

The simplest thing would be to just compile it locally and add the generated file to the repository. However, we could also make it part of the Python package. I found, for example, this repository (we do not need such a complicated system, but we could simplify it):

https://github.com/trailofbits/polyfile/blob/master/compile_kaitai_parsers.py
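A much simpler variant of that idea could look like the sketch below. It assumes the `kaitai-struct-compiler` executable is already installed and on the `PATH`; the file name `eventio.ksy` and the output directory `generated` are hypothetical placeholders, not names taken from this repository.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: the compiler is installed under its usual executable name.
COMPILER = "kaitai-struct-compiler"


def compiler_command(ksy_file: Path, outdir: Path) -> list[str]:
    """Build the command line that turns one .ksy file into a Python module."""
    return [COMPILER, "--target", "python", "--outdir", str(outdir), str(ksy_file)]


def compile_ksy(ksy_file: Path, outdir: Path) -> None:
    """Run the compiler, failing early with a clear message if it is missing."""
    if shutil.which(COMPILER) is None:
        raise RuntimeError(f"{COMPILER} was not found on PATH")
    outdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(compiler_command(ksy_file, outdir), check=True)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    compile_ksy(Path("eventio.ksy"), Path("generated"))
```

The same script could be run by hand before committing the generated file, or called from a CI job, so either option discussed here stays open.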

jpivarski commented 1 year ago

Compiling the Kaitai file-reader requires a Scala dependency, kaitai_struct_compiler, in addition to the usual CMake, make, and a gcc or clang compiler. For a Python package to provide that as a runtime capability, it could be distributed by conda-forge to get the normal compilers (compilers, make, and cmake), but there isn't a good way to get Scala through the conda-forge channel; there's a very old package in bioconda, but mixing channels is a bad idea, and a package from another channel can't be listed in the dependencies. On top of that, users would still need to get the kaitai_struct_compiler separately, and until @ManasviGoyal's updates are integrated into the upstream kaitai_struct_compiler, they would have to get it from a non-standard source.

So unfortunately, I think the process of creating Kaitai file-readers, at least up to the generated C++ source code stage, has to be a specialized activity (until the above issues are resolved). Saving architecture-specific binaries in source control is not usually a good idea, but maybe the C++ source code files can be saved in source control. At least the diffs would be readable.

The Python extension is designed to dynamically pick up shared libraries with file-format specifics. The first step it takes is to load a so/dylib/dll shared library file representing one file format, produced by a KSY. So how about this (for now, at least):

The Python packages can be arranged like this: generic awkward_kaitai is independent of file format; it has the generic code to load a so/dylib/dll as a Python class instance with a load method for reading raw data. Meanwhile, collaboration-specific libraries like awkward_kaitai_sdss and awkward_kaitai_cta can list awkward_kaitai as a dependency and carry the platform-specific so/dylib/dll in their wheel.

awkward_cpp is an example of a library that just carries platform-specific compiled code. It has a very thin Python wrapper around it to identify where the so/dylib/dll is, within its installation, and load it: https://github.com/scikit-hep/awkward/blob/main/awkward-cpp/src/awkward_cpp/cpu_kernels.py. In particular, importlib_resources.files(awkward_cpp) / "lib" / name finds the path of the bundled so/dylib/dll and ctypes.cdll.LoadLibrary(str(libpath)) loads it. awkward_kaitai also works by loading functions with C signatures through ctypes.
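As a rough illustration of that loading pattern, here is a minimal sketch, not the actual awkward_cpp or awkward_kaitai code. It uses the standard-library `importlib.resources` (Python 3.9+) in place of the `importlib_resources` backport, and the package name, the library stem, and the `lib` prefix naming convention are all assumptions.

```python
import ctypes
import sys
from importlib import resources


def shared_library_name(stem: str, platform: str = sys.platform) -> str:
    """Return a conventional shared-library file name for the given platform."""
    if platform.startswith("win"):
        return f"{stem}.dll"
    if platform == "darwin":
        return f"lib{stem}.dylib"
    return f"lib{stem}.so"


def load_bundled_library(package: str, stem: str) -> ctypes.CDLL:
    """Locate a library shipped in <package>/lib/ and load it through ctypes."""
    libpath = resources.files(package) / "lib" / shared_library_name(stem)
    return ctypes.cdll.LoadLibrary(str(libpath))


# Hypothetical usage, assuming a wheel that ships lib/libeventio.so (or .dylib/.dll):
# lib = load_bundled_library("awkward_kaitai_cta", "eventio")
```

A format-specific package would only need this thin wrapper plus the compiled library in its wheel, which is what keeps the generic package independent of any one file format.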

zonca commented 1 year ago

I am new to the project, so I need to go step by step. For now I am experimenting with plain Kaitai, without Awkward; see #1.

So, in this simplified case, I think the best thing would be not to commit the autogenerated code to the repository, but to have GitHub Actions generate it in order to run the tests.

Then, when we are ready to release a package on GitHub and PyPI, we can have a GitHub Action include the autogenerated file in the package, so that the user can skip the compilation step.