EMS-TU-Ilmenau / chefkoch

A compute cluster cuisine for distributed scientific computing in python
Apache License 2.0

Investigate on dynamic importer handling in Python #51

Open ChristophWWagner opened 4 years ago

ChristophWWagner commented 4 years ago

Find out how to compile any Python file (given by filename) into a compiled bytecode container (.pyc) and how to import such .pyc files (which may also lie outside a module) into the Python namespace as an object.

There shall be functions that

JanKirchhofTU commented 4 years ago

This might be a good source to start with.

fabiankrieg commented 4 years ago

He links the following website in the DA which might be of use.

makley273 commented 4 years ago

Compiling should be relatively easy with py_compile or compileall.

Methods for shell and in-Python commands are shown here.

Also here
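As a quick sketch of the compile step (file and function names here are made up for illustration): py_compile can turn a single source file into a .pyc placed at an arbitrary path, which matches the "compile any python file given by filename" requirement above.

```python
import pathlib
import py_compile
import tempfile

# Throwaway source file standing in for a step script.
src = pathlib.Path(tempfile.mkdtemp()) / "step.py"
src.write_text("def run():\n    return 42\n")

# cfile places the bytecode container at an explicit path instead of
# the default __pycache__ location; the return value is that path.
pyc = py_compile.compile(str(src), cfile=str(src.with_suffix(".pyc")))
print(pyc)
```

compileall works the same way but walks whole directory trees instead of single files.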

makley273 commented 4 years ago

Importing .pyc files should generally be possible just like .py files, but .pyc files are highly specific to OS, Python version, architecture, etc. For importing .pyc files from others, decompilers like uncompyle6 or decompyle3 could be used to decompile the .pyc and import the generated .py files instead.
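For the same-interpreter case, a .pyc can be imported directly without any .py file via importlib's SourcelessFileLoader. A self-contained sketch (module and file names are made up; the example compiles its own .pyc so the versions match):

```python
import importlib.machinery
import importlib.util
import pathlib
import py_compile
import tempfile

# Create a sample .pyc to import (stand-in for a step script).
src = pathlib.Path(tempfile.mkdtemp()) / "step.py"
src.write_text("def run():\n    return 42\n")
pyc = py_compile.compile(str(src), cfile=str(src.with_suffix(".pyc")))

# SourcelessFileLoader loads bytecode directly; this only works with
# the same interpreter version that produced the .pyc.
loader = importlib.machinery.SourcelessFileLoader("step", pyc)
spec = importlib.util.spec_from_loader("step", loader)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
print(mod.run())  # -> 42
```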

ChristophWWagner commented 4 years ago

Yes, .pyc files are very short-lived, depending mostly on the local Python version. But since the Python version should be the same within the scope of a chefkoch project, this should not be a hindering factor.

However, it may be a good idea to also keep track of what the original source file was, just in case we later come across a situation where we need to handle special cases from version hell. Good point! I'll take this into consideration.

pegro commented 4 years ago

May I ask what the use case is here? The Python interpreter will automatically create those files when a module is loaded. Is this somehow about trying to avoid compiling? That shouldn't be a huge performance hit, since Python bytecode is not highly optimized.

makley273 commented 4 years ago

> May I ask what the use case is here? The Python interpreter will automatically create those files when a module is loaded. Is this somehow about trying to avoid compiling? That shouldn't be a huge performance hit, since Python bytecode is not highly optimized.

@ChristophWWagner

ChristophWWagner commented 4 years ago

We want to be able to fingerprint Python code without being too affected by refactoring. Since the project focuses on integrity and reproducibility, we need to consider the source code of each processing step to be volatile as well. Therefore, we need to detect whenever changes were made to that code and rebuild everything that depends on it.

From the single-responsibility-principle, it looks like it is the best solution to handle the actual step source just like any other resource and let the Fridge do all the integrity and rebuilding decisions based on the resource of that code.

One could just use the .py file here. However, since we do not really want to rebuild almost everything when only things like timestamps, code formatting, or comments change, the idea is to let the StepPython class handle the specifics: as a specialized version of Item, it handles step scripts as Python bytecode files (which no longer contain comments or formatting). This should greatly increase robustness against automated text-representation modifiers such as git, black, or others.
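The idea can be sketched roughly as follows. This is a hypothetical helper, not the actual StepPython implementation: hash the compiled code object rather than the source text, so that a change confined to a trailing comment leaves the fingerprint untouched while a change to the logic does not.

```python
import hashlib
import marshal

def fingerprint(source: str) -> str:
    """Hash the compiled code object instead of the source text
    (hypothetical sketch, not the actual StepPython implementation)."""
    code = compile(source, "<step>", "exec")
    return hashlib.sha256(marshal.dumps(code)).hexdigest()

# A trailing comment never reaches the bytecode ...
a = fingerprint("def f(x):\n    return x + 1\n")
b = fingerprint("def f(x):\n    return x + 1  # add one\n")
# ... while a real change to the logic does.
c = fingerprint("def f(x):\n    return x + 2\n")
print(a == b, a == c)
```

Note that this is only robust against edits that keep the AST identical; changes that shift line numbers (e.g. inserting a comment line) would still alter the code object.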

pegro commented 4 years ago

I see. But the .pyc still contains a timestamp of the source file, so you would need to look at the actual code object stored inside. Also note that, as you said, comments are omitted from .pyc files, but docstrings are still very much part of the code, to allow access to __doc__.

So either you make sure you omit the header of the .pyc file when hashing/fingerprinting (a Python-version-dependent magic number plus source metadata; 8 bytes in old versions, 16 bytes since Python 3.7, see PEP 552), or you find other means to do it.
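A sketch of the header-skipping variant (helper name made up; the header size is an assumption based on PEP 552 and only handled for Python 3.3+ here). Recompiling the same source with a different mtime changes the header but not the digest of the embedded code object:

```python
import hashlib
import os
import pathlib
import py_compile
import sys
import tempfile

# Since Python 3.7 (PEP 552) the .pyc header is 16 bytes (magic number,
# bit field, source mtime, source size); Python 3.3-3.6 used 12 bytes.
HEADER_SIZE = 16 if sys.version_info >= (3, 7) else 12

def pyc_digest(path):
    """Hash a .pyc while skipping its volatile header (illustrative sketch)."""
    data = pathlib.Path(path).read_bytes()
    return hashlib.sha256(data[HEADER_SIZE:]).hexdigest()

# Compile the same source twice with different source mtimes.
tmp = pathlib.Path(tempfile.mkdtemp())
src = tmp / "step.py"
src.write_text("def run():\n    return 42\n")
pyc1 = py_compile.compile(str(src), cfile=str(tmp / "a.pyc"))
os.utime(src, (1, 1))  # fake an old modification time for the second build
pyc2 = py_compile.compile(str(src), cfile=str(tmp / "b.pyc"))
print(pyc_digest(pyc1) == pyc_digest(pyc2))
```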

Also be careful when you look only at the bytecode stored in a code object (co_code): it doesn't contain constants or variable names, which might change the behavior of the code drastically.
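A quick illustration of that caveat: two functions returning different constants share the exact same co_code, because the values live in co_consts rather than in the bytecode itself.

```python
def f():
    return 1

def g():
    return 2

# The raw bytecode only says "load a constant and return it"; which
# constant is meant is stored in co_consts, not in co_code.
print(f.__code__.co_code == g.__code__.co_code)  # -> True
print(f.__code__.co_consts, g.__code__.co_consts)
```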

I'm not sure how many files we are talking about here, but it might be worth looking at the ast module, which gives access to the AST after parsing the Python code. Based on that, you could find a way to detect changes. I found a discussion on a mailing list using that approach.
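The AST approach could look roughly like this (hypothetical helper name; ast.dump with its default settings already excludes line and column attributes, so pure reformatting and comments do not affect the dump):

```python
import ast
import hashlib

def ast_fingerprint(source: str) -> str:
    """Fingerprint source via its AST dump, ignoring formatting and
    comments (sketch of the approach discussed above)."""
    tree = ast.parse(source)
    # include_attributes=False drops line/column info, so reflowing
    # the code does not change the dump; docstrings still count.
    dump = ast.dump(tree, include_attributes=False)
    return hashlib.sha256(dump.encode()).hexdigest()

a = ast_fingerprint("x = 1 + 2  # comment\n")
b = ast_fingerprint("x = (1 +\n     2)\n")
c = ast_fingerprint("x = 1 + 3\n")
print(a == b, a == c)
```

Compared with hashing bytecode, this sidesteps the .pyc header and interpreter-version issues entirely, at the cost of parsing each file on every check.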