ChristophWWagner opened this issue 4 years ago (status: Open)
This might be a good source to start with.
He links the following website in the DA, which might be of use.
Compiling should be relatively easy with py_compile or compileall.
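As a minimal sketch (the filename recipe_step.py is just a placeholder), py_compile can write the .pyc to a chosen location:

```python
import py_compile

# Compile a single source file; compile() returns the path of the .pyc it wrote.
# cfile chooses the output location instead of the default __pycache__ layout.
pyc_path = py_compile.compile("recipe_step.py", cfile="recipe_step.pyc")
```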
Importing pyc files should generally be possible just like py files, but pyc files are tied to a specific Python (bytecode) version and potentially other environment details. To import pyc files produced elsewhere, decompilers like uncompyle6 or decompyle3 may be used to decompile the pyc and import the generated py files instead.
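A sketch of importing a .pyc that lies outside any package, using the standard library's SourcelessFileLoader (the helper name import_pyc is ours, not from the thread):

```python
import importlib.util
from importlib.machinery import SourcelessFileLoader

def import_pyc(module_name, pyc_path):
    """Load a compiled .pyc file (even outside a package) as a module object."""
    loader = SourcelessFileLoader(module_name, pyc_path)
    spec = importlib.util.spec_from_loader(module_name, loader)
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)
    # optionally: sys.modules[module_name] = module
    return module
```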
Yes, pyc files are very short-lived, depending mostly on the local Python version. Since within the scope of a chefkoch project the Python version should be the same, this should not be a hindering factor.
However, maybe it is a good idea to also keep track of what the original source file was, just in case we later come across a situation where we need to handle special cases from version hell.

Good point! I'll take this into consideration.
May I ask what the use case is here? The Python interpreter will automatically create those files when a module is loaded. Is this somehow about trying to avoid compiling? That shouldn't be a huge performance hit, since Python bytecode is not highly optimized.
@ChristophWWagner
We want to be able to fingerprint Python code without being too affected by refactoring. Since the project focuses on integrity and reproducibility, we need to consider the source code of each processing step to be volatile as well. Therefore we need to detect whenever changes were made to that code and rebuild everything that depends on it.
Following the single-responsibility principle, it looks like the best solution is to handle the actual step source just like any other resource and let the Fridge make all the integrity and rebuilding decisions based on the resource of that code.

One could just use the .py file here. However, since we do not really want to rebuild almost everything when only things like timestamps, code formatting or comments change, the idea is to let the StepPython class handle the specifics, as it is a specialized version of Item, by treating step scripts as Python bytecode files (which no longer contain comments or formatting). This should greatly increase robustness against automated text representation modifiers such as git, black or others.
I see.
But the pyc still contains a timestamp of the source file, so you would need to look at the actual code object stored inside.
Also note that, as you said, comments are omitted from pyc files, but docstrings are still very much part of the code, to allow access to __doc__.
So either you make sure you omit the header of a pyc file when hashing/fingerprinting (historically the first 8 bytes: a 4-byte Python-version-dependent magic number plus a 4-byte timestamp of the source; since Python 3.7 / PEP 552 the header is 16 bytes, adding a flags field and the source size or hash), or you find other means to do it.
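A minimal sketch of that first option, assuming Python 3.7+ (16-byte header); the helper name fingerprint_pyc is hypothetical:

```python
import hashlib

PYC_HEADER_SIZE = 16  # 4-byte magic + 4-byte flags + 8 bytes mtime/size or source hash (PEP 552)

def fingerprint_pyc(pyc_path):
    """Hash a .pyc file while ignoring its volatile header."""
    with open(pyc_path, "rb") as f:
        data = f.read()
    return hashlib.sha256(data[PYC_HEADER_SIZE:]).hexdigest()
```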
Also be careful when you just look at the bytecode stored in a code object (co_code): it doesn't contain constants or variable names, which might change the behavior of the code drastically.
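To illustrate (a small demonstration, not from the original thread): constants and names live in separate fields of the code object, so two snippets with identical co_code can behave very differently:

```python
code = compile("x = 42", "<example>", "exec")
print(code.co_code)    # raw bytecode only
print(code.co_consts)  # constants (e.g. 42) live here, not in co_code
print(code.co_names)   # variable/attribute names (e.g. 'x') live here
```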
So I'm not sure how many files we are talking about here, but it might be worth looking at the ast module, which allows access to the AST after parsing the Python code. Based on that you could find a way to detect changes. I found a discussion on a mailing list using that approach.
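A sketch of that AST-based approach (fingerprint_source is a hypothetical name; note that docstrings remain part of the AST, as discussed above):

```python
import ast
import hashlib

def fingerprint_source(py_path):
    """Fingerprint a .py file via its AST, ignoring comments and formatting."""
    with open(py_path, "r") as f:
        tree = ast.parse(f.read())
    # ast.dump() normalizes away comments and whitespace; docstrings remain
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()
```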
Find out how to compile any Python file (given by filename) into a compiled Python bytecode container (.pyc) and how to import these .pyc files (which may also lie outside a module) into the Python namespace as an object.
There shall be functions that perform the compilation and the import, respectively.
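Combining the two sketches above (py_compile for compiling and the SourcelessFileLoader-based helper for importing), a round trip might look like this (file and module names are placeholders):

```python
# Round trip: compile a source file, then import the resulting .pyc.
pyc_path = py_compile.compile("recipe_step.py", cfile="recipe_step.pyc")
step = import_pyc("recipe_step", pyc_path)  # import_pyc from the sketch above
```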