LineaLabs / lineapy

Move fast from data science prototype to pipeline. Capture, analyze, and transform messy notebooks into data pipelines with just two lines of code.
https://lineapy.org
Apache License 2.0
664 stars 58 forks

Pseudo-random generator prevents lineapy from capturing all relevant code #885

Open VolodymyrOrlov opened 1 year ago

VolodymyrOrlov commented 1 year ago

Python version: 3.8
lineapy version: 0.2.3

Your code: What code did you try to run with lineapy?

# Cell 1
import lineapy
import random

my_init_value = 3
my_var = None
if random.random() <= 0.5:
    my_var = my_init_value
else:
    my_var = 1
print(my_var)
# Cell 2
lineapy.save(my_var, "my_var")
# Cell 3
print(lineapy.get("my_var").get_code())

Issue: What went wrong when trying to run this code? The last cell prints the code that lineapy captured:

import random

if random.random() <= 0.5:
    my_var = my_init_value
else:
    my_var = 1

This code is not self-sufficient: my_init_value is never defined. What happens if my_init_value is a complex function or an important hyperparameter that strongly influences the model result? I expected the captured code to be:

import random

my_init_value = 3
my_var = None
if random.random() <= 0.5:
    my_var = my_init_value
else:
    my_var = 1

dorx commented 1 year ago

Thanks for filing the bug, @VolodymyrOrlov! Our support for control flow is experimental. We will look into this issue.

aayan636 commented 11 months ago

Hi @VolodymyrOrlov, the issue you are facing is related to LineaPy's support for control flow structures, which, as @dorx mentioned, is experimental at this stage.

A bit of background: LineaPy relies on dynamic analysis of your program to figure out the dependencies between lines of code. That is, LineaPy executes your program and analyzes the interactions between objects to build the Linea Graph, which is then processed to generate the cleaned-up version of the original code.

In your example, the condition of the if statement is random.random() <= 0.5, which can be either true or false depending on the value random.random() returns at runtime. When the condition evaluates to false, the else branch is taken, and my_var's final value does not depend on my_init_value, so no dependency on my_init_value = 3 is recorded for that run. For programs with control flow statements, LineaPy's slicing is an overapproximation: it includes every line of the if/else block, which is why the entire block appears in the output.
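To make the branch-dependent behavior concrete, here is a toy sketch of dynamic dependency tracking. It is not LineaPy's implementation: the Tracked class and the snippet function are hypothetical names invented for this illustration, and the random draw is passed in as an argument so that each branch can be forced.

```python
class Tracked:
    """A value plus the set of source lines it was observed to depend on."""

    def __init__(self, value, deps=frozenset()):
        self.value = value
        self.deps = frozenset(deps)


def snippet(draw):
    """Re-run the if/else from the report with a fixed draw instead of random.random()."""
    my_init_value = Tracked(3, {"my_init_value = 3"})
    if draw <= 0.5:
        # True branch: my_var is read from my_init_value, so the dependency is observed.
        my_var = Tracked(my_init_value.value, my_init_value.deps)
    else:
        # False branch: my_var comes from a literal, so nothing links it to my_init_value.
        my_var = Tracked(1)
    return my_var


true_run = snippet(0.2)   # forces the true branch
false_run = snippet(0.9)  # forces the false branch

print(true_run.value, sorted(true_run.deps))
print(false_run.value, sorted(false_run.deps))
```

In this toy model, a tracker that only sees one execution can record the link to my_init_value only when the true branch actually runs, which matches the behavior described above.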

In your example, if you rerun the code until the true branch is taken, you will see that the my_init_value = 3 line is included.
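Rather than rerunning by hand, one option is to seed the generator so the true branch is taken deterministically. The sketch below uses only the standard library; seeding is an assumption added here, not something suggested in this thread, and the helper name first_seed_taking_true_branch is hypothetical. Whether LineaPy then captures my_init_value = 3 follows from the explanation above and is not verified by this snippet.

```python
import random


def first_seed_taking_true_branch(max_seed=1000):
    """Return the first seed whose first random.random() draw is <= 0.5."""
    for seed in range(max_seed):
        random.seed(seed)
        if random.random() <= 0.5:
            return seed
    raise RuntimeError("no seed below max_seed takes the true branch")


seed = first_seed_taking_true_branch()

# Re-seed so the draw inside the if statement repeats the one found above.
random.seed(seed)
my_init_value = 3
my_var = None
if random.random() <= 0.5:
    my_var = my_init_value
else:
    my_var = 1
print(seed, my_var)
```

Calling random.seed(seed) at the top of Cell 1 would make the reproduction repeatable across runs, at the cost of no longer exercising the else branch.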