SAME-Project / same-project

https://sameproject.ml/
Apache License 2.0
19 stars 8 forks source link

Functions can't refer to other functions #69

Closed lukemarsden closed 2 years ago

lukemarsden commented 2 years ago
def foo():
  pass

def bar():
  foo()

bar()

doesn't work when translating from notebook -> pipeline

Bubblyworld commented 2 years ago

Think I'm starting to work out what's going on here. Basically we aggregate the user's code blocks for each step into a python script that's run with exec(__code, __variables_to_mount, __loc), where __variables_to_mount extracts local variables from the previous pipeline step to simulate the name resolution rules in Jupyter notebooks, and __loc is just an empty namespace (see the kubeflow jinja templates for context). According to the docs for exec:

If globals and locals are given, they are used for the global and local variables, respectively. If provided, locals can be any mapping object. Remember that at the module level, globals and locals are the same dictionary. If exec gets two separate objects as globals and locals, the code will be executed as if it were embedded in a class definition.

Since we're giving both a locals and a globals map in our call, the code is executed as if it's in the scope of a class definition. But class definitions have special name resolution rules:

Class definition blocks and arguments to exec() and eval() are special in the context of name resolution. [...] The scope of names defined in a class block is limited to the class block; it does not extend to the code blocks of methods – this includes comprehensions and generator expressions since they are implemented using a function scope.

In other words, I think these bugs are happening for the same reason something like this doesn't work:

class A:
  def a(self):
    pass

  def b(self):
    a()

x = A()
x.b()
Traceback (most recent call last):
  File "/dev/fd/63", line 9, in <module>
  File "/dev/fd/63", line 6, in b
NameError: name 'a' is not defined
Bubblyworld commented 2 years ago

Haven't thought of a plan to actually fix this yet (while keeping multistep pipelines working - we still need to extract the locals from a step and inject them into the next step). The idea of inlining the user's code directly instead of using exec() could work, but extracting the right locals might be tricky (we have a bunch of other variables/functions defined in the jinja templates, for instance).

lukemarsden commented 2 years ago

Interesting, excellent digging! Maybe we can just merge new locals into the next step's globals..? And not pass two dictionaries in?

aronchick commented 2 years ago

Hi! Sorry I would have been more than happy to walk through stuff to save you time :)

Luke had a good idea to template the users code into temp file and the import it at the point here - then all we have to inline with the eval of the context from the previous step. I suspect that importing will be more inline with the class rules you quote earlier, but honestly no idea until we test. It also feels like a much cleaner way but would love your thoughts!

Bubblyworld commented 2 years ago

Hey, no worries, software archaeology is an old hobby! =P

Yeah, so as far as I can see both options work - exec(code, ctx) will run the code in a fresh execution frame where locals() = globals() = ctx, which I think is exactly the same as importing a module where ctx has been inlined into globals().

The exec option probably makes dealing with context easier though, so I'll give that a try in the morning 🙏