SAME-Project / same-project

https://sameproject.ml/
Apache License 2.0

Discussion: Initial Package Parsing #10

Closed by aronchick 3 years ago

aronchick commented 3 years ago

Today, everything in notebooks executes in a global context, so there's no concept of installing a package for JUST one step. This could be a cheap and cheerful way for us to implement package detection and parsing - just scan the WHOLE file for any imports/package usage, and attach them to every step. However, this is not terribly efficient (ideally, you'd install only the packages required for each individual step).

Should we do everything globally for now, or do the work to be more efficient?

For example:

# + tags=["parameters"]
foo = "bar"

# +
# + tags=["same_step_1"]
import tensorflow

# +
# + tags=["same_step_2"]
import numpy

# +
# + tags=["same_step_3"]
import torch

Step 1 should only have tensorflow, step 2 should only have numpy, and step 3 should only have pytorch. Inside Jupyter, they're all installed globally, but that's not ideal (obviously).
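The "scan the WHOLE file" approach could be sketched with the standard-library `ast` module. This is a minimal illustration, not SAME's actual implementation: it collects every top-level module name imported anywhere in the source, producing the single global set that would then be attached to every step.

```python
import ast

def collect_imports(source: str) -> set[str]:
    """Collect top-level module names from every import in a script.

    Sketch of the global approach: one set for the whole file,
    attached to every step, rather than a per-step set.
    """
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules

script = """
import tensorflow
import numpy
from torch import nn
"""
print(sorted(collect_imports(script)))  # ['numpy', 'tensorflow', 'torch']
```

The per-step version would run the same collection on each step's cells individually, which is where the cross-cell inference problems discussed below come in.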

CodeMonkeyLeet commented 3 years ago

I'm for treating imports as global as a first step. Aggregating imports is probably the first important piece, and there are additional problems to solve before going from imports to package installation: without significant user input, the problem is under-constrained for reproducibility (e.g. conda vs. pip, notebook kernel version, package versions, package repository).
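One concrete piece of that under-constraint: the name you import is often not the name you install. A hedged sketch, using a few well-known mappings (the dict and function names here are illustrative, not a real SAME API):

```python
# Illustrative only: import names and pip distribution names often differ,
# so a naive "pip install <import name>" breaks. A handful of well-known
# examples, not an exhaustive table.
IMPORT_TO_DIST = {
    "cv2": "opencv-python",
    "PIL": "Pillow",
    "sklearn": "scikit-learn",
    "yaml": "PyYAML",
}

def pip_package_for(import_name: str) -> str:
    """Best-effort guess at the pip distribution for an import name."""
    return IMPORT_TO_DIST.get(import_name, import_name)

print(pip_package_for("cv2"))    # opencv-python
print(pip_package_for("numpy"))  # numpy
```

For packages already installed in the environment, Python 3.10+ can recover this mapping with `importlib.metadata.packages_distributions()`, but for a notebook whose environment you're trying to reconstruct, there's no authoritative answer without user input.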

> Step 1 should only have tensorflow, step 2 should only have numpy and step 3 should only have pytorch. Inside Jupyter, they have them all installed globally, but that's not ideal (obviously).

Is that a safe inference? Imports carry across cells until the kernel restarts, and it's pretty common to have a bunch of imports up front in cell 1, with the assumption that the notebook is run sequentially at least once; in even less organized notebooks, the imports may not even be in order if the author was jumping back and forth between cells.

To do cell-level targeting effectively, it feels like we would need language-server-style analysis to infer package use from the symbols referenced in each cell's code.
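The gap is easy to demonstrate with a minimal sketch (again using `ast`; the function here is hypothetical, not part of SAME): a naive per-cell scan only sees imports literally written in that cell, so a cell that merely *uses* a module imported earlier appears to need nothing.

```python
import ast

def cell_imports_and_names(cell_src: str):
    """Return (names bound by imports, bare names read) in one cell."""
    imported, used = set(), set()
    for node in ast.walk(ast.parse(cell_src)):
        if isinstance(node, ast.Import):
            imported.update(a.asname or a.name.split(".")[0] for a in node.names)
        elif isinstance(node, ast.ImportFrom):
            imported.update(a.asname or a.name for a in node.names)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            used.add(node.id)
    return imported, used

# Cell 1 imports numpy; cell 2 only *uses* it.
cell1 = "import numpy as np\nx = np.zeros(3)"
cell2 = "y = np.ones(3)"

imp2, used2 = cell_imports_and_names(cell2)
print(used2 - imp2)  # {'np'} -- cell 2 needs numpy but never imports it
```

Resolving those dangling names back to the cell (and package) that defined them is exactly the whole-notebook symbol analysis described above.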

aronchick commented 3 years ago

Interestingly, I've been able to handle this through my context serialization scheme, but I think I'm with you. Let's just treat them all globally and let folks optimize later. WORST CASE, some containers are bigger than they should be, but in the grand scheme of total throughput, that's pretty small.

aronchick commented 3 years ago

Finished - implemented package lists globally (in the same program run; doing it in the SDK is separate IMO)