LineaLabs / lineapy

Move fast from data science prototype to pipeline. Capture, analyze, and transform messy notebooks into data pipelines with just two lines of code.
https://lineapy.org
Apache License 2.0

Quick export Jupyter Notebook artifacts without specifying artifacts #711

Open vikranth22446 opened 2 years ago

vikranth22446 commented 2 years ago

Is your feature request related to a problem? Please describe.

I'd like to quickly export some Jupyter code to a local file (not Airflow-based) without having to specify artifacts. I don't want to save every single variable at the end (I'd like it to detect them automatically).

Describe the solution you'd like

A clear API to write the output without specifying artifacts.

yifanwu commented 2 years ago

Thanks for your input, @vikranth22446!

Just to be clear, "every single variable at the end" means every explicitly assigned variable in the outer scope, i.e., of the form

a_var = ...

And not any variables in if statements, loops, functions, etc.

Thanks for clarifying!
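
To make the distinction above concrete, here is a minimal sketch that lists only the names assigned directly in the outer scope of a piece of source code. It uses the standard-library `ast` module and is not how LineaPy detects variables; the helper name `outer_scope_assignments` is hypothetical.

```python
import ast


def outer_scope_assignments(source: str) -> list[str]:
    """Return names bound by plain assignments at the top level of `source`."""
    names: list[str] = []
    # Only look at direct children of the module, so the bodies of
    # if statements, loops, and functions are deliberately skipped.
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets = [node.target]
        else:
            continue
        for target in targets:
            # Walk the target to handle tuple unpacking such as `a, b = ...`;
            # keep only names that are being stored, not ones being read.
            for sub in ast.walk(target):
                if (
                    isinstance(sub, ast.Name)
                    and isinstance(sub.ctx, ast.Store)
                    and sub.id not in names
                ):
                    names.append(sub.id)
    return names
```

With this sketch, a cell containing `a_var = ...` at the top level contributes `a_var`, while an assignment inside a `for` loop or a function body contributes nothing.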

vikranth22446 commented 2 years ago

Yes, just the exposed variables in the outermost scope would be helpful for my use case. I want to quickly add lineapy to a list of notebooks I'm using.

yifanwu commented 2 years ago

> Yes, just the exposed variables in the outermost scope would be helpful for my use case. I want to quickly add lineapy to a list of notebooks I'm using.

Outermost scope makes sense! Would you mind sharing more about what the end-to-end workflow is? Specifically, once you clean up these notebooks, what do you plan to do next? This would help give us a sense of user scenarios for prioritization.

Thanks!

vikranth22446 commented 2 years ago

One sample workflow I'm using for an ML project:

  1. One job generates pickles of all the dataloaders in one Jupyter notebook
  2. Take those pickles and process them in a separate Jupyter notebook to run experiments
  3. The experiments are run via subprocess Slurm calls
  4. A separate notebook interprets the output of those results

For steps such as pickle generation, I'd like it if LineaPy saved an output file that I could hook into.

yifanwu commented 2 years ago

Sorry for the delay in response!

Thanks for sharing your workflow. I have a couple of follow-ups for your use case.

  1. If we followed your suggestion to save all the open variables, we would eagerly pickle all of them. That might be a storage hit to your machine (e.g., if one is a large data frame) and would make our artifact catalog rather bloated as well.
     a. Is that a UX that's OK with you?
     b. To clarify, would you use LineaPy's artifact store, or would you prefer to manage your own pickles?
     c. Alternatively, you could extract code that writes to the file system (assuming you've already pickled the variables), or provide the variables of interest via the command line. Would that be something you'd use instead?
  2. For us to work with past notebooks (that were run without LineaPy), we'd have to actually execute them again. Would that be OK with your workflow?

vikranth22446 commented 2 years ago

  1. a) I'd be okay with saving all variables to the artifact store. However, for me, a better experience would be an artifact store that cleans up after the export. At the end of the file, I'd like to quickly export the code. I care more about the clean code than the variables themselves.
     b) These pickles are generated using an external subprocess. They are not variables in the current Jupyter notebook, but separate files.
     c) I don't know what this would look like. Do you have a snippet?

  2. I'm fine with re-executing them. I'll just reduce the iteration size in that case. The biggest benefit is the generation of a file that I can use for other executions.
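
For reference, one possible shape for option (c), where the variables of interest are passed on the command line, is sketched below. It uses only the standard library and is an illustration, not a LineaPy API: `parse_names`, `export_variables`, and the `--var` and `--out-dir` flags are all hypothetical names.

```python
import argparse
import pickle
from pathlib import Path


def parse_names(argv):
    """Parse the (hypothetical) command-line flags naming what to export."""
    parser = argparse.ArgumentParser(
        description="Pickle selected variables to local files"
    )
    parser.add_argument(
        "--var", action="append", default=[],
        help="name of a variable to export (may be repeated)",
    )
    parser.add_argument(
        "--out-dir", default="exports",
        help="directory that receives one .pkl file per variable",
    )
    return parser.parse_args(argv)


def export_variables(namespace, names, out_dir):
    """Pickle only the requested names from `namespace` into `out_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name in names:
        if name not in namespace:
            raise KeyError(f"{name!r} is not defined in the namespace")
        path = out / f"{name}.pkl"
        with path.open("wb") as fh:
            pickle.dump(namespace[name], fh)
        written.append(path)
    return written
```

In a notebook, `namespace` could be `globals()`, so that `export_variables(globals(), ["train_loader"], "exports")` writes `exports/train_loader.pkl` for a later notebook or job to load, without saving every other variable.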

Vtalike commented 6 months ago

Is your feature request related to a problem? Please describe.

I'd like to quickly export some Jupyter code to a local file (not Airflow-based) without having to specify artifacts. I don't want to save every single variable at the end (I'd like it to detect them automatically).

Describe the solution you'd like

A clear API to write the output without specifying artifacts.