Allow scripts to be run instead of notebooks

NCAR / CUPiD

CUPiD is a “one stop shop” that enables and integrates timeseries file generation, data standardization, diagnostics, and metrics from all CESM components.

https://ncar.github.io/CUPiD/

Apache License 2.0

21 stars 19 forks source link

Allow scripts to be run instead of notebooks #93

Open TeaganKing opened 2 months ago

TeaganKing commented 2 months ago

We'd like to be able to run a script rather than just a notebook with CUPiD.

A good test of this would be to run a sea ice script instead of a sea ice notebook.

This will require saving figures generated from the notebook and reading them in with a standard naming convention to an html file that could show the figures. See #16 for the html component of this.

mnlevy1981 commented 2 months ago

@dabail10 started to work on this and ran into a major stumbling block -- the ploomer script runner doesn't offer an obvious way to change environments before running the script. Dave's example was being run in cupid-dev, which doesn't have the necessary analysis packages installed. Lev will dig into the ploomer documentation and then I'll work with Dave to get a fix in place.

rmshkv commented 2 months ago

As a first pass, having looked through the Ploomber documentation, it doesn't look like there's a supported way to specify an environment for the ScriptRunner task. I'd recommend starting with looking at the documentation for ScriptRunner. If we wanted to go the route of putting in a PR to Ploomber or implementing our own ScriptRunner class, it actually looks like it might be a pretty straightforward update to their code, but that might be out of scope, in which case we should put together a more hacky solution in our own code.

TeaganKing commented 1 month ago

Just for reference, the function which includes ScriptRunner is create_ploomber_script_task.

In the ScriptRunner documentation, it looks like we should be able to specify the environment in the configuration file using something like this, adapted for the script that you have (and I don't entirely remember the structure that you had in the config file thus far):

tasks:
  - source: script.py
  - product: output.txt
  - env: cupid-analysis

So, specifying 'env' in an additional line should ideally work to allow for using the cupid-analysis environment. Maybe we can test it out tomorrow?

rmshkv commented 1 month ago

Hmm, maybe I missed something - could you point me to where in the documentation or Ploomber source code that feature is mentioned?

TeaganKing commented 1 month ago

Apologies, I may have jumped ahead without enough research on this... I saw this in an example, but after another look, I'm now thinking that the example was not accurate and it might not actually work in the format that I commented in the code snippet above. @rmshkv did you already have a particular place in the Ploomber code you were looking at to update?

TeaganKing commented 1 month ago

It looks like the kernelspec_name in NotebookRunner is set roughly here. Perhaps we could add an environment to this part of ScriptRunner?

Or, if we prefer to do something from our own side, could we possibly try something like this from within the script? I'm not sure what the best method is, but wanted to toss out some ideas.

import subprocess
subprocess.run("conda activate cupid-analysis", shell=True, check=True, executable='/bin/bash')

For reference, I'm including the NotebookRunner documentation and ScriptRunner documentation.

TeaganKing commented 1 month ago

From @mnlevy1981 , we could also possibly use something like os.system("conda run -n cupid-analysis python script.py") in run.py to replace line 254 (call to create_ploomber_script_task.

rmshkv commented 1 month ago

Some rough thoughts: I haven't tried any of these particular solutions, but in general, I think the first attempt should be along the lines of modifying Ploomber's code or implementing our own script runner task code if we don't want to contribute to that code base. It'd be nice to keep the interface of specifying the environment in config.yml rather than having to add code to the script itself, and it would also be best to keep the script running mechanism as a Ploomber task, since that task creation step isn't actually where the script code is being run (that happens later when the DAG is built, which takes into account dependencies between tasks if those exist, including between script and notebook tasks).