NCAR / CUPiD

CUPiD is a “one stop shop” that enables and integrates timeseries file generation, data standardization, diagnostics, and metrics from all CESM components.
https://ncar.github.io/CUPiD/
Apache License 2.0

Work through engineering concerns #31

Closed · TeaganKing closed this issue 4 months ago

TeaganKing commented 7 months ago

In the bare-bones deployment, we need to be cognizant of engineering concerns, such as the following:

rmshkv commented 6 months ago

To elaborate on considerations for computational resource allocation:

rmshkv commented 6 months ago

And some more thoughts on environments:

Currently, notebooks are by default run in the environment specified by `default_kernel_name` under `computation_config` in `config.yml`. Each individual notebook can also specify its own environment via the `kernel_name` key in its entry under `compute_notebooks`. These environments must already be installed on the user's machine for this to work (this is checked before the notebooks are run).
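
For illustration, a minimal sketch of how these keys might be laid out in `config.yml` (the kernel and notebook names here are hypothetical; only the key names come from the description above):

```yaml
computation_config:
  default_kernel_name: cupid-analysis   # fallback environment for all notebooks

compute_notebooks:
  ocean_surface:
    kernel_name: cupid-analysis-ocn     # per-notebook override of the default
  atmosphere_summary: {}                # no kernel_name given, so the default is used
```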

An idea that was floated at one point was having the notebooks run in the active environment by default (see https://github.com/rmshkv/nbscuid/issues/24).

Another consideration is which environment nbscuid (or whatever we end up calling the main run engine) is installed in vs. which environment the notebooks need to run in. I've been keeping these separate, but in the future it would probably be best to have one common environment that contains all the necessary analysis packages as well as nbscuid to minimize setup steps and confusion.
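
A sketch of what a single combined environment file might look like (the file name and package list are illustrative only, assuming `nbscuid` is pip-installable):

```yaml
# environments/cupid-combined.yml (hypothetical): one environment holding the
# run engine plus the analysis stack, so users only create/activate one thing.
name: cupid-combined
channels:
  - conda-forge
dependencies:
  - python=3.11
  - dask
  - xarray
  - matplotlib
  - pip
  - pip:
      - nbscuid   # the run engine, installed alongside the analysis packages
```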

mnlevy1981 commented 5 months ago

For parallelization, I think it would be useful to require users to request all compute resources ahead of time rather than having each notebook add additional jobs to the queue. To achieve that, we probably want to use LocalCluster objects inside every notebook (and specify in config.yml how big the local cluster should be). So the workflow on the NCAR machine would be "request N cores on casper to run cupid-run, and then have the notebooks use some of those cores as dask workers."
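
As a sketch of what that might look like inside a notebook (`LocalCluster` and `Client` are the standard `dask.distributed` API; the `dask_cluster_size` config key is hypothetical):

```python
# Minimal sketch: spin up a fixed-size LocalCluster inside a notebook,
# sized from a value read out of config.yml, rather than letting each
# notebook submit additional jobs to the queue.
import yaml
from dask.distributed import Client, LocalCluster

with open("config.yml") as f:
    config = yaml.safe_load(f)

# hypothetical key; the point is that cluster size lives in config.yml
n_workers = config["computation_config"].get("dask_cluster_size", 4)

cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
client = Client(cluster)  # subsequent dask operations run on these workers
```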

It might be the case that the maximum size for the local cluster is N-2 when submitting a job on N cores; in dask-mpi, one core is reserved to actually run the python code, a second one is reserved for the dask task manager, and then the rest of the cores can be workers. I suspect we will have to look at timing numbers and play with the configuration some if we go this route.
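
To make the accounting concrete, a sketch assuming the dask-mpi style reservation carries over:

```python
# Of N cores in a job, one runs the Python client process and one runs
# the dask scheduler/task manager; the rest can be dask workers.
def max_dask_workers(n_cores: int) -> int:
    return max(n_cores - 2, 1)

# e.g. a 36-core request on casper would leave up to 34 worker cores
assert max_dask_workers(36) == 34
```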

mnlevy1981 commented 4 months ago

We've been providing YAML environment files in environments/ for a while now, and #61 introduced LocalCluster into a couple of notebooks (we also settled on the serial ploomber executor for now).