NCAR / CUPiD

CUPiD is a “one stop shop” that enables and integrates timeseries file generation, data standardization, diagnostics, and metrics from all CESM components.
https://ncar.github.io/CUPiD/
Apache License 2.0

Work through engineering concerns #31

Closed · TeaganKing closed this issue 4 months ago

TeaganKing commented 7 months ago

In the bare-bones deployment, we need to be cognizant of engineering concerns, such as the following:

rmshkv commented 6 months ago

To elaborate on considerations for computational resource allocation:

rmshkv commented 6 months ago

And some more thoughts on environments:

Currently, notebooks are by default run in the environment specified by `default_kernel_name` under `computation_config` in `config.yml`. Each individual notebook can also specify its own environment via the `kernel_name` key in its entry under `compute_notebooks`. These environments must already be installed on the user's machine for this to work (this is checked before the notebooks are run).
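
For illustration, a minimal sketch of how these keys might be laid out in `config.yml` (the kernel and notebook names here are hypothetical; only the key names come from the description above):

```yaml
computation_config:
  default_kernel_name: cupid-analysis   # fallback environment for all notebooks

compute_notebooks:
  ocean_surface:
    kernel_name: cupid-analysis-ocn     # per-notebook override of the default
  atmosphere_summary: {}                # no kernel_name given, so the default is used
```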

An idea that was floated at one point was having the notebooks run in the active environment by default (see https://github.com/rmshkv/nbscuid/issues/24).

Another consideration is which environment nbscuid (or whatever we end up calling the main run engine) is installed in vs. which environment the notebooks need to run in. I've been keeping these separate, but in the future it would probably be best to have one common environment that contains all the necessary analysis packages as well as nbscuid to minimize setup steps and confusion.
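
A sketch of what a single combined environment file might look like (the file name and package list are illustrative only, assuming `nbscuid` is pip-installable):

```yaml
# environments/cupid-combined.yml (hypothetical): one environment holding the
# run engine plus the analysis stack, so users only create/activate one thing.
name: cupid-combined
channels:
  - conda-forge
dependencies:
  - python=3.11
  - dask
  - xarray
  - matplotlib
  - pip
  - pip:
      - nbscuid   # the run engine, installed alongside the analysis packages
```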

mnlevy1981 commented 5 months ago

For parallelization, I think it would be useful to require users to request all compute resources ahead of time rather than having each notebook add additional jobs to the queue. To achieve that, we probably want to use LocalCluster objects inside every notebook (and specify in config.yml how big the local cluster should be). So the workflow on the NCAR machine would be "request N cores on casper to run cupid-run, and then have the notebooks use some of those cores as dask workers."
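
As a sketch of what that might look like inside a notebook (`LocalCluster` and `Client` are the standard `dask.distributed` API; the `dask_cluster_size` config key is hypothetical):

```python
# Minimal sketch: spin up a fixed-size LocalCluster inside a notebook,
# sized from a value read out of config.yml, rather than letting each
# notebook submit additional jobs to the queue.
import yaml
from dask.distributed import Client, LocalCluster

with open("config.yml") as f:
    config = yaml.safe_load(f)

# hypothetical key; the point is that cluster size lives in config.yml
n_workers = config["computation_config"].get("dask_cluster_size", 4)

cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
client = Client(cluster)  # subsequent dask operations run on these workers
```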

It might be the case that the maximum size for the local cluster is N-2 when submitting a job on N cores; in dask-mpi, one core is reserved to actually run the python code, a second one is reserved for the dask task manager, and then the rest of the cores can be workers. I suspect we will have to look at timing numbers and play with the configuration some if we go this route.
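
To make the accounting concrete, a sketch assuming the dask-mpi style reservation carries over:

```python
# Of N cores in a job, one runs the Python client process and one runs
# the dask scheduler/task manager; the rest can be dask workers.
def max_dask_workers(n_cores: int) -> int:
    return max(n_cores - 2, 1)

# e.g. a 36-core request on casper would leave up to 34 worker cores
assert max_dask_workers(36) == 34
```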

mnlevy1981 commented 4 months ago

We've been providing YAML environment files in environments/ for a while now, and #61 introduced LocalCluster into a couple of notebooks (we also settled on the serial ploomber executor for now).