This PR optimizes the online diagnostic calculation and report generation for use on HPC systems, where many more cores are typically available. To achieve this, I standardized some of the common aggregation functions for taking means and switched to dask.distributed, which offers more fine-grained control over workers and memory usage. Additionally, the diagnostic functions are batched to run concurrently using joblib, while the individual dask tasks are elevated to the top level of the scheduler so that per-task operations and memory usage are tracked properly.
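A minimal sketch of this batching pattern (illustrative only; it assumes a local `dask.distributed.Client` and a stand-in diagnostic function, and is not the PR's actual code):

```python
# Sketch: joblib's threading backend batches the per-diagnostic calls,
# while each call's .compute() submits its dask tasks to the shared
# distributed scheduler, so individual task operations and memory usage
# show up on the scheduler dashboard.
import dask.array as da
import joblib
from dask.distributed import Client


def example_diagnostic(arr):
    # Stand-in for one diagnostic function (e.g., a global time mean).
    return arr.mean().compute()


if __name__ == "__main__":
    # Local cluster for the sketch; with --scheduler-file the diagnostics
    # can attach to an external multi-node cluster instead.
    client = Client()
    arrays = [da.random.random((1000, 1000), chunks=(250, 250)) for _ in range(8)]
    results = joblib.Parallel(n_jobs=4, backend="threading")(
        joblib.delayed(example_diagnostic)(arr) for arr in arrays
    )
    print(results)
```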
Refactored public API:
New flags for controlling compute parallelism in the online prognostic run `compute.py` (see the usage example after this list):
- `--scheduler-file`: provide a dask scheduler file for an external dask.distributed cluster when doing multi-node processing
- `--num-concurrent-2d-calcs`: number of concurrent 2D diagnostics (i.e., diagnostic functions) to execute; defaults to 8
- `--num-concurrent-3d-calcs`: number of concurrent 3D diagnostics (i.e., diagnostic functions) to execute; defaults to 2
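An illustrative invocation of the new flags (the scheduler-file path is a placeholder, and any other arguments `compute.py` requires are elided):

```bash
python compute.py \
    --scheduler-file /path/to/scheduler.json \
    --num-concurrent-2d-calcs 8 \
    --num-concurrent-3d-calcs 2 \
    ...
```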
Significant internal changes:
- Online diagnostics are now calculated on a dask.distributed cluster, with progress visible via the dashboard at `localhost:8787`, and parallelized using joblib over batches of diagnostic function calls
- Diagnostic plot generation has been parallelized using joblib
- The diurnal cycle calculation now uses flox to speed up the dask groupby operation (see the sketch after this list)
- `load_scream_data` now separates out physics-only tendencies for the `due_to_scream_physics` variables, whereas previously they also included the nudging/ML update
- `load_scream_data` now removes the ML precipitation from `PRATEsfc`, since the diagnostics expect this to be a physics-only quantity
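A minimal sketch of the flox-backed groupby pattern referenced above; the Dataset and variable name here are illustrative, not the actual SCREAM output:

```python
import numpy as np
import pandas as pd
import xarray as xr
from flox.xarray import xarray_reduce

# Hourly toy data standing in for model output; PRATEsfc is used only as
# an example variable name.
times = pd.date_range("2020-01-01", periods=24 * 10, freq="h")
ds = xr.Dataset(
    {"PRATEsfc": ("time", np.random.rand(times.size))},
    coords={"time": times},
).chunk({"time": 48})

# Group by hour of day and take the mean in a single dask-friendly pass,
# rather than xarray's default split-apply-combine groupby.
diurnal_cycle = xarray_reduce(ds, ds["time"].dt.hour, func="mean")
print(diurnal_cycle)
```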
Requirement changes:
Bulleted list, if relevant, of any changes to setup.py, requirements.txt, environment.yml, etc.
- [ ] Ran `make lock_deps/lock_pip` following these instructions
- [ ] Add PR review with license info for any additions to `constraints.txt` (example)
- [ ] Tests added
Coverage reports (updated automatically):