Closed: westurner closed this 2 years ago
DVC is one workflow system; it has some unique features:
- The `dvc.yaml` file defines the pipeline: https://dvc.org/doc/user-guide/project-structure/pipelines-files#pipelines-files-dvcyaml
- The `dvc.lock` file tracks the workflow state and dependency hashes: https://dvc.org/doc/user-guide/project-structure/pipelines-files#dvclock-file
- `$ dvc metrics`: https://dvc.org/doc/command-reference/metrics#metrics
- Experiment organization patterns: https://dvc.org/doc/user-guide/experiment-management#organization-patterns
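To make the `dvc.yaml` point concrete, here is a minimal sketch of a two-stage pipeline; the stage names, scripts, and paths are hypothetical, not from this project:

```yaml
# Hypothetical DVC pipeline: stage names, scripts, and paths are illustrative.
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
```

`dvc repro` hashes each stage's `deps` into `dvc.lock` and re-runs only stages whose inputs changed; `dvc metrics show` reads the declared `metrics.json`.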
MLflow is another ML workflow system.
This is primarily beyond the scope of this project.
Components that may or may not save work or be of additional value:
- dask can run things with various schedulers:
  - dask-mpi
  - https://docs.dask.org/en/latest/deploying-hpc.html#dask-jobqueue-and-dask-drmaa
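As a minimal sketch of that scheduler point: the same dask task graph can run unchanged on the local threaded scheduler or on a cluster scheduler provided by dask-mpi / dask-jobqueue. Only the local scheduler is shown here; the graph itself is a toy example:

```python
import dask

# Build a small lazy task graph; nothing executes until .compute()
@dask.delayed
def add(x, y):
    return x + y

total = add(add(1, 2), 3)

# Run on the local threaded scheduler; dask-mpi or dask-jobqueue would
# supply a distributed scheduler for the same, unchanged graph.
print(total.compute(scheduler="threads"))
```

Swapping schedulers is a deployment choice, not a code change, which is the main reason dask is listed here.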
- `ipython nb.ipynb`
- or a test runner.

Anyways, you've probably already looked at dask[-ml,] and {JupyterLab, code-server (VSCode)} hosted next to the data, with containers, or remotely querying and paging with DuckDB over HTTPS.