Closed: westurner closed this 2 years ago
DVC is one workflow system; it has some unique features:
- The `dvc.yaml` file defines the pipeline: https://dvc.org/doc/user-guide/project-structure/pipelines-files#pipelines-files-dvcyaml
- The `dvc.lock` file tracks the workflow state and dependency hashes: https://dvc.org/doc/user-guide/project-structure/pipelines-files#dvclock-file
- `$ dvc metrics`: https://dvc.org/doc/command-reference/metrics#metrics
- Experiment organization patterns: https://dvc.org/doc/user-guide/experiment-management#organization-patterns
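To make the `dvc.yaml` point concrete, here is a minimal sketch of a two-stage pipeline; the stage names, scripts, and paths are hypothetical, not from this project:

```yaml
# Hypothetical DVC pipeline: stage names, scripts, and paths are illustrative.
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
```

`dvc repro` hashes each stage's `deps` into `dvc.lock` and re-runs only stages whose inputs changed; `dvc metrics show` reads the declared `metrics.json`.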
MLflow is another ML workflow system.
This is primarily beyond the scope of this project.
Components that may or may not save work or be of additional value:
- dask can run things with various schedulers:
  - dask-mpi
  - https://docs.dask.org/en/latest/deploying-hpc.html#dask-jobqueue-and-dask-drmaa
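As a minimal sketch of that scheduler point: the same dask task graph can run unchanged on the local threaded scheduler or on a cluster scheduler provided by dask-mpi / dask-jobqueue. Only the local scheduler is shown here; the graph itself is a toy example:

```python
import dask

# Build a small lazy task graph; nothing executes until .compute()
@dask.delayed
def add(x, y):
    return x + y

total = add(add(1, 2), 3)

# Run on the local threaded scheduler; dask-mpi or dask-jobqueue would
# supply a distributed scheduler for the same, unchanged graph.
print(total.compute(scheduler="threads"))
```

Swapping schedulers is a deployment choice, not a code change, which is the main reason dask is listed here.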
- `ipython nb.ipynb`
- or a test runner.

Anyways, you've probably already looked at dask[-ml,] and {JupyterLab, code-server (VSCode)} hosted next to the data, with containers, or remotely querying and paging with DuckDB over HTTPS.