This PR adds functionality for using Proto-X to build an embedding from training data.
It assumes various dependencies that will land in subsequent PRs (e.g., training data, workload).
Summary: Embedding training (step 2 of the "embedding" stage of Proto-X) now works without crashing using only config files inside the dbgym repo and dependencies from the dbgym_workspace directory.
Demo: pat_test.sh does a fast run without crashing (see video). Note that pat_test.sh only contains a single invocation of task.py. After the run, configs, dependencies, and results automatically appear in the run_*/ folder (see image).
The minimal set of files from the Proto-X repository needed for embedding training was migrated over.
The open_and_save() abstraction automatically opens either configs or dependencies and saves them inside the run_*/ folder. It handles a wide variety of cases including resolving relative paths and symlinks, distinguishing between configs and dependencies, and symlinking either files or folders intelligently for dependencies.
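The behavior described above can be illustrated with a minimal sketch. This is not the code in this PR: the signature, the `configs/` and `dependencies/` subfolder names, and the rule "a file inside the repo is a config, anything else is a dependency" are all assumptions made for illustration. The sketch also covers only files, not folders.

```python
import os
import shutil
from pathlib import Path


def open_and_save(run_dir: Path, repo_root: Path, path: str, mode: str = "r"):
    """Hypothetical sketch: open `path` and record it under `run_dir`."""
    # Resolve relative paths and symlinks to one canonical absolute path.
    real_path = Path(path).resolve(strict=True)
    repo_root = repo_root.resolve()

    try:
        rel_to_repo = real_path.relative_to(repo_root)
    except ValueError:
        rel_to_repo = None  # the file lives outside the repo

    if rel_to_repo is not None:
        # Assumed rule: files inside the repo are configs, so copy them.
        dest = run_dir / "configs" / rel_to_repo
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(real_path, dest)
    else:
        # Assumed rule: everything else is a dependency, so symlink it
        # instead of copying potentially large data.
        dest = run_dir / "dependencies" / real_path.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        if not dest.is_symlink():
            os.symlink(real_path, dest)

    return open(real_path, mode)
```

Copying configs keeps the run folder reproducible even if the repo changes later, while symlinking dependencies avoids duplicating large files; both choices are part of the sketch, not a statement about the actual implementation.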
Unnecessary CLI args the original Proto-X exposed were removed and sensible defaults were added for the remaining CLI args. Sensible defaults for "path" CLI args point to a path in the [workspace]/data/ directory.
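As a rough illustration of "path" defaults that point into [workspace]/data/, here is a sketch using `argparse`. The argument names (`--dataset-path`, `--seed`) and the file name `traindata.parquet` are hypothetical and are not taken from this PR; the actual CLI may be built with a different library.

```python
import argparse
from pathlib import Path


def build_parser(workspace: Path) -> argparse.ArgumentParser:
    """Hypothetical sketch: path args default to files under workspace/data/."""
    data_dir = workspace / "data"
    parser = argparse.ArgumentParser(description="Embedding training (sketch)")
    # Hypothetical "path" arg: the default lives in [workspace]/data/.
    parser.add_argument(
        "--dataset-path",
        type=Path,
        default=data_dir / "traindata.parquet",
        help="training data file (hypothetical name)",
    )
    # Hypothetical non-path arg with a plain default.
    parser.add_argument("--seed", type=int, default=0)
    return parser
```

With no flags, `build_parser(Path("/some/workspace")).parse_args([])` yields a `dataset_path` under `/some/workspace/data/`, and passing `--dataset-path` explicitly overrides it.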
Ray is now restarted in code instead of manually by the user.
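One way such a restart could look is sketched below. It assumes the restart shells out to the `ray stop` and `ray start --head` CLI commands; this PR may instead restart Ray through the Python API or with different flags. The `run` parameter is a hypothetical hook that makes the function easy to exercise without a Ray installation.

```python
import subprocess
from typing import Callable, List


def restart_ray(run: Callable[..., object] = subprocess.run) -> List[List[str]]:
    """Hypothetical sketch: stop any running Ray processes, then start a head node."""
    stop_cmd = ["ray", "stop"]
    start_cmd = ["ray", "start", "--head"]
    # Stopping is best-effort: there may be nothing to stop.
    run(stop_cmd, check=False)
    # Starting must succeed, so raise if the command fails.
    run(start_cmd, check=True)
    return [stop_cmd, start_cmd]
```

Calling `restart_ray()` at the top of the training entry point would replace the manual step of restarting Ray before each run.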
No refactoring was done in this PR. For instance, significant parts of default_tpch_config.yaml will likely be removed, but that is left for a future PR.
Other generic parts of the system that are not immediately needed were left out. For instance, the git commit hash and invocation command are not yet saved to the run_*/ folder.
https://github.com/cmu-db/dbgym/assets/20631215/0b33f4a4-eb2e-479e-b206-dee421fb7e63