This PR adds functionality for selecting the best trained embedding in Proto-X.
It extends previous functionality under the protox embedding train module.
Summary: Embedding analysis + selection (steps 3 and 4 of the "embedding" stage of Proto-X) now work as a single command without crashing. "As a single command" is significant because this previously involved 4 Python scripts and 3 shell scripts. Now, all of these are just Python functions runnable with a single command.
Demo: pat_test.sh does a fast run without crashing (see video). Note that pat_test.sh only contains a single invocation of task.py. After the run, configs, dependencies, and results automatically appear in the run_*/ folder (see image).
I changed my mind about how to handle currently unused Proto-X CLI args. In the previous PR, I got rid of any CLI args that weren't actively being used. Here, I realized that it's better to consult Will at a later time about which CLI args to keep and which to remove.
analyze.txt was renamed to ranges.txt to avoid confusion with the "analysis" phase, which previously involved both stats.txt and analyze.txt.
I decided to combine training, analysis, and selection into the same "task run" so that the "run_*/" dir would have a clean output: a few .pth files. If training and analysis were separate "task runs", much of the code that generates stats.txt and ranges.txt would need to be rewritten. Also, conceptually, the user only cares about the final .pth file.
I combined the start_epoch args used when generating stats.txt and ranges.txt into a single arg.
I am enforcing a single "part" during the analysis phase for now. Supporting multiple parts needs more engineering to spawn multiple processes from within the Python program, since Python doesn't easily allow multi-threading.
The minimal set of files from the Proto-X repository needed for embedding analysis and selection was migrated over.
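To illustrate the "single task run" design described above, here is a minimal sketch, not the code in this PR. All names (`EpochResult`, `train`, `analyze`, `select`, `run_task`), the checkpoint file names, the contents written to stats.txt and ranges.txt, and the use of loss as the selection metric are assumptions made for illustration only. It shows one `start_epoch` value feeding both output files and the three phases sharing one run directory.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EpochResult:
    """One trained checkpoint (hypothetical record, not a type from this PR)."""
    epoch: int
    loss: float        # assumed selection metric: lower is better
    model_path: Path   # where the .pth checkpoint would live


def train(run_dir: Path, num_epochs: int) -> list[EpochResult]:
    """Stand-in for training: writes one empty placeholder .pth per epoch."""
    results = []
    for epoch in range(num_epochs):
        path = run_dir / f"embedder_{epoch}.pth"  # hypothetical file name
        path.write_bytes(b"")  # placeholder for real model weights
        # Toy loss curve that bottoms out at epoch 3.
        loss = abs(epoch - 3) + 1.0
        results.append(EpochResult(epoch, loss, path))
    return results


def analyze(run_dir: Path, results: list[EpochResult], start_epoch: int) -> list[EpochResult]:
    """Write stats.txt and ranges.txt, both filtered by the same start_epoch.

    The file contents here are placeholders; the real formats are not
    described in this PR text.
    """
    eligible = [r for r in results if r.epoch >= start_epoch]
    stats_lines = [f"{r.epoch} {r.loss}" for r in eligible]
    (run_dir / "stats.txt").write_text("\n".join(stats_lines) + "\n")
    losses = [r.loss for r in eligible]
    (run_dir / "ranges.txt").write_text(f"{min(losses)} {max(losses)}\n")
    return eligible


def select(eligible: list[EpochResult]) -> EpochResult:
    """Pick the checkpoint with the lowest (assumed) loss."""
    return min(eligible, key=lambda r: r.loss)


def run_task(run_dir: Path, num_epochs: int, start_epoch: int) -> EpochResult:
    """Train, analyze, and select inside one run directory."""
    results = train(run_dir, num_epochs)
    eligible = analyze(run_dir, results, start_epoch)
    return select(eligible)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        best = run_task(Path(tmp), num_epochs=6, start_epoch=2)
        print(best.epoch, best.model_path.name)
```

Because the intermediate files are produced and consumed within the same call, the caller only needs to look at the returned checkpoint, which matches the point that the user cares about the final .pth file.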
Video: https://github.com/cmu-db/dbgym/assets/20631215/b0d88243-9ada-4552-b870-d8d2c4b033fb