This PR adds functionality for generating the training data that Proto-X needs. It introduces the `protox embedding datagen` module. Example invocation: `task.py ... embedding datagen ...`

Summary: `task.py ... embedding datagen ...` (step 1 of the "embedding" stage of Proto-X) now works without crashing. It also updates the `tpch_embedding_traindata.parquet` file in `[workspace]/data/` so that you can run `task.py ... embedding train ...` immediately afterward without any extra configuration.
Demo: In the demo, we start from a "clean slate" system (where `[workspace]/data/` does not have `tpch_embedding_traindata.parquet`). We successfully run `task.py ... embedding datagen ...` and then `task.py ... embedding train ...` immediately after, with no intermediate steps in between.

https://github.com/cmu-db/dbgym/assets/20631215/4d7c40d6-0cd0-441f-ab83-cd4c272443c3
Details

I added a new utility function called `link_result()`, which creates a symlink in `[workspace]/data/` to a file generated inside a `[workspace]/task_runs/run_*/` dir.
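For illustration, here is a minimal sketch of what such a helper might look like. The parameter names and error handling are assumptions, not the actual dbgym implementation:

```python
# Hypothetical sketch of a link_result() helper (not the exact dbgym code).
from pathlib import Path


def link_result(workspace_dpath: Path, result_fpath: Path) -> Path:
    """Symlink a file produced under [workspace]/task_runs/run_*/ into [workspace]/data/."""
    data_dpath = workspace_dpath / "data"
    data_dpath.mkdir(parents=True, exist_ok=True)
    symlink_fpath = data_dpath / result_fpath.name
    # Remove any stale symlink left over from a previous run before re-linking.
    if symlink_fpath.is_symlink() or symlink_fpath.exists():
        symlink_fpath.unlink()
    symlink_fpath.symlink_to(result_fpath.resolve())
    return symlink_fpath
```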
I chose to move the logic that turns the data dir into an `out.parquet` file into this step (datagen) instead of train. I felt it was cleaner for the `out.parquet` file to be the "result" rather than the entire directory.
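As a rough illustration of what that consolidation amounts to, here is a sketch that merges a directory of per-table parquet files into a single `out.parquet`. The use of pandas and the function name are assumptions, not the PR's actual code:

```python
# Illustrative only -- not the actual dbgym code. Combine the per-table parquet
# files in a run's data directory into a single out.parquet.
from pathlib import Path

import pandas as pd


def consolidate_data_dir(data_dpath: Path) -> Path:
    out_fpath = data_dpath / "out.parquet"
    part_fpaths = sorted(p for p in data_dpath.glob("*.parquet") if p != out_fpath)
    df = pd.concat((pd.read_parquet(p) for p in part_fpaths), ignore_index=True)
    df.to_parquet(out_fpath)
    return out_fpath
```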
I renamed the old `sample_limit` to `file_limit` and the old `batch_limit` to `sample_limit` to better reflect what they do.
I removed the `--tables` arg so that we always process all tables.
I changed how `sample_limit` (i.e., the old `batch_limit`) is specified. There is now an int called `default_sample_limit` and a comma-separated "dictionary" called `override_sample_limit`, which specifies per-table overrides of the sample limit.
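To illustrate how the two values could combine into per-table limits, here is a hedged sketch. The flag syntax (alternating `table,limit` pairs) and the helper name are assumptions; check the CLI help for the real format:

```python
# Illustrative parser; the exact override_sample_limit syntax is an assumption here.
def compute_sample_limits(
    all_tables: list[str],
    default_sample_limit: int,
    override_sample_limit: str | None,
) -> dict[str, int]:
    limits = {table: default_sample_limit for table in all_tables}
    if override_sample_limit:
        parts = override_sample_limit.split(",")
        assert len(parts) % 2 == 0, "expected alternating table,limit pairs"
        for table, limit in zip(parts[0::2], parts[1::2]):
            limits[table] = int(limit)
    return limits


# Example: every table uses the default of 10000 samples except lineitem and orders.
limits = compute_sample_limits(
    ["lineitem", "orders", "customer"], 10000, "lineitem,100000,orders,50000"
)
```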
In this PR, I assume Postgres is already set up. A future PR will manage Postgres installation and building, pg_ctl start/stop, ports, pgdata, and [benchmark].tgz files.
There are 3 bugs that sometimes happen during the `task.py ... embedding train ...` step. I documented these in the issues of cmu-db/dbgym.