This PR adds functionality for generating the training data that Proto-X needs. It introduces the `protox embedding datagen` module. Example invocation: `task.py ... embedding datagen ...`

Summary: `task.py ... embedding datagen ...` (step 1 of the "embedding" stage of Proto-X) now works without crashing. It also updates the `tpch_embedding_traindata.parquet` file in `[workspace]/data/` so that you can run `task.py ... embedding train ...` immediately afterward without any extra configuration.
Demo: In the demo, we start from a "clean slate" system (where `[workspace]/data/` does not have `tpch_embedding_traindata.parquet`). We successfully run `task.py ... embedding datagen ...` and then `task.py ... embedding train ...` immediately after, with no intermediate steps in between.

https://github.com/cmu-db/dbgym/assets/20631215/4d7c40d6-0cd0-441f-ab83-cd4c272443c3
Details

I added a new utility function called `link_result()`, which creates a symlink in `[workspace]/data/` to a file generated inside a `[workspace]/task_runs/run_*/` dir.
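For illustration, here is a minimal sketch of what such a helper might look like. The parameter names and error handling are assumptions, not the actual dbgym implementation:

```python
# Hypothetical sketch of a link_result() helper (not the exact dbgym code).
from pathlib import Path


def link_result(workspace_dpath: Path, result_fpath: Path) -> Path:
    """Symlink a file produced under [workspace]/task_runs/run_*/ into [workspace]/data/."""
    data_dpath = workspace_dpath / "data"
    data_dpath.mkdir(parents=True, exist_ok=True)
    symlink_fpath = data_dpath / result_fpath.name
    # Remove any stale symlink left over from a previous run before re-linking.
    if symlink_fpath.is_symlink() or symlink_fpath.exists():
        symlink_fpath.unlink()
    symlink_fpath.symlink_to(result_fpath.resolve())
    return symlink_fpath
```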
I chose to move the logic that turns the data dir into an `out.parquet` file into this step (datagen) instead of train. I felt it was cleaner for the `out.parquet` file to be the "result" rather than the entire directory.
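As a rough illustration of what that consolidation amounts to, here is a sketch that merges a directory of per-table parquet files into a single `out.parquet`. The use of pandas and the function name are assumptions, not the PR's actual code:

```python
# Illustrative only -- not the actual dbgym code. Combine the per-table parquet
# files in a run's data directory into a single out.parquet.
from pathlib import Path

import pandas as pd


def consolidate_data_dir(data_dpath: Path) -> Path:
    out_fpath = data_dpath / "out.parquet"
    part_fpaths = sorted(p for p in data_dpath.glob("*.parquet") if p != out_fpath)
    df = pd.concat((pd.read_parquet(p) for p in part_fpaths), ignore_index=True)
    df.to_parquet(out_fpath)
    return out_fpath
```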
I renamed the old `sample_limit` to `file_limit` and the old `batch_limit` to `sample_limit` to better reflect what they do.
I removed the `--tables` arg so that we always process all tables.
I changed how `sample_limit` (i.e., the old `batch_limit`) is specified. There is now an int called `default_sample_limit` and a comma-separated "dictionary" called `override_sample_limit`, which specifies per-table overrides of the sample limit.
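To illustrate how the two values could combine into per-table limits, here is a hedged sketch. The flag syntax (alternating `table,limit` pairs) and the helper name are assumptions; check the CLI help for the real format:

```python
# Illustrative parser; the exact override_sample_limit syntax is an assumption here.
def compute_sample_limits(
    all_tables: list[str],
    default_sample_limit: int,
    override_sample_limit: str | None,
) -> dict[str, int]:
    limits = {table: default_sample_limit for table in all_tables}
    if override_sample_limit:
        parts = override_sample_limit.split(",")
        assert len(parts) % 2 == 0, "expected alternating table,limit pairs"
        for table, limit in zip(parts[0::2], parts[1::2]):
            limits[table] = int(limit)
    return limits


# Example: every table uses the default of 10000 samples except lineitem and orders.
limits = compute_sample_limits(
    ["lineitem", "orders", "customer"], 10000, "lineitem,100000,orders,50000"
)
```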
In this PR, I assume Postgres is already set up. A future PR will manage Postgres installation and building, pg_ctl start/stop, ports, pgdata, and [benchmark].tgz files.
There are 3 bugs that sometimes happen during the `task.py ... embedding train ...` step. I documented these in the issues of cmu-db/dbgym.