cmu-db / dbgym

An end-to-end research vehicle for studying self-driving DBMSs.
MIT License

feat: add support for generating training data (via HypoPG) to train embeddings in Proto-X. #7

Closed wangpatrick57 closed 7 months ago

wangpatrick57 commented 7 months ago

This PR adds functionality for generating the training data that Proto-X needs. It introduces the protox embedding datagen module.

Example invocation:

python task.py --no-startup-check protox embedding datagen tpch --connection-str "host=localhost port=5432 dbname=benchbase user=admin" --override-sample-limits "lineitem,32768"
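As an illustration of the --override-sample-limits value shown above, here is a minimal sketch of how a string such as "lineitem,32768" could be turned into per-table limits. It assumes the value is a flat, comma-separated list of table,limit pairs. That format is inferred from the single example in this PR, and parse_sample_limits is a hypothetical helper, not a function from dbgym.

```python
def parse_sample_limits(spec: str) -> dict[str, int]:
    """Parse 'table,limit[,table,limit...]' into {table: limit}.

    Assumption: the flag value is a flat list of table,limit pairs.
    This helper is hypothetical and not part of dbgym.
    """
    # Split on commas and drop empty fragments (e.g. from a trailing comma).
    parts = [p.strip() for p in spec.split(",") if p.strip()]
    if len(parts) % 2 != 0:
        raise ValueError(f"expected table,limit pairs, got: {spec!r}")

    limits: dict[str, int] = {}
    # Even positions hold table names, odd positions hold their limits.
    for table, limit in zip(parts[0::2], parts[1::2]):
        limits[table] = int(limit)
    return limits


if __name__ == "__main__":
    print(parse_sample_limits("lineitem,32768"))
```

Rejecting an odd number of fields up front means a malformed value fails loudly instead of silently dropping a table.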

Summary: task.py ... embedding datagen ... (step 1 of the "embedding" stage of Proto-X) now runs without crashing. It also updates the tpch_embedding_traindata.parquet file in [workspace]/data/, so you can run task.py ... embedding train ... immediately afterward without any extra configuration.

Demo: We start from a "clean slate" system (one where [workspace]/data/ does not contain tpch_embedding_traindata.parquet). We then run task.py ... embedding datagen ... followed immediately by task.py ... embedding train ..., with no intermediate steps.
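The "clean slate" precondition described above can be checked with a few lines of Python. This is only a sketch: the workspace root is passed in by the caller because its actual location is not stated in this PR, and traindata_path and is_clean_slate are hypothetical helper names, not part of dbgym.

```python
from pathlib import Path

# File name taken from the PR description.
TRAINDATA_NAME = "tpch_embedding_traindata.parquet"


def traindata_path(workspace: Path) -> Path:
    """Return the expected location of the training data under [workspace]/data/."""
    return workspace / "data" / TRAINDATA_NAME


def is_clean_slate(workspace: Path) -> bool:
    """True if the training-data file has not been generated yet."""
    return not traindata_path(workspace).exists()
```

Running is_clean_slate before datagen and again after it gives a quick confirmation that the datagen step produced the file that the train step expects.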

https://github.com/cmu-db/dbgym/assets/20631215/4d7c40d6-0cd0-441f-ab83-cd4c272443c3

Details