udao
Current code base
Dataset
A TPCH trace with 100K data points (released). The query plan information is maintained in a graph data structure, while other features and objectives are stored in a tabular DataFrame.
A query plan's graph structure carries two kinds of information: the graph topology and the operator features.
When the operator feature is just the operator type, many data points share the same graph features. Therefore, we save all distinct graph topologies in a DGLGraph list and index them with dgl_id in the tabular data.
A dataset with more sophisticated operator features:
When the operator features go beyond the operator type, data points become more diverse in the feature space, and saving every distinct graph (topology plus features) as its own DGLGraph is not memory-friendly. Therefore, we separate the storage of the graph topologies and the operator features in three steps (a minimal sketch in code follows the list):
We first maintain all the distinct graph structures in a DGLGraph list.
For each graph structure g, we store all the operator features of g in a 3D array node_feat_group, with dimensions [# of data points with the structure g, g.number_of_nodes(), dimension of the operator features].
We then add 'CH1_FEATS': ['dgl_id', 'vid'] to the tabular data, where dgl_id is the index of the graph topology and vid is the index of the data point within node_feat_group.
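A minimal sketch of this layout; the names graphs and fetch_graph are illustrative, not the library's actual API:

import dgl
import numpy as np
import pandas as pd
import torch

# all distinct graph topologies, indexed by dgl_id
graphs = [dgl.graph(([0, 1], [1, 2])),        # topology 0: a 3-operator chain
          dgl.graph(([0, 1, 2], [2, 2, 3]))]  # topology 1: a small join tree

# one 3D array per topology:
# [# data points with this topology, # nodes, operator-feature dim]
node_feat_group = [np.zeros((500, graphs[0].number_of_nodes(), 8)),
                   np.zeros((300, graphs[1].number_of_nodes(), 8))]

# tabular data references the graph channel via 'CH1_FEATS'
df = pd.DataFrame({"dgl_id": [0, 0, 1], "vid": [0, 1, 0]})

def fetch_graph(row):
    # rebuild the full graph of one data point from its tabular row
    g = graphs[row["dgl_id"]]
    g.ndata["feat"] = torch.from_numpy(node_feat_group[row["dgl_id"]][row["vid"]])
    return g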
Separate code files construct the graph data and tabular data from the raw traces. However, this part can be dropped from our Python library, since different datasets may require different construction methods; instead, we plan to use the same dataset structure to unify the training and optimization processes.
Model
We have implemented various model architectures in PyTorch, including MLP, GTN, GCN, GATv2, and more. Find these implementations here.
We have designed a complete pipeline for model training, which includes the following steps (a minimal sketch follows the list):
Model setup
Data setup
Training component setup (e.g., training optimizer, log, loss function, etc.)
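A condensed sketch of these three steps with a generic PyTorch model; all components below are placeholders, not our actual pipeline:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# 1. Model setup
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

# 2. Data setup
X, y = torch.randn(1024, 16), torch.randn(1024, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# 3. Training component setup (optimizer, loss function, logging), then the loop
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for epoch in range(10):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.4f}")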
Optimization
Currently, the model submodule in this MOO module is isolated from our modeling part. One way to connect them is to make all the built-in models from the modeling part inherit from BaseModel in base_model.py.
To demonstrate the capabilities of the MOO module, we provide 3-4 separate examples that include closed-form models, GPR models, and tiny neural networks.
Coding work to be done
We aim to integrate our code into a Python library called "udao," making it accessible for users to install and utilize through a simple pip install udao command. The udao library is designed to offer three core modules:
Data Processing Module (from udao import dataset)
Modeling Module (from udao import model)
Optimization Module (from udao import moo)
We summarize the coding work into three categories.
Data Processing Module
Design and implement a Dataset class to support multi-channel feature inputs with two types of data structures and the necessary meta information (a skeleton sketch follows the list). The Dataset class maintains
graph data, maintained in dgl.DGLGraph
The query plan as a graph topology, e.g., g = dgl.graph((src_ids, dst_ids))
The operator features in g.ndata["feat"]
tabular data, maintained in pandas.DataFrame
The graph id (e.g., with columns ["gid", "fid"])
The input meta information (e.g., with columns ["input_records", "input_bytes"])
The machine system states (e.g., with columns ["m1", "m2"])
The configuration (e.g., with columns ["k1", "k2"])
the meta information
num_of_data: the total number of data points.
all_tfeat_cols: the column names of all tabular features.
all_ofeat_cols: the column names of the operator features.
tfeat_dict: a dict of the column names of different tabular feature types. E.g., {"I": ["col1", "col2"], "M": ["m1", "m2"], "C": ["k1", "k2"]}
ofeat_dict: a dict of feature indices of the operator features in the graph. E.g., {"type": [0], "cbo": [1, 2], "predicates": [3, 4, 5]}
other APIs
An API to declare the features to be used for training. E.g., dataset.Dataset.declare(tfeats=["col1", "col2", "k1", "k2"], ofeats=["type", "cbo"])
An internal API to fetch the corresponding graph data given a data point in the tabular row.
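A skeleton of such a Dataset class under the naming above; the method bodies and the active_* attributes are illustrative, not the final API:

from dataclasses import dataclass
import pandas as pd

@dataclass
class Dataset:
    graphs: list        # distinct graph topologies (dgl.DGLGraph)
    df: pd.DataFrame    # tabular features, including ["gid", "fid"]
    tfeat_dict: dict    # e.g., {"I": [...], "M": [...], "C": [...]}
    ofeat_dict: dict    # e.g., {"type": [0], "cbo": [1, 2]}

    @property
    def num_of_data(self):
        return len(self.df)

    @property
    def all_tfeat_cols(self):
        return [c for cols in self.tfeat_dict.values() for c in cols]

    def declare(self, tfeats, ofeats):
        # select the tabular columns and operator features used for training
        self.active_tfeats, self.active_ofeats = tfeats, ofeats

    def _fetch_graph(self, i):
        # internal: return the graph topology of the i-th tabular row
        return self.graphs[self.df.iloc[i]["gid"]]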
Implement 2-3 built-in datasets by integrating our existing datasets.
Provide a toy example of adding a customized Dataset.
Implement an API to auto-load a built-in or customized dataset, e.g., d = dataset.load("TPCH")
Implement an API for the data preprocessing pipeline (sketched after the list), including
train/val/test split
drop unnecessary columns
convert categorical features to dummy vectors or integers
feature augmentation (if needed)
feature normalization
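A sketch of this pipeline with scikit-learn-style helpers; the function preprocess and its column arguments are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def preprocess(df, drop_cols, cat_cols, num_cols):
    # drop unnecessary columns and dummy-encode categoricals up front,
    # so that all splits share the same columns
    df = pd.get_dummies(df.drop(columns=drop_cols), columns=cat_cols)
    # train/val/test split (80/10/10)
    train, rest = train_test_split(df, test_size=0.2, random_state=42)
    val, test = train_test_split(rest, test_size=0.5, random_state=42)
    # feature augmentation (if needed) would go here
    # normalize with statistics from the training split only
    scaler = MinMaxScaler().fit(train[num_cols])
    for split in (train, val, test):
        split[num_cols] = scaler.transform(split[num_cols])
    return train, val, test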
Modeling Module
Design an abstract class ModelWrapper to provide the necessary APIs to seamlessly integrate a model with MOO (a sketch follows the list).
An abstract method def initialize(self, *args) to initialize the model.
An abstract method def load(self, *args) to set the model weights, either by loading from given knowledge or by fitting from scratch.
(Optional) An abstract method def fit(self, dataset, loss, hps, *args) to train the model with the provided dataset, loss function, and training hyperparameters.
An abstract method def predict(self, obj, config, *args) to obtain the value of the target objective given a configuration.
Other abstract methods if needed.
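A sketch of the abstract class following the method list above; the docstrings paraphrase the list, and fit is left as an optional override:

from abc import ABC, abstractmethod

class ModelWrapper(ABC):
    @abstractmethod
    def initialize(self, *args):
        """Build the underlying model."""

    @abstractmethod
    def load(self, *args):
        """Set model weights from given knowledge or by fitting from scratch."""

    def fit(self, dataset, loss, hps, *args):
        """(Optional) Train the model on dataset with loss and hyperparameters hps."""
        raise NotImplementedError

    @abstractmethod
    def predict(self, obj, config, *args):
        """Return the value of objective obj under configuration config."""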
Implement built-in models, including
AVG-MLP (averaging the operator features to embed the query plan)
GTN-MLP (using GTN to embed the query plan)
Implement built-in model wrappers by integrating our built-in models.
An API to fetch a built-in ModelWrapper, e.g., m = model.fetch("udao-GTN")
Optimization Module
Refactor the current code base to have (one possible shape is sketched after the list):
a class named Variables to wrap each variable in the optimization problem.
a class named Configuration to define the set of all tunable variables.
a class named Objective to define an objective and specify its optimization direction.
a class named Constraint to define specific constraints in the optimization problem.
a class named Solution to hold a configuration and the corresponding set of objective values.
a class named ParetoOptimalSet that encompasses several Pareto-optimal solutions.
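One possible shape for these classes as plain dataclasses; the field names are assumptions, and the direction encoding follows the example at the end of this document:

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Variables:                     # wraps one tunable variable
    name: str
    vtype: str                       # "int", "float", "float_vector", ...
    lower: float
    upper: float

@dataclass
class Configuration:                 # the set of all tunable variables
    variables: List[Variables]

@dataclass
class Objective:
    name: str
    direction: str                   # "-1" to minimize, "1" to maximize
    func: Optional[Callable] = None  # predictive function, set later

@dataclass
class Constraint:
    func: list                       # constraint functions; [] if none

@dataclass
class Solution:
    configuration: Configuration
    objective_values: List[float]

@dataclass
class ParetoOptimalSet:
    solutions: List[Solution]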
Implement a pipeline to support end-to-end optimization with the following procedures.
Define variables (supporting integers, floats, float vectors, etc.)
Define objectives (including the predictive function for the objective and the optimization direction)
Define constraints
Run the optimization
Recommend Pareto-optimal solutions
Select one solution with the weighted Utopia-nearest method or user-defined preferences.
Refactor other utility functionality for the module
the MOO recommendation methods, including the weighted Utopia-nearest method and user-defined preferences (a minimal sketch follows the list).
visualization of the Pareto-optimal solutions (2D and 3D)
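For instance, the weighted Utopia-nearest recommendation could look as follows; a minimal sketch, assuming the Pareto front is given as a NumPy array with all objectives to be minimized:

import numpy as np

def weighted_utopia_nearest(pareto_objs, weights):
    # pareto_objs: (n_solutions, n_objectives); weights: user preferences
    # Utopia point: the best (smallest) value of each objective over the set
    utopia = pareto_objs.min(axis=0)
    # normalize each objective to [0, 1] so distances are comparable
    span = pareto_objs.max(axis=0) - utopia
    span[span == 0] = 1.0
    normed = (pareto_objs - utopia) / span
    dist = np.sqrt((weights * normed ** 2).sum(axis=1))
    return int(dist.argmin())

# e.g., a latency/cost trade-off with equal preferences -> picks index 1
objs = np.array([[1.0, 9.0], [3.0, 3.0], [9.0, 1.0]])
print(weighted_utopia_nearest(objs, np.array([0.5, 0.5])))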
The end-to-end usage of udao
Input/Output Diagram
The I/O of udao is as follows: given a dataset and a user-defined optimization problem, we shall be able to use the moo module to solve the MOO problem.
A Desired Example
A desired way to use the moo package:
from udao import dataset, model, moo
# 1. Dataset definition.
d = dataset.load("TPCH") # or declare a customized dataset
# 2. Problem definition.
# (1) define the variables inside `Configuration` based on spark_knob.json
x = moo.Configuration(meta="spark_knob.json", source=d)
# (2) define the objectives
o1 = moo.Objective(name="latency", direction="-1")
o2 = moo.Objective(name="cost", direction="-1")
# (3) define the constraints if any
c = moo.Constraint(func=[]) # in our case, there are no external constraints
# 3. Solving details
# (1) the model choice:
mw = model.fetch("udao-GTN") # fetch a built-in ModelWrapper (mw)
o1.set_predictive_function(func=mw.predict, obj_name="latency")
o2.set_predictive_function(func=mw.predict, obj_name="cost")
# (2) the algorithm and solver for MOO
moo_algo = "pf-ap"
moo_solver = "mogd"
# (3) the return preferences
return_type = "PO-set"
# Calling the model to solve an MOO problem
po_solutions = moo.solve(
    objs=[o1, o2],
    configuration=x,
    constraints=c,
    algo=moo_algo,
    solver=moo_solver,
    return_type=return_type)