cmu-db / ottertune

The automatic DBMS configuration tool

fix GPR and scaler #418

Closed bohanjason closed 4 years ago

bohanjason commented 4 years ago

1. Make the likelihood variance and the kernel's lengthscale and variance non-trainable.

   Without this fix, things should be fine in most cases because these trainable variables change only slightly during training. In some cases, however, it can lead to an issue like this: #416

2. Add Xmin and Xmax when fitting the X scaler, which makes the results more stable.
3. When aggregating target data, collect all data with the same workload name from every session within the same project. For example, if sessions A and B both have the workload tpcc, data from both sessions is aggregated as training data.
4. Fix the performance test.
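
The scaler change can be illustrated with a minimal pure-Python sketch. It is not the code from this PR: it uses min-max scaling as a stand-in for whatever scaler the analysis code fits, and the names `fit_scaler_with_bounds` and `scale` are hypothetical. The idea it shows is that appending the known lower and upper bounds (Xmin and Xmax) to the observed rows before fitting makes the scaling independent of which samples happen to have been observed.

```python
def fit_scaler_with_bounds(rows, x_min, x_max):
    """Fit per-column (low, high) ranges on the observed rows plus the bounds.

    Hypothetical helper: x_min and x_max are the known per-column limits.
    """
    augmented = [list(r) for r in rows] + [list(x_min), list(x_max)]
    columns = list(zip(*augmented))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return lows, highs


def scale(rows, lows, highs):
    """Map each value into [0, 1] using the fitted per-column ranges."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


# Two small sets of observations that cover different parts of the range.
batch_a = [[2.0, 10.0], [4.0, 30.0]]
batch_b = [[2.0, 10.0], [6.0, 90.0]]
x_min, x_max = [0.0, 0.0], [8.0, 100.0]

# With the bounds included, both fits produce the same ranges, so the
# shared point [2.0, 10.0] is scaled identically in both cases.
ranges_a = fit_scaler_with_bounds(batch_a, x_min, x_max)
ranges_b = fit_scaler_with_bounds(batch_b, x_min, x_max)
point_a = scale([[2.0, 10.0]], *ranges_a)[0]
point_b = scale([[2.0, 10.0]], *ranges_b)[0]
```

Without the two bound rows, each batch would yield its own ranges and the same configuration would land at different scaled coordinates from one fit to the next.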

dvanaken commented 4 years ago

Nice catch with the @autoflow decorator. It looks like GPflow deprecated that decorator back in April: https://github.com/GPflow/GPflow/blob/dda8c39889acb6edf237613663ccbcb09888970d/doc/source/notebooks/gpflow2_upgrade_guide.md#L119

GPflow's GPR model class has changed a lot over the years. Can you double-check whether we still need to subclass it to cache the Cholesky matrix? If not, we can delete ottertune/server/analysis/gpr/gprc.py and avoid issues like this in the future.

bohanjason commented 4 years ago

I don't think we need to cache the Cholesky matrix. In fact, we cannot cache it, because the matrix is updated during training. I will remove the compute_cache function in gprc.py.

If the likelihood variance and the kernel's lengthscale and variance were not trainable, the matrix would not change and we could cache it. In the GPflow implementation, however, these variables are trainable, so the matrix cannot be cached.
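
To illustrate why the cache is only valid for fixed hyperparameters, here is a minimal pure-Python sketch. It is not the gprc.py code and does not use GPflow; it assumes an RBF kernel on 1-D inputs, and the function names are hypothetical. The Cholesky factor is computed from the kernel matrix plus the likelihood (noise) variance on the diagonal, so it is a function of exactly the three hyperparameters discussed above.

```python
import math


def rbf_kernel(xs, lengthscale, variance):
    """Kernel matrix of an RBF kernel on 1-D inputs (assumed kernel choice)."""
    return [
        [variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2) for b in xs]
        for a in xs
    ]


def cholesky(matrix):
    """Lower-triangular factor L with L * L^T equal to the input matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def gpr_cholesky(xs, lengthscale, kernel_variance, likelihood_variance):
    """Factor of K + likelihood_variance * I for the given hyperparameters."""
    k = rbf_kernel(xs, lengthscale, kernel_variance)
    for i in range(len(xs)):
        k[i][i] += likelihood_variance
    return cholesky(k)


xs = [0.0, 1.0, 2.5]

# Hyperparameters held fixed: recomputing gives the same factor,
# so a cached copy stays valid.
cached = gpr_cholesky(xs, 1.0, 1.0, 0.1)
refit_same = gpr_cholesky(xs, 1.0, 1.0, 0.1)

# Lengthscale changed (as training could do if it were trainable):
# the factor differs, so a cached copy would be stale.
refit_changed = gpr_cholesky(xs, 2.0, 1.0, 0.1)
```

Comparing `cached` with `refit_same` and `refit_changed` shows the two cases: an unchanged factor when the hyperparameters are fixed, and a different one as soon as any of them moves.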

dvanaken commented 4 years ago

Do you still need any of the functionality in gprc.py? I think you can remove it completely and import GPflow's GPR class directly.

bohanjason commented 4 years ago

I don't need it. I will remove it.

bohanjason commented 4 years ago

@dvanaken I think we can keep gprc.py. It makes debugging easier: for example, we can print the Cholesky matrix and other intermediate results. If we use GPflow's GPR class directly, we cannot print intermediate results. Any comments?

bohanjason commented 4 years ago

It seems that the model does not work in the simulation environment. Although it works for the tpcc workload in my experiment, I am worried that it will not work for more complex workloads or cases.

If the simulation numbers are correct, we cannot leave the likelihood variance and the kernel's lengthscale and variance as trainable variables. In the GPflow implementation, however, they are trainable. I am trying to figure out whether we can make them static in GPflow.

bohanjason commented 4 years ago

Changed the likelihood variance and the kernel's lengthscale and variance to non-trainable. The cache is therefore still needed.

bohanjason commented 4 years ago

@dvanaken, this is ready to be merged. Thanks.