OpenGen / GenSQL.inference

Apache License 2.0

Hyper-parameters are deterministically initialized before structure learning #20

Open Schaechtle opened 2 years ago

Schaechtle commented 2 years ago

Column hyper-parameters are all initialized to the same value and are never updated before the MCMC kernels are applied for inference. Once the Gibbs kernel for column hyper-parameters is applied, the hyper-parameters are sampled from an empirical grid prior (the grid is initialized from data within the individual primitives, e.g. in the Gaussian primitive GPM). This process requires access to the data. The function that initializes a CrossCat model for model building doesn't have access to the data, so it can't do any better at initialization (it doesn't know the grid to sample from). That's not ideal, as it makes inference for model building harder than it needs to be.
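
To make the data dependence concrete, here is a minimal Python sketch of a grid-based Gibbs update for a single hyper-parameter. It is not the GenSQL.inference code. It assumes a simplified Gaussian column where each cluster has its own mean drawn from N(m, tau2), the observation variance sigma2 is known, and only the prior mean m is resampled. The names `empirical_grid`, `log_weight` and `gibbs_step_m`, and the linear min-to-max grid, are hypothetical.

```python
import math
import random


def empirical_grid(values, n_grid=30):
    """Linear grid spanning the observed data range (assumed grid shape)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [lo]
    step = (hi - lo) / (n_grid - 1)
    return [lo + i * step for i in range(n_grid)]


def log_weight(m, clusters, sigma2=1.0, tau2=1.0):
    """Log marginal likelihood of the cluster means as a function of m.

    With x_i ~ N(mu_k, sigma2) and mu_k ~ N(m, tau2), the sample mean of a
    cluster of size n is distributed N(m, sigma2 / n + tau2).
    """
    total = 0.0
    for xs in clusters:
        if not xs:
            continue  # empty clusters carry no information about m
        n = len(xs)
        xbar = sum(xs) / n
        var = sigma2 / n + tau2
        total += -0.5 * math.log(2 * math.pi * var) - 0.5 * (xbar - m) ** 2 / var
    return total


def gibbs_step_m(clusters, grid, rng):
    """Resample m from its conditional, restricted to the grid points."""
    logw = [log_weight(m, clusters) for m in grid]
    top = max(logw)  # subtract the max for numerical stability
    weights = [math.exp(lw - top) for lw in logw]
    return rng.choices(grid, weights=weights, k=1)[0]


rng = random.Random(0)
clusters = [[1.0, 1.2, 0.8], [5.1, 4.9]]
data = [x for xs in clusters for x in xs]
grid = empirical_grid(data, n_grid=10)  # cannot be built without the data
m = gibbs_step_m(clusters, grid, rng)
```

The point of the sketch is the call to `empirical_grid`: without the column's data there is no grid, which is why the initializer currently has nothing to sample from.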

There are three ways to fix this:

  1. During calls to incorporate (for incorporating rows), the code could call the hyper-parameter inference kernel for each row. That's what Python-CGPM does. On the face of it, this seems unnecessarily inefficient.
  2. One could construct the initial XCat model from types and data (rather than from types alone, as is done now). Then the grid would exist during initialization, and you could sample the initial hyper-parameters from it.
  3. One could create initial hyper-grids, derived from looking at n different datasets. Those should eventually get replaced by empirical hyper-grids from the current data. This would help with demos of sequential inference, i.e. incorporating row by row and doing inference after each incorporation.
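
A rough Python sketch of how options 2 and 3 could combine at initialization, under stated assumptions: `init_hypers`, `linear_grid` and `DEFAULT_GRID` are hypothetical names, the fallback grid values are placeholders rather than values derived from real datasets, and only one hyper-parameter (`m`) per column is shown.

```python
import random

# Hypothetical fallback grid for option 3. Real values would come from
# surveying several datasets, which this sketch does not do.
DEFAULT_GRID = [-10.0, -1.0, 0.0, 1.0, 10.0]


def linear_grid(values, n_grid=30):
    """Linear grid spanning the observed data range (assumed grid shape)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [lo]
    step = (hi - lo) / (n_grid - 1)
    return [lo + i * step for i in range(n_grid)]


def init_hypers(columns, data=None, rng=None):
    """Draw one initial hyper-parameter per column.

    columns: list of column names.
    data: optional dict mapping column name to a list of floats
          (None marks a missing cell).
    """
    rng = rng or random.Random()
    hypers = {}
    for col in columns:
        observed = [v for v in (data or {}).get(col, []) if v is not None]
        # Option 2: use the empirical grid when data is available.
        # Option 3: otherwise fall back to the default grid.
        grid = linear_grid(observed) if observed else DEFAULT_GRID
        hypers[col] = {"m": rng.choice(grid)}
    return hypers


hypers = init_hypers(
    ["age", "height"],
    data={"age": [20.0, 40.0, None]},
    rng=random.Random(1),
)
```

Here `age` starts from a value inside its observed range, while `height`, which has no data yet, starts from the default grid and would switch to an empirical grid once rows are incorporated.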

Schaechtle commented 2 years ago

Re 1: we need to clarify whether to use insert or incorporate, and which kernels are called for each.