caimiao0714 opened this issue 1 year ago
Hello @caimiao0714,
Thank you for the kind words and your interest in the repo. :) From what I understand, your data consists of 2001 data points, each with 5 values (SO4, NO3, NH4, OM, BC). Did I understand correctly?

In such a case, you should only specify the dimension of your data points (i.e., 5 in your case) in the `input_dim` argument of the `VAEConfig` instance. See below a working example adapted from your case, with random values:
```python
from pythae.pipelines import TrainingPipeline
from pythae.models import VAE, VAEConfig
from pythae.trainers import BaseTrainerConfig
import numpy as np
import torch

# dummy dataset
dl_dt = torch.randn(2001, 5)

my_training_config = BaseTrainerConfig(
    output_dir='./',
    num_epochs=5,
    learning_rate=1e-3,
    per_device_train_batch_size=200,
    per_device_eval_batch_size=200,
    train_dataloader_num_workers=2,
    eval_dataloader_num_workers=2,
    steps_saving=20,
    optimizer_cls="AdamW",
    optimizer_params={"weight_decay": 0.05, "betas": (0.91, 0.995)},
    scheduler_cls="ReduceLROnPlateau",
    scheduler_params={"patience": 5, "factor": 0.5}
)

# Set up the model configuration
my_vae_config = VAEConfig(
    input_dim=(5,),  ####### This is what changed from your code #######
    latent_dim=10
)

# Build the model
my_vae_model = VAE(model_config=my_vae_config)

# Build the Pipeline
pipeline = TrainingPipeline(
    training_config=my_training_config,
    model=my_vae_model
)

dl_train_sample = dl_dt[0:1000, :].numpy()
dl_eval_sample = dl_dt[1000:2001, :].numpy()  # remaining rows for evaluation

# Launch the Pipeline
pipeline(
    train_data=dl_train_sample,  # must be a torch.Tensor, np.ndarray or torch Dataset
    eval_data=dl_eval_sample     # must be a torch.Tensor, np.ndarray or torch Dataset
)
```
PS: Do not hesitate to adapt the neural networks you use for the encoder and decoder to make them better suited to tabular data as well.
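For tabular data, the default image-oriented networks can be replaced with small MLPs. Below is a minimal plain-PyTorch sketch of what such networks could look like (the hidden size is an arbitrary choice; to use them with pythae you would wrap each one in a subclass of its `BaseEncoder`/`BaseDecoder` whose `forward` returns a `ModelOutput`):

```python
import torch
import torch.nn as nn

# Sketch only: sizes match the thread's example (5 input columns, 10 latents).

class TabularEncoder(nn.Module):
    def __init__(self, input_dim=5, latent_dim=10, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)       # mean of q(z|x)
        self.log_var = nn.Linear(hidden, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

class TabularDecoder(nn.Module):
    def __init__(self, latent_dim=10, output_dim=5, hidden=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, output_dim)
        )

    def forward(self, z):
        return self.body(z)

enc, dec = TabularEncoder(), TabularDecoder()
mu, log_var = enc(torch.randn(8, 5))   # batch of 8 rows with 5 features
recon = dec(torch.randn(8, 10))        # batch of 8 latent codes
print(mu.shape, log_var.shape, recon.shape)
```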
I hope this helps!
Best,
Clément
Hi Clément,
Thank you! This helps a lot. One more question is about the data-generation step after fitting the model. I notice that the example in the official manual generates new data as pictures (`.png`). Could you give an example where the data are generated as tabular data? Specifically, I would be interested in generating the disentangled tabular data for `dl_train_sample` and `dl_eval_sample` row by row.
Thanks, Miao
Hi @caimiao0714, I am glad to see that my previous comment helped :)
As to the generation of synthetic data, it is indeed performed after training the model. For instance, assuming that you have trained the model as explained in the previous comment, you can generate new synthetic tabular data as follows:
```python
from pythae.models import AutoModel
from pythae.samplers import NormalSampler

# Reload the trained model from the folder where it was stored
trained_model = AutoModel.load_from_folder('VAE_training_2023-03-23_18-25-25/final_model').eval()

# Create the sampler
sampler = NormalSampler(trained_model)

# Launch the sample function
gen_samples = sampler.sample(
    num_samples=100,  # number of samples you want to generate
    return_gen=True   # ask the sampler to return the generated samples
)

print(gen_samples.shape)
```
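Since the generated samples come back with shape `(num_samples, 5)`, they can be treated directly as a table. A small sketch of writing them out as CSV (with a random stand-in array; the column names are taken from the question, and a real `gen_samples` tensor would first need `.detach().numpy()`):

```python
import io
import numpy as np

# Stand-in for the generated samples: 100 rows, 5 columns.
gen_np = np.random.default_rng(0).normal(size=(100, 5))

# Attach the column names from the thread and serialize as CSV text.
cols = ["SO4", "NO3", "NH4", "OM", "BC"]
buf = io.StringIO()
np.savetxt(buf, gen_np, delimiter=",", header=",".join(cols), comments="")
csv_text = buf.getvalue()
print(csv_text.splitlines()[0])  # SO4,NO3,NH4,OM,BC
```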
As to generating disentangled data, did you mean this in the sense of #78 ?
I hope this helps :)
Best,
Clément
Hi Clément,
Thanks for your help in generating samples. This is very useful!
For generating disentangled data, I'm not sure I fully understand issue #78. Let me try to illustrate my point in a simpler way.
**Problem setting.** For the dummy dataset generated by `dl_dt = torch.randn(2001, 5)`, let's assume that it is a tensor with 5 features ($x_1, x_2, \ldots, x_5$), and that I am actually trying to construct a supervised machine learning model for a dependent variable $y$ (`dl_y = torch.randn(2001, 1)`). Let's assume that the supervised model is a simple linear model.

**Why I chose disentanglement learning.** The reason I'm trying to apply disentanglement learning to the dataset `dl_dt` is that the features $x_1, x_2, \ldots, x_5$ are highly correlated, and putting them all in the linear regression will cause multicollinearity. Therefore, I'm trying to use disentanglement learning models to disentangle $x_1, x_2, \ldots, x_5$ into relatively independent features $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_5$ (actually the disentangled features could be any number of features). After that, I could use the disentangled features $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_5$ to predict $y$ (`dl_y`) without the multicollinearity issue.
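As a linear analogue of the decorrelation being described here: PCA already maps correlated features to exactly decorrelated scores, and a VAE latent space plays a similar role with a learned nonlinear mapping. A small numpy sketch with synthetic stand-in data (not the real pollutant values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build 5 highly correlated features from 2 underlying factors
# (a dummy stand-in for SO4, NO3, NH4, OM, BC).
factors = rng.normal(size=(2001, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.05 * rng.normal(size=(2001, 5))

# PCA via SVD of the centered data: the score columns are uncorrelated.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T  # decorrelated scores, shape (2001, 5)

corr_X = np.corrcoef(X, rowvar=False)
corr_Z = np.corrcoef(Z, rowvar=False)
print(np.abs(corr_X - np.eye(5)).max())  # large off-diagonal correlations
print(np.abs(corr_Z - np.eye(5)).max())  # near zero
```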
**Problem with the current code.** At this stage, hopefully you can see the problem with `gen_samples` in your last response. These generated data (`gen_samples`) are not related to the original data `dl_dt` row by row, so they cannot be used to predict $y$ (`dl_y`) in the supervised machine learning models afterward.
I hope that my question and problem are clear.
Thanks, Miao
Hi @caimiao0714,
Sorry for the late reply. From what I understand (tell me if I am wrong), you would like to use a different representation of the input data as input for your supervised model. If so, you can definitely do this using the models available in the library. You can, for instance, use the latent representations of `dl_dt` as inputs of your model. To retrieve the latent representation of your input, you can use the `embed` method:
```python
import torch
from pythae.models import AutoModel

# Reload the trained model
trained_model = AutoModel.load_from_folder('path/to/model').eval()

# Get the embeddings
embeddings = trained_model.embed(torch.from_numpy(dl_train_sample))
```
In such a case, each row of `embeddings` corresponds to the representation in the latent space of the matching row of `dl_train_sample`.
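Once the embeddings are extracted row by row like this, they can be used directly as regressors for the supervised model. A sketch of the downstream linear model with numpy least squares (random stand-ins for the embeddings and `dl_y`; the shapes follow the thread's example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: in practice Z would be trained_model.embed(...) converted to
# numpy, and y would come from dl_y. 1000 rows, 10 latent dimensions.
Z = rng.normal(size=(1000, 10))
y = Z @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

# Ordinary least squares on the latent representation: y ≈ Z_design @ beta
Z_design = np.column_stack([np.ones(len(Z)), Z])  # add an intercept column
beta, *_ = np.linalg.lstsq(Z_design, y, rcond=None)

y_hat = Z_design @ beta
mse = np.mean((y - y_hat) ** 2)
print(beta.shape)  # (11,): intercept + one coefficient per latent dimension
```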
I hope this helps.
Best,
Clément
Hi Clément,
Thanks a lot for the comment. Yes, this works. One additional question I have is how to gain insight into the relationship between the original data and the `embeddings` in the latent space. I tried to use Pearson correlation coefficients to understand the two, but I found little correlation; see the figure below. `BC`, `NH4`, ..., and `SO4` on the x-axis are the original data, and `V0` to `V4` on the y-axis are the latent embeddings.
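For reference, the feature-versus-latent Pearson matrix described above can be computed like this (sketch with random stand-ins; in practice `X` would hold the pollutant columns and `Z` the output of `embed`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the original features and the latent embeddings.
X = rng.normal(size=(1000, 5))  # columns: SO4, NO3, NH4, OM, BC
Z = rng.normal(size=(1000, 5))  # columns: V0 ... V4

# Pearson correlation between every (feature, latent) pair:
# standardize each column, then take the normalized cross-product.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
Zc = (Z - Z.mean(axis=0)) / Z.std(axis=0)
cross_corr = Xc.T @ Zc / len(X)  # shape (5, 5): rows=features, cols=latents
print(cross_corr.shape)
```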
Miao
Hi @caimiao0714,
I am happy to see that this is working. As to the relationship between the latent embeddings and the input data, I am not sure what you are expecting from this. The VAE model embeds the input data in the latent space using potentially highly non-linear functions, so I am not sure you will be able to relate the latent embedding coordinates directly to those of the input data. Nonetheless, you can still try models that specifically target the task of learning disentangled representations, such as the $\beta$-VAE, FactorVAE, or $\beta$-TC-VAE. Maybe those models can be helpful as well.
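For instance, switching to one of those models only changes the config and model classes relative to the earlier example; a sketch with the $\beta$-VAE (the `beta` value here is an arbitrary illustration, and exact signatures should be checked against the installed pythae version):

```python
from pythae.models import BetaVAE, BetaVAEConfig

# Same shape arguments as the earlier VAEConfig; beta > 1 puts extra weight on
# the KL term, pressuring the latent dimensions toward independence.
my_vae_config = BetaVAEConfig(
    input_dim=(5,),
    latent_dim=10,
    beta=4.0,  # illustrative value only
)
my_vae_model = BetaVAE(model_config=my_vae_config)
```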
Best,
Clément
Hi Clément,
Thanks for creating and maintaining this great repo. I'm a biostatistician working on environmental epidemiology (meaning that I'm new to machine learning and my questions may be naive), and I'm trying to tackle the high correlation issue with VAE or disentanglement learning.
My question is quite different (in my view) from those in the example code: the data in my field are tabular datasets with observations in the rows and variables in the columns (2D), while the example data and code in the repo are mostly images (3D). I'm wondering how I could set up the correct dataset form and input dimension for `benchmark_VAE` to work? Please see a small example of the data below. My aim is to reduce the column dimension of this dataset, because the variables (SO4, NO3, NH4, OM, BC) are highly correlated and putting them in one model will cause variance inflation. I wonder how I could set up the right `benchmark_VAE` code to achieve this aim. Currently my code looks like this: But it reported the following error. I guess I did not set up the input datasets and input dimensions correctly. Any ideas would be appreciated.
Thanks, Miao