caimiao0714 opened this issue 1 year ago
Hello @caimiao0714,
Thank you for the kind words and your interest in the repo. :) From what I understand, your data consists of 2001 data points, each with 5 values (SO4, NO3, NH4, OM, BC). Did I understand correctly?

In such a case, you should only specify the dimension of your data points (i.e., 5 in your case) in the `input_dim` argument of the `VAEConfig` instance. See below a working example adapted from your case, with random values:
```python
from pythae.pipelines import TrainingPipeline
from pythae.models import VAE, VAEConfig
from pythae.trainers import BaseTrainerConfig
import numpy as np
import torch

# dummy dataset
dl_dt = torch.randn(2001, 5)

my_training_config = BaseTrainerConfig(
    output_dir='./',
    num_epochs=5,
    learning_rate=1e-3,
    per_device_train_batch_size=200,
    per_device_eval_batch_size=200,
    train_dataloader_num_workers=2,
    eval_dataloader_num_workers=2,
    steps_saving=20,
    optimizer_cls="AdamW",
    optimizer_params={"weight_decay": 0.05, "betas": (0.91, 0.995)},
    scheduler_cls="ReduceLROnPlateau",
    scheduler_params={"patience": 5, "factor": 0.5}
)

# Set up the model configuration
my_vae_config = VAEConfig(
    input_dim=(5,),  ####### This is what changed from your code #######
    latent_dim=10
)

# Build the model
my_vae_model = VAE(model_config=my_vae_config)

# Build the Pipeline
pipeline = TrainingPipeline(
    training_config=my_training_config,
    model=my_vae_model
)

dl_train_sample = dl_dt[0:1000, :].numpy()
dl_eval_sample = dl_dt[1000:2001, :].numpy()  # remaining rows for evaluation

# Launch the Pipeline
pipeline(
    train_data=dl_train_sample,  # must be a torch.Tensor, np.ndarray or torch Dataset
    eval_data=dl_eval_sample     # must be a torch.Tensor, np.ndarray or torch Dataset
)
```
PS: Do not hesitate to adapt the neural networks you use for the encoder and decoder to make them better suited to tabular data as well.
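For tabular data, the default image-oriented networks can be replaced with small MLPs. Below is a minimal plain-PyTorch sketch of what such networks could look like (the hidden size is an arbitrary choice; to use them with pythae you would wrap each one in a subclass of its `BaseEncoder`/`BaseDecoder` whose `forward` returns a `ModelOutput`):

```python
import torch
import torch.nn as nn

# Sketch only: sizes match the thread's example (5 input columns, 10 latents).

class TabularEncoder(nn.Module):
    def __init__(self, input_dim=5, latent_dim=10, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)       # mean of q(z|x)
        self.log_var = nn.Linear(hidden, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

class TabularDecoder(nn.Module):
    def __init__(self, latent_dim=10, output_dim=5, hidden=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, output_dim)
        )

    def forward(self, z):
        return self.body(z)

enc, dec = TabularEncoder(), TabularDecoder()
mu, log_var = enc(torch.randn(8, 5))   # batch of 8 rows with 5 features
recon = dec(torch.randn(8, 10))        # batch of 8 latent codes
print(mu.shape, log_var.shape, recon.shape)
```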
I hope this helps!
Best,
Clément
Hi Clément,
Thank you! This helps a lot. One more question is about the data-generation step after fitting the model. I notice that the example in the official manual generates new data as pictures (`.png`). Could you give an example where the data are generated as tabular data? Specifically, I would be interested in generating the disentangled tabular data for `dl_train_sample` and `dl_eval_sample` row by row.
Thanks, Miao
Hi @caimiao0714, I am glad to see that my previous comment helped :)
As to the generation of synthetic data, it is indeed performed after training the model. For instance, assuming that you have trained the model as explained in the previous comment, you can generate new synthetic tabular data as follows:
```python
from pythae.models import AutoModel
from pythae.samplers import NormalSampler

# Reload the trained model from the folder where it was stored
trained_model = AutoModel.load_from_folder('VAE_training_2023-03-23_18-25-25/final_model').eval()

# Create the sampler
sampler = NormalSampler(trained_model)

# Launch the sample function
gen_samples = sampler.sample(
    num_samples=100,  # number of samples you want to generate
    return_gen=True   # ask the sampler to return the generated samples
)

print(gen_samples.shape)
```
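Since the generated samples come back with shape `(num_samples, 5)`, they can be treated directly as a table. A small sketch of writing them out as CSV (with a random stand-in array; the column names are taken from the question, and a real `gen_samples` tensor would first need `.detach().numpy()`):

```python
import io
import numpy as np

# Stand-in for the generated samples: 100 rows, 5 columns.
gen_np = np.random.default_rng(0).normal(size=(100, 5))

# Attach the column names from the thread and serialize as CSV text.
cols = ["SO4", "NO3", "NH4", "OM", "BC"]
buf = io.StringIO()
np.savetxt(buf, gen_np, delimiter=",", header=",".join(cols), comments="")
csv_text = buf.getvalue()
print(csv_text.splitlines()[0])  # SO4,NO3,NH4,OM,BC
```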
As to generating disentangled data, did you mean this in the sense of #78 ?
I hope this helps :)
Best,
Clément
Hi Clément,
Thanks for your help in generating samples. This is very useful!
For generating disentangled data, I'm not sure I fully understand issue #78. Let me try to illustrate my point in a simpler way.
**Problem setting.** For the dummy dataset generated by `dl_dt = torch.randn(2001, 5)`, let's assume that it is a tensor with 5 features ($x_1, x_2, \ldots, x_5$), and that I am actually trying to construct a supervised machine learning model for a dependent variable $y$ (`dl_y = torch.randn(2001, 1)`). Let's assume that the supervised model is a simple linear model.

**Why I chose disentanglement learning.** The reason I'm trying to apply disentanglement learning to the dataset `dl_dt` is that the features $x_1, x_2, \ldots, x_5$ are highly correlated, and putting them all in the linear regression will cause multicollinearity. Therefore, I'm trying to use disentanglement learning models to disentangle $x_1, x_2, \ldots, x_5$ into relatively independent features $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_5$ (actually the disentangled features could be any number of features). After that, I could use the disentangled features $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_5$ to predict $y$ (`dl_y`) without the multicollinearity issue.
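As a linear analogue of the decorrelation being described here: PCA already maps correlated features to exactly decorrelated scores, and a VAE latent space plays a similar role with a learned nonlinear mapping. A small numpy sketch with synthetic stand-in data (not the real pollutant values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build 5 highly correlated features from 2 underlying factors
# (a dummy stand-in for SO4, NO3, NH4, OM, BC).
factors = rng.normal(size=(2001, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.05 * rng.normal(size=(2001, 5))

# PCA via SVD of the centered data: the score columns are uncorrelated.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T  # decorrelated scores, shape (2001, 5)

corr_X = np.corrcoef(X, rowvar=False)
corr_Z = np.corrcoef(Z, rowvar=False)
print(np.abs(corr_X - np.eye(5)).max())  # large off-diagonal correlations
print(np.abs(corr_Z - np.eye(5)).max())  # near zero
```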
**Problem with the current code.** At this stage, hopefully you can see the problem with `gen_samples` in your last response. These generated data (`gen_samples`) are not related to the original data `dl_dt` row by row, so they cannot be used to predict $y$ (`dl_y`) in the supervised machine learning models afterward.
I hope that my question and problem are clear.
Thanks, Miao
Hi @caimiao0714,
Sorry for the late reply. From what I understand (tell me if I am wrong), you would like to use a different representation of the input data as input for your supervised model. If so, you can definitely do this using the models available in the library. You can, for instance, use the latent representations of `dl_dt` as inputs of your model. To retrieve the latent representation of your input, you can use the `embed` method:
```python
import torch
from pythae.models import AutoModel

# Reload the trained model
trained_model = AutoModel.load_from_folder('path/to/model').eval()

# Get the embeddings
embeddings = trained_model.embed(torch.from_numpy(dl_train_sample))
```
In such a case, each row of `embeddings` corresponds to the representation in the latent space of the matching row of `dl_train_sample`.
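Once the embeddings are extracted row by row like this, they can be used directly as regressors for the supervised model. A sketch of the downstream linear model with numpy least squares (random stand-ins for the embeddings and `dl_y`; the shapes follow the thread's example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: in practice Z would be trained_model.embed(...) converted to
# numpy, and y would come from dl_y. 1000 rows, 10 latent dimensions.
Z = rng.normal(size=(1000, 10))
y = Z @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

# Ordinary least squares on the latent representation: y ≈ Z_design @ beta
Z_design = np.column_stack([np.ones(len(Z)), Z])  # add an intercept column
beta, *_ = np.linalg.lstsq(Z_design, y, rcond=None)

y_hat = Z_design @ beta
mse = np.mean((y - y_hat) ** 2)
print(beta.shape)  # (11,): intercept + one coefficient per latent dimension
```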
I hope this helps.
Best,
Clément
Hi Clément,
Thanks a lot for the comment. Yes, this works. One additional question I have is how to gain insight into the relationship between the original data and the `embeddings` in the latent space. I tried to use Pearson correlation coefficients to understand the two, but I found little correlation; see the figure below. `BC`, `NH4`, ..., and `SO4` on the x-axis are the original data, and `V0` to `V4` on the y-axis are the latent embeddings.
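For reference, the feature-versus-latent Pearson matrix described above can be computed like this (sketch with random stand-ins; in practice `X` would hold the pollutant columns and `Z` the output of `embed`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the original features and the latent embeddings.
X = rng.normal(size=(1000, 5))  # columns: SO4, NO3, NH4, OM, BC
Z = rng.normal(size=(1000, 5))  # columns: V0 ... V4

# Pearson correlation between every (feature, latent) pair:
# standardize each column, then take the normalized cross-product.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
Zc = (Z - Z.mean(axis=0)) / Z.std(axis=0)
cross_corr = Xc.T @ Zc / len(X)  # shape (5, 5): rows=features, cols=latents
print(cross_corr.shape)
```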
Miao
Hi @caimiao0714,
I am happy to see that this is working. As to the relationship between the latent embeddings and the input data, I am not sure what you are expecting from this. The VAE model embeds the input data in the latent space using potentially highly non-linear functions, so I am not sure you will be able to relate the latent embedding coordinates directly to those of the input data. Nonetheless, you can still try models that specifically target the task of learning disentangled representations, such as the $\beta$-VAE, FactorVAE, or $\beta$-TC-VAE. Maybe those models can be helpful as well.
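For instance, switching to one of those models only changes the config and model classes relative to the earlier example; a sketch with the $\beta$-VAE (the `beta` value here is an arbitrary illustration, and exact signatures should be checked against the installed pythae version):

```python
from pythae.models import BetaVAE, BetaVAEConfig

# Same shape arguments as the earlier VAEConfig; beta > 1 puts extra weight on
# the KL term, pressuring the latent dimensions toward independence.
my_vae_config = BetaVAEConfig(
    input_dim=(5,),
    latent_dim=10,
    beta=4.0,  # illustrative value only
)
my_vae_model = BetaVAE(model_config=my_vae_config)
```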
Best,
Clément
Hi Clément,
Thanks for creating and maintaining this great repo. I'm a biostatistician working on environmental epidemiology (meaning that I'm new to machine learning and my questions may be naive), and I'm trying to tackle the high correlation issue with VAE or disentanglement learning.
My question is quite different (in my view) from those in the example code: the data in my field are tabular datasets with observations in the rows and variables in the columns (2D), while the example data and code in the repo are mostly images (3D). I'm wondering how I could set up the correct dataset form and input dimension for `benchmark_VAE` to work? Please see a small example of the data below. My aim is to reduce the column dimension of this dataset, because the variables (SO4, NO3, NH4, OM, BC) are highly correlated and putting them in one model will cause variance inflation. I wonder how I could set up the right `benchmark_VAE` code to achieve this aim. Currently my code looks like this: But it reported the following error. I guess I did not set up the input datasets and input dimensions correctly. Any ideas would be appreciated.
Thanks, Miao