amazon-science / tabsyn

Official implementation of "Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space"
Apache License 2.0

tabsyn works well in 50 dimensions but cannot be trained to work in 1000 dimensions. #17

Open JunHanStudy opened 4 months ago

JunHanStudy commented 4 months ago

Thank you for your nice project and codebase!

I have trained TabSyn on our dataset, which has 1000 binary features. Most dimensions have a very low mean, i.e., most entries are zero. The losses for both the VAE and the diffusion model are small, but the samples look bad. Using the same code, I trained on only the first 50 of the 1000 dimensions, and those samples look good, which suggests I ran TabSyn correctly. Do you have an idea why this could happen?
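As a quick sanity check, the per-dimension sparsity can be measured directly before training. The snippet below is a generic sketch; the array `X` is a random toy stand-in, not the actual dataset:

```python
import numpy as np

# Hypothetical diagnostic (not from the TabSyn codebase): quantify how
# sparse the binary features are. X is toy stand-in data with ~1% positives.
rng = np.random.default_rng(0)
X = rng.binomial(1, 0.01, size=(10_000, 1000))

col_means = X.mean(axis=0)  # per-dimension positive rate
print(f"median positive rate: {np.median(col_means):.4f}")
print(f"fraction of columns with <1% positives: {(col_means < 0.01).mean():.2f}")
```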

JunHanStudy commented 4 months ago

With the same code, I also trained on the first 100 of the 1000 dimensions, and the samples look good. But when I trained on the first 200 dimensions, the samples start looking bad. Do you have an idea why this happens? Thank you!

hengruizhang98 commented 4 months ago

Hi, thanks for your question! I guess this is because your raw data is already extremely sparse, and our VAE processing has made the latent embeddings even sparser (since it expands the dimension of your data from 1000 to 4000). It might be challenging for the diffusion model to learn from such high-dimensional data.

Some questions/suggestions:

  1. What's the size of your dataset, i.e., how many rows?
  2. Have you tried TabDDPM on your datasets?
  3. Considering that all of the columns are binary categorical variables, I would suggest training the diffusion model directly in the input space, without the VAE.
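One way to realize suggestion 3 is to map each binary column to {-1, +1} (sometimes called "analog bits") and run a continuous diffusion model directly on that representation, skipping the VAE. The sketch below shows only the round-trip encoding; the shapes and data are illustrative assumptions, not TabSyn code:

```python
import numpy as np

def binary_to_analog(x: np.ndarray) -> np.ndarray:
    """Map {0, 1} entries to {-1.0, +1.0} so a continuous diffusion
    model can be trained directly in the input space (no VAE)."""
    return x.astype(np.float32) * 2.0 - 1.0

def analog_to_binary(z: np.ndarray) -> np.ndarray:
    """Threshold continuous samples back to {0, 1} after generation."""
    return (z > 0.0).astype(np.int64)

# Toy stand-in for the sparse 1000-dimensional binary data.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.01, size=(8, 1000))
z = binary_to_analog(x)
assert (analog_to_binary(z) == x).all()  # round trip is exact
```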

JunHanStudy commented 4 months ago

Thank you very much for your reply!

  1. Our training data has 1.5M rows.
  2. TabDDPM works well on our 1000-dimensional dataset. I guess one reason could be that TabDDPM directly learns the categorical distribution, which can work well in the sparse setting.
  3. We will try this. As for the columns with sparse features, even reconstructing all zeros yields a small reconstruction loss in the VAE.
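To make point 3 concrete: on a very sparse feature, even a trivial reconstruction that just outputs the base rate achieves a tiny binary cross-entropy, so a small VAE loss says little about sample quality. A minimal numerical check (the positive rates are illustrative assumptions):

```python
import math

def constant_bce(p_true: float, p_pred: float) -> float:
    """Mean binary cross-entropy when the true positive rate is p_true
    and the model predicts the constant probability p_pred."""
    return -(p_true * math.log(p_pred) + (1 - p_true) * math.log(1 - p_pred))

sparse_loss = constant_bce(0.01, 0.01)  # 1%-positive feature, base-rate guess
balanced_loss = constant_bce(0.5, 0.5)  # balanced feature, base-rate guess
print(f"sparse: {sparse_loss:.3f}, balanced: {balanced_loss:.3f}")
# The sparse loss (~0.056) is far below the balanced one (~0.693 = ln 2),
# even though neither "model" has learned anything about the data.
```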

hengruizhang98 commented 3 months ago

@JunHanStudy Have you solved the issue? If not, would you please share (part of) your dataset so that we can study this further?

JunHanStudy commented 3 months ago

@hengruizhang98 Thank you very much for your follow-up! The performance degradation in high dimensions is not solved yet. We work on healthcare datasets. This link (https://github.com/sczzz3/ehrdiff) has a processed MIMIC dataset in which each dimension is a binary variable. Please let me know if you can use this processed dataset easily. If you want the dataset in the format used in your paper, I can process it for you. I look forward to seeing progress on TabSyn in this domain.

hengruizhang98 commented 3 months ago

Thanks! I will look through it.