JunHanStudy opened this issue 4 months ago
With the same code, I also trained on the first 100 of the 1000 dimensions, and the samples look good. But when I trained on the first 200 of the 1000 dimensions, the samples started to look bad. Do you have an idea why this could happen? Thank you!
Hi, thanks for your question! I suspect this is because your raw data is already extremely sparse, and our VAE processing makes the latent embeddings even sparser (since it expands the dimension of your data from 1000 to 4000). It may be challenging for the diffusion model to learn from such high-dimensional sparse data.
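To make the sparsity claim above concrete, here is a small hedged sketch (NumPy only; `data` is a synthetic stand-in for the real 1000-dimension binary table, not anything from TabSyn) that measures how sparse such a table is via per-dimension means. For a binary column, the mean is exactly the probability of a 1:

```python
import numpy as np

# Synthetic stand-in for a sparse binary table: 5000 rows, 1000 binary
# dimensions, each entry 1 with probability ~1%. Replace `data` with the
# actual dataset array when reproducing this check.
rng = np.random.default_rng(0)
data = (rng.random((5000, 1000)) < 0.01).astype(np.float32)

dim_means = data.mean(axis=0)         # mean of a binary column = P(x = 1)
overall_density = float(data.mean())  # fraction of nonzero entries overall
rare = int(np.sum(dim_means < 0.05))  # dimensions that are almost always 0
print(f"overall nonzero fraction: {overall_density:.4f}")
print(f"dimensions with mean < 0.05: {rare} / {data.shape[1]}")
```

If nearly all dimensions fall below a small threshold like this, the diffusion model is being asked to match many near-degenerate marginals at once, which is consistent with the failure mode described in this thread.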
Some questions/suggestions:
Thank you very much for your reply!
@JunHanStudy Have you solved the issue? If you haven't, would you please share (part of) your dataset so that we can study it further?
@hengruizhang98 Thank you very much for your follow-up! The performance degradation in high dimensions is still unsolved. We work on healthcare datasets. This link (https://github.com/sczzz3/ehrdiff) has a processed MIMIC dataset; each dimension is a binary variable. Please let me know if you can use this processed dataset easily. If you would like the dataset to be in the format used in your paper, I can process it for you. I look forward to seeing progress with tabsyn in this domain.
Thanks! I will look through it.
Thank you for your nice project and codebase!
I have trained Tabsyn on our dataset, which has 1000 binary dimensions. Most dimensions have a very low mean, meaning most inputs are zero. The losses for both the VAE and the diffusion model are small, but the samples look bad. With the same code, when I train on only the first 50 of the 1000 dimensions, the samples look good, which should indicate that I ran Tabsyn correctly. Do you have an idea why this could happen?
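One hedged way to turn "the samples look bad" into a number for binary data is to compare per-dimension marginal means of real vs. generated samples. The sketch below is self-contained; `real` and `synth` are synthetic stand-in arrays, not the actual dataset or Tabsyn outputs:

```python
import numpy as np

# Stand-in arrays: `real` mimics a sparse binary table (~1% ones),
# `synth` mimics generated samples whose marginals are slightly off (~2%).
# Substitute the real dataset and the sampler's output when using this.
rng = np.random.default_rng(0)
real = (rng.random((5000, 1000)) < 0.01).astype(np.float32)
synth = (rng.random((5000, 1000)) < 0.02).astype(np.float32)

# For binary columns, the mean is P(x = 1), so the per-dimension gap
# directly measures how far each marginal distribution has drifted.
gap = np.abs(real.mean(axis=0) - synth.mean(axis=0))
print(f"max per-dimension mean gap:  {gap.max():.4f}")
print(f"mean per-dimension mean gap: {gap.mean():.4f}")
```

Tracking this gap while scaling from 50 to 100, 200, and 1000 dimensions would show whether the degradation reported in this thread grows gradually or collapses at some dimensionality.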