amazon-science / tabsyn

Official Implementations of "Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space"
Apache License 2.0

CTGAN baseline #12

Closed ZinebSN closed 6 months ago

ZinebSN commented 6 months ago

Hello @hengruizhang98!

Could you please provide more instructions on how to run the CTGAN and TVAE baselines? I can see some arguments for the CTGAN baseline in utils.py, and I am wondering whether you have standardized the original implementation as you did for the other baselines (e.g. goggle and tabddpm).

Thank you!

hengruizhang98 commented 6 months ago

We have the same implementations for CTGAN/TVAE, but we cannot publish them here due to CTGAN's license (to avoid legal disputes). If you are interested, I will upload them to my personal repo (https://github.com/hengruizhang98/tabsyn).

ZinebSN commented 6 months ago

Yes please, this would help! Thank you!

I have another question regarding the adaptation of goggle to tabular data with both numerical and categorical features. I am trying it on some datasets and the loss becomes NaN after a few epochs. Did you encounter this as well? Thanks in advance!

hengruizhang98 commented 6 months ago

We didn't encounter this problem when running goggle. I recommend checking the intermediate values during training to debug it. In our implementation, we use one-hot encoding for categorical variables. A numerical column therefore becomes one node in the graph, while a categorical column with C_i categories becomes C_i nodes, and the values of the categorical columns are sparse.
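To make the node count and the debugging advice concrete, here is a minimal sketch in plain Python. It is not code from the tabsyn or goggle repositories: the helper names (`count_graph_nodes`, `first_non_finite`), the example table shape, and the intermediate-value names such as `kl_term` are all hypothetical and only illustrate the idea.

```python
import math


def count_graph_nodes(num_numerical, category_sizes):
    """Number of graph nodes when categorical columns are one-hot encoded."""
    # Each numerical column is one node; a categorical column with C_i
    # categories contributes C_i nodes.
    return num_numerical + sum(category_sizes)


def first_non_finite(named_values):
    """Return the name of the first entry containing NaN or inf, else None."""
    for name, values in named_values.items():
        if any(not math.isfinite(v) for v in values):
            return name
    return None


# Hypothetical table: 3 numerical columns and categorical columns
# with 2, 5 and 40 categories.
n_nodes = count_graph_nodes(3, [2, 5, 40])

# Hypothetical intermediate values captured at one training step.
step_values = {
    "encoder_output": [0.12, -0.7, 1.3],
    "reconstruction_loss": [0.48],
    "kl_term": [float("nan")],
}
culprit = first_non_finite(step_values)
```

Running a check like `first_non_finite` at every step, in the order the values are computed, points to the earliest quantity that goes non-finite, which is usually more informative than seeing NaN in the final loss.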

I guess a better solution is to transform every column into a latent vector of the same dimension (like the VAE in our TabSyn). Each node would then represent one column, and training might be more stable since the values are continuous.
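A rough sketch of that idea, assuming a simple per-column tokenizer: each numerical value is mapped to a vector by a scale-and-shift, and each categorical value is looked up in a per-column embedding table. This is not the TabSyn VAE; `make_tokenizer`, its parameters, and the random (untrained) weights are hypothetical and stand in for parameters that would be learned in practice.

```python
import random


def make_tokenizer(num_numerical, category_sizes, dim, seed=0):
    """Build a function mapping one table row to one dim-sized vector per column."""
    rng = random.Random(seed)

    def vec():
        # Random placeholder parameters; a real model would learn these.
        return [rng.gauss(0.0, 0.1) for _ in range(dim)]

    # One (weight, bias) pair per numerical column.
    num_params = [(vec(), vec()) for _ in range(num_numerical)]
    # One lookup table per categorical column, with one row per category.
    cat_tables = [[vec() for _ in range(c)] for c in category_sizes]

    def tokenize(numerical_values, category_indices):
        tokens = []
        for x, (w, b) in zip(numerical_values, num_params):
            # Numerical column: scalar value times weight vector, plus bias.
            tokens.append([x * wi + bi for wi, bi in zip(w, b)])
        for idx, table in zip(category_indices, cat_tables):
            # Categorical column: dense embedding instead of a sparse one-hot.
            tokens.append(list(table[idx]))
        return tokens

    return tokenize


# Hypothetical row: 2 numerical columns, 2 categorical columns (3 and 4 categories).
tokenize = make_tokenizer(num_numerical=2, category_sizes=[3, 4], dim=8)
tokens = tokenize([0.5, -1.2], [2, 0])
```

With this layout the graph has one node per column (4 here) regardless of how many categories each categorical column has, and every node carries a dense vector of the same size.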

hengruizhang98 commented 6 months ago

@ZinebSN

The code for CTGAN and TVAE has been uploaded here: https://github.com/hengruizhang98/tabsyn/tree/ctgan. Feel free to use it :).

ZinebSN commented 6 months ago

Thank you :)