amazon-science / tabsyn

Official implementation of "Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space"
Apache License 2.0

Hyperparameters for TabDDPM #15

Open ZinebSN opened 4 months ago

ZinebSN commented 4 months ago

Hello @hengruizhang98, I am getting large errors for column-wise density estimation when evaluating TabDDPM on some datasets (Beijing and Magic). Did you use different hyperparameters for this model on different datasets? It would be very helpful if you could share the set of hyperparameters used for each model-dataset combination. Thanks a lot!
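For context, by column-wise density error I mean something along these lines: a Kolmogorov-Smirnov statistic for numerical columns and a total variation distance for categorical ones. This is only a minimal sketch with placeholder DataFrames; the repo's evaluation script may compute the metric differently.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def column_density_errors(real: pd.DataFrame, syn: pd.DataFrame, cat_cols):
    """Per-column density error: KS statistic for numerical columns,
    total variation distance for categorical columns (lower is better)."""
    errors = {}
    for col in real.columns:
        if col in cat_cols:
            # Total variation distance between the category frequency tables.
            p = real[col].value_counts(normalize=True)
            q = syn[col].value_counts(normalize=True)
            support = p.index.union(q.index)
            errors[col] = 0.5 * np.abs(
                p.reindex(support, fill_value=0) - q.reindex(support, fill_value=0)
            ).sum()
        else:
            # Kolmogorov-Smirnov statistic between the empirical CDFs.
            errors[col] = ks_2samp(real[col].dropna(), syn[col].dropna()).statistic
    return errors
```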

ZinebSN commented 4 months ago

For GReaT, the synthetic data in the "occupation" column has a value of 'Local-gov', but this category doesn't exist in the training set. Did you just ignore this category in this feature when applying one-hot encoding for the quality evaluation (alpha-precision and beta-recall)?
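In case it helps, one way to handle this is to fit the encoder on the training categories only and silently drop unseen values, e.g. with scikit-learn's `handle_unknown='ignore'`. This is a minimal sketch (assuming scikit-learn >= 1.2; `train_df`, `syn_df`, and `cat_cols` are placeholders), not necessarily what the repo's evaluation code does.

```python
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the training categories only; categories that appear
# only in the synthetic data (e.g. 'Local-gov') map to an all-zero row
# instead of raising an error. train_df, syn_df, cat_cols are placeholders.
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoder.fit(train_df[cat_cols])

train_onehot = encoder.transform(train_df[cat_cols])
syn_onehot = encoder.transform(syn_df[cat_cols])  # unseen categories -> zeros
```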

hengruizhang98 commented 4 months ago

> Hello @hengruizhang98, I am getting large errors for column-wise density estimation when evaluating TabDDPM on some datasets (Beijing and Magic). Did you use different hyperparameters for this model on different datasets? It would be very helpful if you could share the set of hyperparameters used for each model-dataset combination. Thanks a lot!

We simply use the default hyperparameters of TabDDPM for all datasets.

hengruizhang98 commented 4 months ago

> For GReaT, the synthetic data in the "occupation" column has a value of 'Local-gov', but this category doesn't exist in the training set. Did you just ignore this category in this feature when applying one-hot encoding for the quality evaluation (alpha-precision and beta-recall)?

When splitting the data, please make sure that all categories exist in the training set. If not, you can split the data with another seed.
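A small helper along these lines can automate that check (a minimal sketch; `df`, `cat_cols`, and the split ratio are placeholders rather than the repo's actual preprocessing code):

```python
from sklearn.model_selection import train_test_split

def split_with_full_category_coverage(df, cat_cols, test_size=0.1, max_tries=50):
    """Try seeds until every category of every categorical column
    appears at least once in the training split."""
    for seed in range(max_tries):
        train, test = train_test_split(df, test_size=test_size, random_state=seed)
        if all(set(train[c]) == set(df[c]) for c in cat_cols):
            return train, test, seed
    raise RuntimeError("No seed kept all categories in the training split.")
```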

ZinebSN commented 4 months ago

@hengruizhang98 Thanks for your replies. I am having issues reproducing some of the TabDDPM results (for Beijing, for example) using the default hyperparameters. Would it be possible to rerun them on your end and share the results you get?