Issue of NaN values in synthetic data

akashsonowal commented 2 years ago

Hi team!

I am getting NaN values in some columns of the synthetic data, and I have not been able to find the root cause.

Is there anything I can do to prevent this?

zhao-zilong commented 2 years ago

Hi, I think the reason is that your original data contains empty cells. Our code learns that and produces similar data, which means the output also contains empty cells, i.e., NaN.
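
A quick way to check this is to count the missing cells per column before training. The sketch below uses pandas only; the helper name `missing_report` and the inline sample are placeholders, not part of the project. Note that `read_csv` treats empty fields and strings such as "NA" as missing by default, so a file can look complete in a spreadsheet and still load with NaN.

```python
import io

import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing cells per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Hypothetical sample standing in for the real CSV file.
sample = io.StringIO(
    "account_no,withdrawal,deposit\n"
    "1,100,\n"    # empty deposit field is read as NaN
    "2,NA,50\n"   # the string "NA" is read as NaN by default
    "3,0,20\n"
)
df = pd.read_csv(sample)
report = missing_report(df)
print(report)
```

If the report is empty for the real file, the NaN values in the synthetic output most likely come from somewhere other than the input data.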

akashsonowal97 commented 1 year ago

Hi zhao!

Thanks for your response. I would like to share the issues in a little more detail.

I have run several experiments with different datasets, and my observations are listed below:

1. No NaN values are present in the original CSV files, but NaN values still show up in the synthetic data output.
2. When no mixed columns are provided, the code crashes and throws an error.
3. For a bank transactions dataset, I tried a few combinations of general, mixed, and integer columns, as described below.

We have four features: account number, withdrawal, deposit, and transaction status. The withdrawal and deposit columns may contain 0, because a transaction is either a withdrawal or a deposit.

When withdrawal and deposit are given as both mixed columns and general columns, the code returns NaN.
When withdrawal and deposit are given as both mixed columns and integer columns, I get results.

I have a few questions, and it would be great to get some clarity on them.

Why does the code throw an error when I do not provide any mixed columns? I see this very frequently: even when the dataset contains no mixed columns, I have to specify some mixed columns to get the code to run.

Could you please let me know how you interpret mixed columns, general columns, integer columns, log columns, and non-categorical columns?

How have you handled mixed columns?

zhao-zilong commented 1 year ago

Hi, I don't know if you can set a column to both mixed type and general transform. What type of columns do you have? Of course you don't need to have mixed-type data in your dataset, so there must be something wrong. General columns are used when (1) your categorical column is high dimensional, or (2) your continuous column follows a single Gaussian distribution. For the others, you need to check our CTAB-GAN+ paper; it would be too long to write here.
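
Until it is clear whether a column can carry more than one role, a small pre-flight check can flag overlaps before training. The sketch below is not part of CTAB-GAN+; the function name `find_role_overlaps` and the role names used as dictionary keys are assumptions that simply mirror the terms used in this thread.

```python
from itertools import combinations


def find_role_overlaps(roles):
    """Map each pair of roles to the sorted columns they share; pairs with no overlap are omitted."""
    overlaps = {}
    for (name_a, cols_a), (name_b, cols_b) in combinations(roles.items(), 2):
        shared = sorted(set(cols_a) & set(cols_b))
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# The two setups described above, written with hypothetical role keys.
nan_setup = {
    "mixed": ["withdrawal", "deposit"],
    "general": ["withdrawal", "deposit"],
    "integer": [],
}
working_setup = {
    "mixed": ["withdrawal", "deposit"],
    "general": [],
    "integer": ["withdrawal", "deposit"],
}

print(find_role_overlaps(nan_setup))
print(find_role_overlaps(working_setup))
```

Both setups overlap, but only the mixed plus general pairing was reported to produce NaN, so that pair is the one worth ruling out first.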

Team-TUD / CTAB-GAN-Plus, issue #3