Team-TUD / CTAB-GAN-Plus

Official GitHub for CTAB-GAN+

non_categorical_columns clarification #4

Closed SvenGroen closed 1 year ago

SvenGroen commented 1 year ago

Hey,

I wanted to ask what kind of columns should be included in the "non_categorical_columns" list, as I could not find an explanation. From looking at the code, I would guess that "non_categorical_columns" are "categorical columns that are already numeric (e.g. label encoded)".

Can you confirm that I understood this correctly? If not, could you clarify the purpose?

Cheers, Sven

zhao-zilong commented 1 year ago

Hi Sven,

Very good question; we forgot to document that. To include a column in "non_categorical_columns", you also need to add it to "categorical_columns". I know it sounds weird, and we should change that later. "non_categorical_columns" means that the column is categorical but can be very high-dimensional (many distinct categories), so we treat it as continuous. For columns in "non_categorical_columns", we first encode the categories as numbers and then treat the result as a continuous column (using a variational Gaussian mixture). If you also add the column to "general_transform", it is still encoded as numbers first, but is then handled by "general_transform" instead of the default continuous-column encoding.
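To make the rules above concrete, here is a minimal sketch in plain Python. It is not code from the CTAB-GAN+ repository: `plan_column_handling` and `label_encode` are hypothetical helpers, the column names are made up, and the list names simply mirror the ones quoted in this thread. It only shows which treatment each column would receive according to the explanation above, plus what "encode to numbers" could look like.

```python
def plan_column_handling(categorical_columns, non_categorical_columns,
                         general_transform=()):
    """Hypothetical helper: map each categorical column to its treatment.

    Mirrors the rules described in this thread; it is not the library's API.
    """
    # Rule 1: every "non_categorical" column must also be listed as categorical.
    missing = [c for c in non_categorical_columns if c not in categorical_columns]
    if missing:
        raise ValueError(
            f"also add these columns to categorical_columns: {missing}"
        )

    plan = {}
    for col in categorical_columns:
        if col not in non_categorical_columns:
            # Ordinary categorical column.
            plan[col] = "default categorical encoding"
        elif col in general_transform:
            # Encoded to numbers first, then handled by general_transform.
            plan[col] = "label-encode, then general_transform"
        else:
            # Encoded to numbers first, then treated as continuous.
            plan[col] = "label-encode, then continuous (variational Gaussian mixture)"
    return plan


def label_encode(values):
    """Hypothetical helper: replace each category with an integer code."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Example with made-up column names.
plan = plan_column_handling(
    categorical_columns=["gender", "zip_code", "merchant"],
    non_categorical_columns=["zip_code", "merchant"],
    general_transform=["merchant"],
)
codes, mapping = label_encode(["10115", "80331", "10115"])
```

In this sketch, "gender" stays an ordinary categorical column, "zip_code" is encoded and then modelled as continuous, and "merchant" is encoded and then passed to "general_transform". Listing a column only in "non_categorical_columns" raises an error here to make the "add it to both lists" requirement explicit; whether the library itself reports that case is not stated in this thread.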

Hope this makes it clearer.

Best,

Zilong

SvenGroen commented 1 year ago

Hi Zilong,

Yes, the name is indeed a bit misleading. But your explanation makes total sense. Thanks for the clarification!

Best, Sven