croesuslab / RCTGAN

This package implements the RC-TGAN method, which generates synthetic data from a relational database.

Support for LabelEncoding #13

Closed navalchand closed 1 month ago

navalchand commented 1 year ago

Hello,

Thank you for providing the paper and repository. I have a quick question: How can we enforce label encoding for all the categorical columns across all tables in order to avoid the memory-intensive OneHotEncoding?

Ideally, for initializing CTGAN in SDV, we pass the following field_transformers:

field_transformers={ 'col1': 'label_encoding', 'col2': 'label_encoding', 'col3': 'label_encoding' }
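For a multi-table dataset, the mapping above would have to be repeated for every categorical column in every table. A minimal sketch of how such per-table mappings could be generated follows. The helper `label_encoding_transformers` and the toy `tables` structure are hypothetical and not part of SDV or RCTGAN; columns are treated as categorical simply when they contain string values, which is an assumption that may need adjusting (for example, string-valued key columns would also be picked up).

```python
from typing import Any, Dict, List

# Hypothetical in-memory representation: table name -> column name -> values.
Tables = Dict[str, Dict[str, List[Any]]]


def label_encoding_transformers(tables: Tables) -> Dict[str, Dict[str, str]]:
    """Build one field_transformers-style dict per table.

    A column is considered categorical if any of its values is a string
    (an assumption made for this sketch only).
    """
    result: Dict[str, Dict[str, str]] = {}
    for table_name, columns in tables.items():
        result[table_name] = {
            column: "label_encoding"
            for column, values in columns.items()
            if any(isinstance(value, str) for value in values)
        }
    return result


# Toy data, invented for illustration.
tables = {
    "customers": {
        "customer_id": [1, 2, 3],
        "country": ["FR", "CA", "FR"],
    },
    "orders": {
        "order_id": [10, 11],
        "customer_id": [1, 3],
        "status": ["paid", "open"],
        "amount": [9.5, 20.0],
    },
}

transformers = label_encoding_transformers(tables)
print(transformers)
```

Each inner dict has the same shape as the `field_transformers` argument shown above, so it could be handed to whichever per-table model accepts that argument.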

navalchand commented 1 year ago

I tried updating the fit method of RCTGAN to accept field_transformers and pass it on to CTGAN and PC_CTGAN. Despite these changes, training still runs out of memory, which I had hoped this change would fix.

mohamedgy commented 1 year ago

Hello @navalchand,

Thank you for raising this issue. Transforming categorical columns into numerical columns via OneHotEncoding is central to the CTGAN method: the one-hot representation is what allows CTGAN to handle imbalanced categorical columns through its Conditional Generator and Training-by-Sampling. That said, the memory cost of OneHotEncoding is a valid reason to use another transformer, although doing so may reduce the quality of the synthetic data, so there is a trade-off. In the next version, we'll make it possible to set field_transformers from the RCTGAN class.
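To give a rough sense of the memory side of this trade-off, here is a back-of-the-envelope sketch. It assumes the encoded table is held as one dense matrix with 4 bytes per cell; that is an assumption of this example, not a statement about how CTGAN or RCTGAN store data internally. The column names, cardinalities, and row count are invented.

```python
from typing import Dict


def encoded_width(cardinalities: Dict[str, int], encoding: str) -> int:
    """Number of numeric columns produced for the categorical columns."""
    if encoding == "one_hot":
        # One indicator column per distinct category.
        return sum(cardinalities.values())
    if encoding == "label":
        # One integer column per categorical column.
        return len(cardinalities)
    raise ValueError(f"unknown encoding: {encoding!r}")


def estimated_bytes(n_rows: int, width: int, bytes_per_cell: int = 4) -> int:
    """Size of a dense n_rows x width matrix (assumed 4-byte cells)."""
    return n_rows * width * bytes_per_cell


# Hypothetical table: distinct values per categorical column.
cardinalities = {"col1": 5000, "col2": 300, "col3": 12}
n_rows = 1_000_000

one_hot_bytes = estimated_bytes(n_rows, encoded_width(cardinalities, "one_hot"))
label_bytes = estimated_bytes(n_rows, encoded_width(cardinalities, "label"))

print(f"one-hot: {one_hot_bytes / 1e9:.2f} GB")
print(f"label:   {label_bytes / 1e9:.3f} GB")
```

Under these assumptions the one-hot width grows with the total number of distinct categories, while the label-encoded width grows only with the number of columns, which is why high-cardinality columns dominate the memory cost.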