croesuslab / RCTGAN

This package implements the RC-TGAN method, which generates synthetic data from a relational database.

No of rows mismatched in samples #15

Open poorneshwaran opened 1 month ago

poorneshwaran commented 1 month ago

Hi, I've recently been exploring this model to generate synthetic data. The shapes of the real data and the sampled data do not match. For example, the real table tables['atom'] has shape 6568x3, but the sampled table has shape 6140x3. I expected the real and the sampled (synthetic) tables to have the same shape.

mohamedgy commented 1 month ago

Hi, In a parent-child table relationship (like 'molecule' and 'atom'), each 'molecule' row has a number of associated 'atom' rows, called its cardinality. The total number of 'atom' rows is the sum of all 'molecule' cardinalities.

When generating synthetic data, RCTGAN predicts these cardinalities based on other 'molecule' features. Since these predictions might differ from real cardinalities, the synthetic 'atom' table may have a different size than the real one.
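To make the arithmetic concrete, here is a small pandas sketch. It does not call RCTGAN at all: the toy tables, the `molecule_id` column name, and the "predicted" cardinalities are made up for illustration. It only shows that the child table size is the sum of the parent cardinalities, so any difference in the predicted cardinalities changes the number of child rows.

```python
import pandas as pd

# Toy parent and child tables (hypothetical data and column names).
molecule = pd.DataFrame({'molecule_id': [1, 2, 3]})
atom = pd.DataFrame({
    'atom_id': range(6),
    'molecule_id': [1, 1, 1, 2, 2, 3],
})

# Cardinality of each parent row = number of child rows that reference it.
real_cardinality = (
    atom.groupby('molecule_id').size()
    .reindex(molecule['molecule_id'], fill_value=0)
)
real_total = int(real_cardinality.sum())  # same as len(atom)

# Made-up stand-in for cardinalities predicted by a model.
predicted_cardinality = pd.Series([2, 2, 1], index=molecule['molecule_id'])
synthetic_total = int(predicted_cardinality.sum())

print(real_total, synthetic_total)
```

In this toy case the real child table has 6 rows while the predicted cardinalities add up to 5, which is the same kind of gap as 6568 versus 6140 rows.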

poorneshwaran commented 1 month ago

Hi, I'm trying multi-table synthetic generation, and I want to generate a specific, different number of rows in each table. I passed the hyperparameters like this:

    hyper = {'sub': {'num-samples': 100}, 'mh': {'num-samples': 100}, 'test': {'num-samples': 200}}
    model = RCTGAN(metadata, hyper)
    model.fit(tables)

But it fails to generate the requested number of rows.

mohamedgy commented 1 month ago

Hi,

As I mentioned earlier, the size of child tables is directly influenced by the size of their parent tables. If you need to make the child tables smaller, you can reduce the size of the corresponding parent tables during the training process. However, it's important to note that you won't have precise control over the exact size of the child tables.
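One way to read this suggestion is to subsample the parent table before training and keep only the child rows that still have a parent. The sketch below does this with plain pandas. The helper name `subsample_parent_and_children`, the toy tables, and the `molecule_id` key are assumptions for illustration, not part of RCTGAN.

```python
import pandas as pd

def subsample_parent_and_children(parent, child, key, n_parents, seed=0):
    """Keep n_parents random parent rows and only the child rows that reference them."""
    small_parent = parent.sample(n=n_parents, random_state=seed)
    small_child = child[child[key].isin(small_parent[key])]
    return small_parent.reset_index(drop=True), small_child.reset_index(drop=True)

# Toy tables (hypothetical data and column names).
molecule = pd.DataFrame({'molecule_id': [1, 2, 3, 4]})
atom = pd.DataFrame({
    'atom_id': range(8),
    'molecule_id': [1, 1, 2, 2, 2, 3, 4, 4],
})

small_molecule, small_atom = subsample_parent_and_children(
    molecule, atom, key='molecule_id', n_parents=2
)
print(len(small_molecule), len(small_atom))
```

The number of child rows that survive depends on which parents are drawn, so this controls the child table size only indirectly.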

For information on how to use the hyperparameters, you can check out this resource: https://github.com/croesuslab/RCTGAN?tab=readme-ov-file#3-hyperparameters-configuration

Let me know if you have any other questions!

poorneshwaran commented 1 month ago

Thanks for the info, but when I add constraints it doesn't seem to work well. I defined the constraint like this:

    Custom_constraint_smo = CustomConstraint(columns=['su_ques1', 'su_ques2'], is_valid=is_valid)

and passed it to the metadata:

    metadata.add_table(
        name='sub',
        data=tables['sub'],
        primary_key='subjectid',
        fields_metadata=sub_data_info,
        constraints=[Custom_constraint_smo],
    )

The generated values are wrong. My condition is: if su_ques1 is yes, su_ques2 should contain some number; otherwise it should be 0.
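For reference, the condition described above could be written as a row-wise validity check in plain pandas. This is only a sketch under assumptions: it treats "some number" as any non-missing value, compares su_ques1 to the string 'yes' case-insensitively, and receives the whole table at once. Whether the constraint class calls `is_valid` with the whole table or with one column at a time should be checked against the constraint documentation, since the thread does not show it.

```python
import pandas as pd

def is_valid(table_data: pd.DataFrame) -> pd.Series:
    """Return True for rows that satisfy the su_ques1 / su_ques2 rule."""
    answered_yes = table_data['su_ques1'].astype(str).str.lower() == 'yes'
    has_number = table_data['su_ques2'].notna()  # assumption: any non-missing value counts
    is_zero = table_data['su_ques2'] == 0
    # 'yes' rows need a number; all other rows need exactly 0.
    return (answered_yes & has_number) | (~answered_yes & is_zero)

# Toy rows (made up) to check the rule outside the model.
sample = pd.DataFrame({
    'su_ques1': ['yes', 'no', 'no', 'yes'],
    'su_ques2': [3, 0, 5, None],
})
validity = is_valid(sample)
print(validity.tolist())
```

Running the check on the real table first (every real row should be valid) and then on the sampled table can help separate a problem in the rule itself from a problem in how the constraint is applied.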