amazon-science / tabsyn

Official Implementation of "Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space"
Apache License 2.0
95 stars · 28 forks

How can I obtain a description? (.csv) #6

Open YunjinPebblous opened 9 months ago

YunjinPebblous commented 9 months ago

I want to use Tabsyn with a custom dataset (.csv). How can I obtain a description for it?

hengruizhang98 commented 9 months ago

Hi, thanks for your interest! What do you mean by a description? If it is the dataset information .json file, you have to create it manually.

We will update this repo soon, including a detailed introduction on how to use Tabsyn with your own dataset.
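
For reference, a hand-written info file for a custom dataset could be produced with a short script like the sketch below. The field names (num_col_idx, cat_col_idx, target_col_idx, etc.), the dataset name, and the paths are modeled on the example JSON files shipped with the repo and are assumptions for illustration; verify them against the provided examples (e.g. adult, magic) before use.

# Minimal sketch of a dataset-info JSON for a custom CSV.
# Field names and values are assumptions modeled on the repo's example
# info files -- check them against the provided datasets before use.
import json

info = {
    "name": "my_dataset",          # hypothetical dataset name
    "task_type": "binclass",       # or "regression" / "multiclass"
    "header": "infer",
    "column_names": None,
    "num_col_idx": [0, 1, 4],      # indexes of numerical columns
    "cat_col_idx": [2, 3],         # indexes of categorical columns
    "target_col_idx": [5],         # index of the target column
    "file_type": "csv",
    "data_path": "data/my_dataset/my_dataset.csv",  # hypothetical path
}

with open("data/Info/my_dataset.json", "w") as f:   # hypothetical path
    json.dump(info, f, indent=4)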

YunjinPebblous commented 9 months ago

Hi, thanks for the reply. I have an additional question. I only have a .csv dataset, so I will base my JSON file on what the JSON files look like for datasets like magic, adult, and beijing. There is no documented format for creating the JSON file on GitHub yet, right? I also understood that I just need to modify the code in process_data.py to fit my data. Is this correct, and is this only possible if the tabular data has no missing values?

Thank you.


hengruizhang98 commented 9 months ago

You can create the json file following the given examples. I think the only required information should be the indexes of the numerical and categorical columns (and the target). Your data can have missing values, but you have to preprocess it into the right format; see below in process_data.py:

line 232-239

# Replace the '?' missing-value marker: NaN for numerical columns,
# the literal string 'nan' for categorical columns.
for col in num_columns:
    train_df.loc[train_df[col] == '?', col] = np.nan
for col in cat_columns:
    train_df.loc[train_df[col] == '?', col] = 'nan'
for col in num_columns:
    test_df.loc[test_df[col] == '?', col] = np.nan
for col in cat_columns:
    test_df.loc[test_df[col] == '?', col] = 'nan'
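
For a custom CSV, the same normalization could be applied before running process_data.py, roughly as sketched below. The file name, the '?' placeholder, the column-index lists, and the train/test split are assumptions for illustration, not the repository's own pipeline.

# Rough sketch: load a custom CSV, split it, and normalize missing-value
# markers the way process_data.py expects. Paths, the '?' placeholder, and
# the column-index lists are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("data/my_dataset/my_dataset.csv")  # hypothetical path

num_columns = df.columns[[0, 1, 4]]   # numerical columns (example indexes)
cat_columns = df.columns[[2, 3]]      # categorical columns (example indexes)

# Simple 90/10 train/test split, as an example.
test_df = df.sample(frac=0.1, random_state=0)
train_df = df.drop(test_df.index)

for split in (train_df, test_df):
    for col in num_columns:
        split.loc[split[col] == '?', col] = np.nan   # numeric missing -> NaN
    for col in cat_columns:
        split.loc[split[col] == '?', col] = 'nan'    # categorical missing -> 'nan' string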
YunjinPebblous commented 9 months ago

Thanks. I have an additional question.

Generating report ...
(1/2) Evaluating Column Shapes: : 100%|█| 9/9 [00:00<00:00, 354.90it/
(2/2) Evaluating Column Pair Trends: : 81%|▊| 29/36 [00:00<00:00, 62
/home/yunjin/.pyenv/versions/3.9.12/lib/python3.9/site-packages/scipy/stats/_stats_py.py:4781: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
  warnings.warn(stats.ConstantInputWarning(msg))
(2/2) Evaluating Column Pair Trends: : 100%|█| 36/36 [00:00<00:00, 53
Overall Score: 90.01%
Properties:
- Column Shapes: 88.52%
- Column Pair Trends: 91.5%

Generating report ...
(1/2) Evaluating Data Validity: : 100%|█| 9/9 [00:00<00:00, 406.12it/
(2/2) Evaluating Data Structure: : 100%|█| 1/1 [00:00<00:00, 537.32it
Overall Score: 100.0%
Properties:
- Data Validity: 100.0%
- Data Structure: 100.0%

Traceback (most recent call last):
  File "/home/yunjin/tabsyn/eval/eval_density.py", line 115, in <module>
    coverages = diag_report.get_details('Coverage')
  File "/home/yunjin/.pyenv/versions/3.9.12/lib/python3.9/site-packages/sdmetrics/reports/base_report.py", line 281, in get_details
    self._validate_property_generated(property_name)
  File "/home/yunjin/.pyenv/versions/3.9.12/lib/python3.9/site-packages/sdmetrics/reports/base_report.py", line 235, in _validate_property_generated
    self._check_property_name(property_name)
  File "/home/yunjin/.pyenv/versions/3.9.12/lib/python3.9/site-packages/sdmetrics/reports/base_report.py", line 210, in _check_property_name
    raise ValueError(
ValueError: Invalid property name 'Coverage'. Valid property names are 'Data Validity', 'Data Structure'.

When I ran python eval/eval_density.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA], I got the error shown above. How can I fix it?


hengruizhang98 commented 9 months ago

I guess this is because you are using a newer version of "sdmetrics", which has introduced some new features. You can simply comment out those lines, since the "Coverage" score is not a promising metric.
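
If you prefer to keep eval_density.py working across both older and newer sdmetrics releases instead of deleting the lines, one option is to guard the lookup, roughly as sketched below. The variable names follow the traceback above; this is an assumption about the surrounding code, not the repository's own fix.

# Sketch: skip the 'Coverage' details when the installed sdmetrics version
# no longer exposes that property (newer DiagnosticReport versions only
# report 'Data Validity' and 'Data Structure' and raise ValueError otherwise).
try:
    coverages = diag_report.get_details('Coverage')
except ValueError:
    coverages = None  # 'Coverage' not available in this sdmetrics version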

YunjinPebblous commented 9 months ago

(Reminder) Hi, I have some questions. Looking at the Tabsyn JSON file, it specifies a target column via target_col_idx. Does this actually designate a column to be predicted, or does it mean synthetic data is generated only for that column?

To summarize, is it correct that Tabsyn generates synthetic data for all columns when it runs, or does it generate synthetic data only for the column given by target_col_idx?

Thank you.


hengruizhang98 commented 9 months ago

We generate all columns, not only the single target column. The target column index is provided because, in machine learning, a tabular dataset is usually associated with a regression/classification task, and we have to specify this column for downstream tasks, e.g., Machine Learning Efficiency.
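
As a rough illustration of that downstream use: after all columns have been generated, one can train a model on the synthetic table and evaluate it on the real test split, using the target column as the label. Everything in the sketch below (file paths, the target index, the scikit-learn model choice) is an assumption for illustration, not the repository's evaluation script.

# Rough "machine learning efficiency"-style check: fit on synthetic data,
# score on real held-out data. Paths, the target index, and the model
# are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

target_col_idx = 5  # hypothetical target column index from the info JSON

synth = pd.read_csv("synthetic/my_dataset/tabsyn.csv")   # hypothetical path
real_test = pd.read_csv("data/my_dataset/test.csv")      # hypothetical path

label = synth.columns[target_col_idx]
X_syn, y_syn = synth.drop(columns=[label]), synth[label]
X_real, y_real = real_test.drop(columns=[label]), real_test[label]

# One-hot encode categoricals jointly so both splits share the same columns.
X_all = pd.get_dummies(pd.concat([X_syn, X_real], keys=["syn", "real"]))
X_syn_enc, X_real_enc = X_all.loc["syn"], X_all.loc["real"]

clf = RandomForestClassifier(random_state=0).fit(X_syn_enc, y_syn)
print("Accuracy on real test data:", accuracy_score(y_real, clf.predict(X_real_enc)))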