Open YunjinPebblous opened 9 months ago
Hi, thanks for your interest! What do you mean by a description? If it is the dataset information .json file, you have to create it manually.
We will update this repo soon, including detailed instructions on how to use tabsyn on your own dataset.
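For reference, a minimal sketch of what such a dataset information .json might look like. The field names and values below are assumptions based on the example datasets shipped with the repo (e.g. adult); check one of those files for the exact keys your version of the code expects:

```json
{
    "name": "my_dataset",
    "task_type": "binclass",
    "header": "infer",
    "column_names": null,
    "num_col_idx": [0, 2, 4],
    "cat_col_idx": [1, 3],
    "target_col_idx": [5],
    "file_type": "csv",
    "data_path": "data/my_dataset/my_dataset.csv",
    "test_path": null
}
```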
Hi, thanks for the reply. I have an additional question. I only have a .csv dataset, so should the .json file follow the same format as the ones for datasets like magic, adult, and beijing? There is no documented format for creating the .json file on GitHub yet, right? My understanding is that I just need to modify the code in process_data.py to fit my data. Is that correct, and is it only possible if the tabular data has no missing values?
Thank you.
You can create the .json file following the given examples. I think the only required information is the indexes of the numerical and categorical columns (and the target). Your data can have missing values, but you have to preprocess it into the right format; see below in process_data.py:
lines 232-239:

```python
# Map the '?' placeholder to NaN markers in both splits
for col in num_columns:
    train_df.loc[train_df[col] == '?', col] = np.nan
for col in cat_columns:
    train_df.loc[train_df[col] == '?', col] = 'nan'
for col in num_columns:
    test_df.loc[test_df[col] == '?', col] = np.nan
for col in cat_columns:
    test_df.loc[test_df[col] == '?', col] = 'nan'
```
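The convention above can be applied to any custom CSV whose missing values are written as `'?'`. A minimal self-contained sketch (the column names and data here are illustrative, not from the repo):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a custom CSV where missing values appear as '?'
train_df = pd.DataFrame({
    "age": ["25", "?", "40"],                    # numerical column
    "workclass": ["Private", "?", "State-gov"],  # categorical column
})
num_columns = ["age"]
cat_columns = ["workclass"]

# Same convention as process_data.py: numerical gaps become np.nan,
# categorical gaps become the literal string 'nan'
for col in num_columns:
    train_df.loc[train_df[col] == '?', col] = np.nan
for col in cat_columns:
    train_df.loc[train_df[col] == '?', col] = 'nan'
```

After this step, numerical columns can be cast to float (NaN-aware), while categorical columns treat `'nan'` as just another category.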
Thanks. I have an additional question.
When I ran `python eval/eval_density.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]`, I got the following error. How can I fix it?

```
Generating report ...
(1/2) Evaluating Column Shapes: 100%|█| 9/9 [00:00<00:00, 354.90it/s]
(2/2) Evaluating Column Pair Trends: 100%|█| 36/36 [00:00<00:00, 53it/s]
/home/yunjin/.pyenv/versions/3.9.12/lib/python3.9/site-packages/scipy/stats/_stats_py.py:4781:
ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
  warnings.warn(stats.ConstantInputWarning(msg))
Overall Score: 90.01%
Properties:
- Column Shapes: 88.52%
- Column Pair Trends: 91.5%
Generating report ...
(1/2) Evaluating Data Validity: 100%|█| 9/9 [00:00<00:00, 406.12it/s]
(2/2) Evaluating Data Structure: 100%|█| 1/1 [00:00<00:00, 537.32it/s]
Overall Score: 100.0%
Properties:
- Data Validity: 100.0%
- Data Structure: 100.0%
Traceback (most recent call last):
  File "/home/yunjin/tabsyn/eval/eval_density.py", line 115, in
```
I guess this is because you are using a newer version of sdmetrics, which has introduced some new features. You can simply comment out these lines, since the "Coverage" score is not a promising metric.
(Reminder) Hi, I have some questions. The tabsyn .json files specify a target column via target_col_idx. Does this actually designate a column to be predicted, or does it mean synthetic data is generated only for that column?
To summarize: when tabsyn runs, does it generate synthetic data for all columns, or only for the target_col_idx column?
Thank you.
We generate all columns, not just the single target column. The target column index is provided because, in ML, a tabular dataset is usually associated with a regression/classification task, and we have to specify this column for downstream tasks, e.g., Machine Learning Efficiency.
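The "Machine Learning Efficiency" idea mentioned above is typically: train a model on the synthetic table and evaluate it on the real held-out split, using target_col_idx as the label. A hedged sketch of that protocol (the dataset, column split, and choice of classifier here are all illustrative, not the repo's actual evaluation code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_rows(n):
    # Two numerical feature columns; the "target column" is derived
    # from their sum, standing in for the column at target_col_idx.
    X = rng.normal(size=(n, 2))
    y = (X.sum(axis=1) > 0).astype(int)
    return X, y

X_synth, y_synth = make_rows(500)  # stands in for the synthetic table
X_real, y_real = make_rows(200)    # stands in for the real test split

# Train on synthetic rows, test on real rows: a high score means the
# synthetic data preserved the feature-target relationship.
clf = LogisticRegression().fit(X_synth, y_synth)
acc = accuracy_score(y_real, clf.predict(X_real))
print(f"train-on-synthetic, test-on-real accuracy: {acc:.2f}")
```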
I want to use tabsyn with a custom dataset (.csv). How can I obtain a description?