avsolatorio / REaLTabFormer-Experiments

This repository contains the materials of the experiments conducted for the REaLTabFormer paper.
MIT License

REaLTabFormer-Experiments


Experiment design

First, we identify the datasets to test in the experiments. For each dataset, we generate N random train-test splits, one per random seed. We structure the data directory as follows:

- data
    |- input
        |- data_id
            |- split-<seed_1>
            |- split-<seed_2>
            |- split-<seed_3>
            |- split-<seed_...>
            |- split-<seed_N>
    |- models
        |- model_id
            |- data_id
                |- trained_model
                    |- split-<seed_1>
                    |- split-<seed_2>
                    |- split-<seed_3>
                    |- split-<seed_...>
                    |- split-<seed_N>
                |- samples
                    |- split-<seed_1>
                    |- split-<seed_2>
                    |- split-<seed_3>
                    |- split-<seed_...>
                    |- split-<seed_N>
                |- checkpoints
                    |- split-<seed_1>
                    |- split-<seed_2>
                    |- split-<seed_3>
                    |- split-<seed_...>
                    |- split-<seed_N>
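The layout above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's `scripts/split_train_test.py`: the helper names (`split_indices`, `split_dir`), the 80/20 ratio, the seed values, and the reading of `split-<seed>` as `split-` followed by the seed number are all assumptions made for the example.

```python
import random
from pathlib import Path


def split_indices(n_rows, seed, test_frac=0.2):
    """Return (train_idx, test_idx) for one seeded random split."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # seeded, so the split is reproducible
    n_test = int(round(n_rows * test_frac))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def split_dir(root, data_id, seed):
    """Path of one split under <root>/input, following the tree above."""
    return Path(root) / "input" / data_id / f"split-{seed}"


if __name__ == "__main__":
    seeds = [1029, 1030, 1031]  # hypothetical seeds, N = 3
    for seed in seeds:
        train_idx, test_idx = split_indices(n_rows=100, seed=seed)
        print(split_dir("data", "data_id", seed), len(train_idx), len(test_idx))
```

Keying every directory by seed means a trained model, its samples, and its checkpoints can always be traced back to the exact split they came from.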

Special installation notes

We install be_great from GitHub because the pip package does not yet include the fix for the random IndexError raised in great_utils.py:_convert_text_to_tabular_data at td[values[0]] = [values[1]].

pipenv install -e git+https://github.com/kathrinse/be_great@main#egg=be_great

Environment

Install Miniconda, then create a Python environment and install pipenv in it.

conda create --name py39 python=3.9
conda activate py39
pip install pipenv

Data sources

The following are the sources of the datasets used in the experiments:

Data summary

Variables annotated with ^ are categorical.

Adult Income Dataset

HELOC Dataset

Travel Customers Dataset

Predict Diabetes Dataset

Mobile Price Dataset

Oil Spill Dataset

Customer Personality Dataset

Generating dataset splits

pipenv run python scripts/split_train_test.py

Generating the datasets

We benchmark our model for standard tabular data on the datasets used in the TabDDPM paper (https://arxiv.org/abs/2209.15421).

The paper's GitHub repo specifies how to download the datasets used in that paper.

conda activate tddpm
cd $PROJECT_DIR
wget "https://www.dropbox.com/s/rpckvcs3vx7j605/data.tar?dl=0" -O data.tar
tar -xvf data.tar

We can also train their models by creating a conda environment as specified below (taken from their repo).

export REPO_DIR=/path/to/the/code
cd $REPO_DIR

conda create -n tddpm python=3.9.7
conda activate tddpm

# if the following commands do not succeed, update conda
conda env config vars set PYTHONPATH=${PYTHONPATH}:${REPO_DIR}
conda env config vars set PROJECT_DIR=${REPO_DIR}

pip install torch==1.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

conda deactivate
conda activate tddpm

For reproducibility, we store a copy of the data dump. A copy of this project repo is also synced to Google Drive, which lets us link the data and use it on Google Colab.

Under the project directory, we clone the tab-ddpm repo and download the data as specified. Because the datasets in the dump are stored as NumPy objects, we implemented a method that recreates the datasets and matches, as closely as possible, the train-val-test data in the dump.

For this data generation, the relevant scripts used are:

:warning: pandas and joblib are used to pickle and store the DataFrames. When loading the data, use the same pandas version that was used to pickle it (pandas==1.5.2); otherwise, an error may be raised.
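The version requirement above can be checked explicitly before any pickled DataFrame is loaded. The following is a minimal sketch using only the standard library. The helper names `versions_match` and `check_pandas` are hypothetical and not part of this repository. The strict equality check is also an assumption, since nearby patch releases may load the data fine.

```python
from importlib import metadata

EXPECTED_PANDAS = "1.5.2"  # version stated in the note above


def versions_match(installed, expected=EXPECTED_PANDAS):
    """Compare version strings exactly; pickles can break across releases."""
    return installed.strip() == expected.strip()


def check_pandas(expected=EXPECTED_PANDAS):
    """Raise before loading if the installed pandas differs from `expected`."""
    try:
        installed = metadata.version("pandas")
    except metadata.PackageNotFoundError:
        raise RuntimeError("pandas is not installed") from None
    if not versions_match(installed, expected):
        raise RuntimeError(
            f"pandas=={installed} found, but the data was pickled "
            f"with pandas=={expected}"
        )
    return installed
```

Calling `check_pandas()` once at the top of a loading script, before the first `joblib.load(...)`, turns an obscure unpickling error into a clear message about the version mismatch.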

Batch size

We set the target batch size for datasets with training size:

A smaller batch size on small datasets may help training.

Training the models

# Train and generate samples
# Bizon Server GPU0
export EXP_VERSION=0.0.1

CUDA_VISIBLE_DEVICES=0 python scripts/pipeline_realtabformer.py --config exp/cardio/realtabformer/${EXP_VERSION}/config.toml --train && \
CUDA_VISIBLE_DEVICES=0 python scripts/pipeline_realtabformer.py --config exp/cardio/realtabformer/${EXP_VERSION}/config.toml --sample --gen_batch=512 && \

CUDA_VISIBLE_DEVICES=0 python scripts/pipeline_realtabformer.py --config exp/gesture/realtabformer/${EXP_VERSION}/config.toml --train && \
CUDA_VISIBLE_DEVICES=0 python scripts/pipeline_realtabformer.py --config exp/gesture/realtabformer/${EXP_VERSION}/config.toml --sample --gen_batch=512 && \

CUDA_VISIBLE_DEVICES=0 python scripts/pipeline_realtabformer.py --config exp/miniboone/realtabformer/${EXP_VERSION}/config.toml --train && \
CUDA_VISIBLE_DEVICES=0 python scripts/pipeline_realtabformer.py --config exp/miniboone/realtabformer/${EXP_VERSION}/config.toml --sample --gen_batch=512

# Bizon Server GPU1
export EXP_VERSION=0.0.1

CUDA_VISIBLE_DEVICES=1 python scripts/pipeline_realtabformer.py --config exp/fb-comments/realtabformer/${EXP_VERSION}/config.toml --train && \
CUDA_VISIBLE_DEVICES=1 python scripts/pipeline_realtabformer.py --config exp/fb-comments/realtabformer/${EXP_VERSION}/config.toml --sample --gen_batch=512 && \

CUDA_VISIBLE_DEVICES=1 python scripts/pipeline_realtabformer.py --config exp/house/realtabformer/${EXP_VERSION}/config.toml --train && \
CUDA_VISIBLE_DEVICES=1 python scripts/pipeline_realtabformer.py --config exp/house/realtabformer/${EXP_VERSION}/config.toml --sample --gen_batch=512 && \

CUDA_VISIBLE_DEVICES=1 python scripts/pipeline_realtabformer.py --config exp/higgs-small/realtabformer/${EXP_VERSION}/config.toml --train && \
CUDA_VISIBLE_DEVICES=1 python scripts/pipeline_realtabformer.py --config exp/higgs-small/realtabformer/${EXP_VERSION}/config.toml --sample --gen_batch=512

# Other: https://colab.research.google.com/drive/1bkspGMSimJntE1zBGZsKv3t7RyjSlL28
# if [ `basename "$PWD"` = "REaLTabFormer-Experiments" ]; then echo "hello"; fi
export EXP_VERSION=0.0.1

python scripts/pipeline_realtabformer.py --config exp/churn2/realtabformer/${EXP_VERSION}/config.toml --train && \
python scripts/pipeline_realtabformer.py --config exp/churn2/realtabformer/${EXP_VERSION}/config.toml --sample && \

python scripts/pipeline_realtabformer.py --config exp/diabetes/realtabformer/${EXP_VERSION}/config.toml --train && \
python scripts/pipeline_realtabformer.py --config exp/diabetes/realtabformer/${EXP_VERSION}/config.toml --sample && \

python scripts/pipeline_realtabformer.py --config exp/insurance/realtabformer/${EXP_VERSION}/config.toml --train && \
python scripts/pipeline_realtabformer.py --config exp/insurance/realtabformer/${EXP_VERSION}/config.toml --sample && \

python scripts/pipeline_realtabformer.py --config exp/abalone/realtabformer/${EXP_VERSION}/config.toml --train && \
python scripts/pipeline_realtabformer.py --config exp/abalone/realtabformer/${EXP_VERSION}/config.toml --sample && \

python scripts/pipeline_realtabformer.py --config exp/wilt/realtabformer/${EXP_VERSION}/config.toml --train && \
python scripts/pipeline_realtabformer.py --config exp/wilt/realtabformer/${EXP_VERSION}/config.toml --sample && \

python scripts/pipeline_realtabformer.py --config exp/buddy/realtabformer/${EXP_VERSION}/config.toml --train && \
python scripts/pipeline_realtabformer.py --config exp/buddy/realtabformer/${EXP_VERSION}/config.toml --sample && \

python scripts/pipeline_realtabformer.py --config exp/california/realtabformer/${EXP_VERSION}/config.toml --train && \
python scripts/pipeline_realtabformer.py --config exp/california/realtabformer/${EXP_VERSION}/config.toml --sample && \

python scripts/pipeline_realtabformer.py --config exp/adult/realtabformer/${EXP_VERSION}/config.toml --train && \
python scripts/pipeline_realtabformer.py --config exp/adult/realtabformer/${EXP_VERSION}/config.toml --sample
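The commands above follow one pattern per dataset: train, then sample, with an optional GPU pin and `--gen_batch` value. The sketch below shows how that pattern could be generated instead of written out by hand. The helpers `build_commands` and `gpu_env` are hypothetical and not part of this repository. Only the script path, the config path layout, and the flags are taken from the commands above.

```python
import os
import shlex

PIPELINE = "scripts/pipeline_realtabformer.py"


def build_commands(data_id, exp_version, gen_batch=None):
    """Return the train and sample commands for one dataset as argv lists."""
    config = f"exp/{data_id}/realtabformer/{exp_version}/config.toml"
    train = ["python", PIPELINE, "--config", config, "--train"]
    sample = ["python", PIPELINE, "--config", config, "--sample"]
    if gen_batch is not None:
        sample.append(f"--gen_batch={gen_batch}")
    return [train, sample]


def gpu_env(gpu=None):
    """Copy of the environment, pinned to one GPU when `gpu` is given."""
    env = dict(os.environ)
    if gpu is not None:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return env


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for data_id in ["cardio", "gesture", "miniboone"]:
        for argv in build_commands(data_id, "0.0.1", gen_batch=512):
            print("CUDA_VISIBLE_DEVICES=0", shlex.join(argv))
```

To execute rather than print, each `argv` can be passed to `subprocess.run(argv, env=gpu_env(0), check=True)`. With `check=True` the run stops at the first failing command, which mirrors the `&&` chaining used in the shell version.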