SundareshSankaran / SDG---SMOTE-Synthetic-Data-Generation

Synthetic Minority Oversampling TEchnique (SMOTE) Synthetic Data Generation

This custom step helps you generate synthetic data based on an input table, using the Synthetic Minority Oversampling TEchnique (SMOTE). SMOTE is an oversampling technique that generates new observations in the neighborhood of closely associated original observations.

SMOTE is an alternative approach to Generative Adversarial Networks (GANs) for generating synthetic tabular data. Access to synthetic data helps you make better, data-informed decisions when your data is imbalanced, scant, of poor quality, unobservable, or restricted.
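
To make the neighborhood idea concrete, here is a minimal Python sketch of the general SMOTE technique: pick an original point, find its k nearest neighbors, and place a new point somewhere on the line segment toward one of them. This is only an illustration of the concept, not the implementation used by the smote.smoteSample CAS action; the function name smote_sketch and the sample data are hypothetical.

```python
import math
import random


def smote_sketch(points, k=5, n_samples=10, seed=None):
    """Illustrative SMOTE: interpolate between a point and one of its k nearest neighbors."""
    if len(points) < 2:
        raise ValueError("need at least two points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(points) - 1)
    synthetic = []
    for _ in range(n_samples):
        i = rng.randrange(len(points))
        base = points[i]
        # Rank every other point by Euclidean distance to the base point.
        others = [p for j, p in enumerate(points) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class observations with two numeric columns.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8), (1.3, 1.0)]
new_points = smote_sketch(minority, k=3, n_samples=4, seed=42)
```

Because each synthetic point lies between two existing observations, it stays inside the region already covered by the original data.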

A general idea

This animated GIF provides a basic idea:

[Animated GIF: SDG - SMOTE]


Table of Contents

  1. Requirements
  2. Parameters
    1. Input Parameters
    2. Configuration
    3. Output Specification
  3. Run-time Control
  4. Documentation
  5. SAS Program
  6. Installation & Usage
  7. Created/contact
  8. Change Log

Requirements

  1. A SAS Viya 4 environment, preferably monthly stable 2024.03 or later.

  2. A Visual Data Mining and Machine Learning (VDMML) license, usually provided with Viya Enterprise or higher.

  3. An active SAS Cloud Analytic Services (CAS) connection during runtime.

  4. The smote.smoteSample CAS action requires Python configuration, as specified in SAS documentation. Work with your SAS administrator to have this configured. Ensure the following:

    1. Python 3.9.x (dependent packages don't run on higher versions)
    2. sas-ipc-queue, version 0.7.0 or later
    3. hnswlib

Parameters


Input Parameters

  1. Input table (input port, required): connect a CAS table to the input port.

  2. Nearest neighbors (numeric stepper, default 5): select the number of nearest neighbors the SMOTE algorithm uses as the basis for identifying candidate synthetic points.

  3. Input columns (column selector): select all inputs for the SMOTE process, including the class column and any nominal columns.

  4. Nominal variables (column selector): select any nominal variables you wish to use. Nominal variables must also be in the input columns list.

  5. Select a class column (column selector, optional): select a column if you wish to use SMOTE to balance or augment a level within the class column. Choose this column judiciously, since a column with a high number of levels may slow down the process or even cause it to fail. The class column must also be in the input columns list.

  6. Class to augment (drop-down list, values from class column if selected): select the level of the class variable you wish to augment. The values shown here depend on the data in the class column, so the list may take time to populate, depending on the size of the data and the number of levels.
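
The rule that nominal variables and the class column must also appear among the input columns can be checked before running the step. The following Python sketch shows one way to express that check; check_selection and the column names are hypothetical and are not part of the custom step.

```python
def check_selection(input_columns, nominal_columns=(), class_column=None):
    """Return a list of problems with a column selection (empty if none)."""
    inputs = set(input_columns)
    problems = []
    for name in nominal_columns:
        if name not in inputs:
            problems.append(f"nominal column '{name}' is not in the input columns")
    if class_column is not None and class_column not in inputs:
        problems.append(f"class column '{class_column}' is not in the input columns")
    return problems


# Hypothetical column names, for illustration only.
ok = check_selection(["age", "income", "region", "churn"], ["region"], "churn")
bad = check_selection(["age", "income"], ["region"], "churn")
```

Here ok is an empty list, while bad reports both the missing nominal column and the missing class column.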


Configuration

  1. Number of threads (numeric stepper, optional): most of the time, you do not need to modify this. Change it only if you specifically need to control the number of threads the process runs in.

  2. Select a seed (numeric field, optional): specify a seed number to establish some level of reproducibility in the results. A seed does not completely guarantee identical results.

  3. Select extrapolation factor: specify a number (double) to use as a standard deviation in order to perturb (add noise or randomness to) the input data boundaries.
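
The seed and the extrapolation factor rest on two general ideas: a seed makes random draws repeatable, and a standard deviation controls how much noise is added. The Python sketch below illustrates only those ideas, under the assumption of zero-mean Gaussian noise. How the smote.smoteSample action actually applies the extrapolation factor is defined in the SAS documentation, and the perturb function here is hypothetical.

```python
import random


def perturb(values, factor, seed=None):
    """Add zero-mean Gaussian noise with standard deviation `factor` to each value."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, factor) for v in values]


values = [10.0, 20.0, 30.0]
run_a = perturb(values, factor=0.5, seed=123)
run_b = perturb(values, factor=0.5, seed=123)  # same seed, same noise
unchanged = perturb(values, factor=0.0, seed=123)  # zero factor adds no noise
```

In this single-threaded sketch, run_a and run_b are identical. The custom step itself states only that a seed gives some level of reproducibility.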


Output Specification

  1. Number of synthetic observations (numeric field): specify the number of synthetic observations you would like in the output table.

  2. Output table (output port, optional): attach a CAS table to the output port to hold results.
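
If your goal is to balance a class column, one possible way to choose the number of synthetic observations is to take the gap between the largest level and the level being augmented. The Python sketch below shows that arithmetic; observations_to_balance and the sample labels are hypothetical, and other targets (for example, partial balancing) are equally valid.

```python
from collections import Counter


def observations_to_balance(class_values, class_to_augment):
    """Number of synthetic rows needed for the chosen level to match the largest level."""
    counts = Counter(class_values)
    if class_to_augment not in counts:
        raise ValueError(f"level {class_to_augment!r} not found in class column")
    return max(counts.values()) - counts[class_to_augment]


# Hypothetical class column: 950 "no" rows and 50 "yes" rows.
labels = ["no"] * 950 + ["yes"] * 50
needed = observations_to_balance(labels, "yes")  # 900
```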


Run-time Control

Note: Run-time control is optional. You may choose whether or not to execute the main code of this step, based on upstream conditions set by earlier SAS programs. These include nodes that run before this custom step in a SAS Studio Flow, or an earlier program in the same session.

Refer to this blog (https://communities.sas.com/t5/SAS-Communities-Library/Switch-on-switch-off-run-time-control-of-SAS-Studio-Custom-Steps/ta-p/885526) for more details on the concept.

The following macro variable,

_smt_run_trigger

is initialized with a value of 1 by default, indicating an "enabled" status and allowing the custom step to run.

If you wish to control execution of this custom step, include code in an upstream SAS program to set this variable to 0. This "disables" execution of the custom step.

To "disable" this step, run the following code upstream:

%global _smt_run_trigger;
%let _smt_run_trigger =0;

To "enable" this step again, run the following (it's assumed that this has already been set as a global variable):

%let _smt_run_trigger =1;

IMPORTANT: Be aware that disabling this step means that none of its main execution code will run, and any downstream code which was dependent on this code may fail. Change this setting only if it aligns with the objective of your SAS Studio program.


Documentation

  1. SAS documentation for the smote.smoteSample CAS action.

  2. PyPi page for sas-ipc-queue

  3. PyPi page for hnswlib


SAS Program

Refer here for the SAS program used by the step. You may find this useful when you wish to execute this step, with minor modifications, through interfaces other than SAS Studio Custom Steps, such as the SAS Extension for Visual Studio Code.


Installation & Usage


Created/contact:

Acknowledgements to others for their help on details, testing or exploring the area: