jkmackie opened this issue 1 year ago
Great suggestion. I looked into it, but it is quite a bit of work to integrate such an approach. At the moment, all functions (figures, predict, etc.) are designed for single-column univariate input, not multi-column univariate input. I will put this on my never-ending, always-getting-longer todo list.
Thank you for the reply!
In the meantime, here is starter code to run a distfit exploratory data analysis across multiple CPU cores. Pandas DataFrames are used for readability. (The code can easily be tweaked to use NumPy instead.)
The illustration below uses a numeric-only dataset called Company Bankruptcy Prediction. It has 6819 rows and 96 columns.
Note: error handling is required to run distfit on this dataset; certain columns will error out, with or without parallel processing.
import collections
import re

import numpy as np
import pandas as pd
from distfit import distfit
from joblib import Parallel, delayed
from IPython.display import display  # display() assumes a Jupyter/IPython environment

pd.options.display.max_columns = 100

# Numeric-only data from here (sign-in required):
# https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction/download?datasetVersionNumber=2

#----------------------------------------------------------------------------------
# Clean up column names and lower memory.
#----------------------------------------------------------------------------------
df = pd.read_csv("./data.csv")

for c in df.columns:  # strip leading/trailing spaces; collapse inner whitespace to underscores
    no_beg_end_spaces = c.strip()
    result = re.sub(r"\s+", "_", no_beg_end_spaces)
    df.rename(columns={c: result}, inplace=True)

print('df shape:', df.shape)
display(df.tail(3))

for c in df.columns:  # downcast to float32 to lower memory usage
    df[c] = pd.to_numeric(df[c], downcast='float')
#----------------------------------------------------------------------------------
# Use joblib to run distfit on CPU cores in parallel.
#----------------------------------------------------------------------------------
chunks = np.array_split(df, len(df.columns), axis=1)  # one-column chunks due to the univariate constraint
display(chunks[0].head())
display(chunks[1].head())

def get_distfit(chunk):
    """Fit candidate distributions to one column; return the best-fitting name."""
    try:
        dfit = distfit()  # fresh instance per call; avoids sharing one fitted model across workers
        result = dfit.fit_transform(chunk.values.ravel(), verbose=30)  # ravel (n, 1) -> (n,)
        return result['model']['name']
    except Exception:  # some columns fail to fit, with or without parallelism
        return 'ERROR'

with Parallel(n_jobs=-2, prefer="processes") as parallel:  # n_jobs=-2: all cores except one
    results = parallel(delayed(get_distfit)(chunk) for chunk in chunks)

display(list(zip(df.columns, results))[0:5])  # show best distribution by column
display(sorted(collections.Counter(results).items(), key=lambda x: x[1], reverse=True))
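If the full candidate set is too slow, recent distfit versions let you restrict the search via the `distr` argument of the constructor (it accepts 'popular', 'full', or an explicit list of scipy distribution names, as far as I know); the list below is just an illustration:

def get_distfit_fast(chunk):
    # Same as get_distfit, but only tries a short, illustrative list of candidates.
    try:
        dfit = distfit(distr=['norm', 'expon', 'lognorm', 'gamma', 'beta'])
        return dfit.fit_transform(chunk.values.ravel(), verbose=30)['model']['name']
    except Exception:
        return 'ERROR'

with Parallel(n_jobs=-2, prefer="processes") as parallel:
    fast_results = parallel(delayed(get_distfit_fast)(chunk) for chunk in chunks)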
#----------------------------------------------------------------------------------
# Get best distribution one column at a time (slower than parallel run).
#----------------------------------------------------------------------------------
sequential_outputs = []
for chunk in chunks:
    sequential_outputs.append(get_distfit(chunk))

display(list(zip(df.columns, sequential_outputs))[0:5])  # show best distribution by column
display(sorted(collections.Counter(sequential_outputs).items(), key=lambda x: x[1], reverse=True))
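To check that the parallel run actually pays off on your machine, here is a simple wall-clock comparison using only the standard library:

import time

start = time.perf_counter()
with Parallel(n_jobs=-2, prefer="processes") as parallel:
    _ = parallel(delayed(get_distfit)(chunk) for chunk in chunks)
print(f'parallel:   {time.perf_counter() - start:.1f} s')

start = time.perf_counter()
_ = [get_distfit(chunk) for chunk in chunks]
print(f'sequential: {time.perf_counter() - start:.1f} s')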
I recommend that dfit.fit_transform(X) be extended to accept multiple variables, with each variable fitted individually:

matrix rows = samples; matrix columns = features (variables)

The proposed functionality mirrors the popular scikit-learn API. Here is an example of that API: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
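To make the proposal concrete, here is a rough sketch of what such a multi-column wrapper could look like. MultiDistfit, best_names, and the models_ dict are hypothetical names for illustration, not part of the distfit API:

import numpy as np
from distfit import distfit

class MultiDistfit:
    # Hypothetical wrapper: one univariate distfit model per column,
    # mirroring the fit/transform style of scikit-learn estimators.
    def __init__(self, **distfit_kwargs):
        self.distfit_kwargs = distfit_kwargs
        self.models_ = {}

    def fit_transform(self, X):
        X = np.asarray(X)  # rows = samples, columns = features (variables)
        for j in range(X.shape[1]):
            dfit = distfit(**self.distfit_kwargs)
            dfit.fit_transform(X[:, j], verbose=30)
            self.models_[j] = dfit
        return self

    def best_names(self):
        # best-fitting distribution name per column index
        return {j: m.model['name'] for j, m in self.models_.items()}

# usage: names = MultiDistfit().fit_transform(df.values).best_names()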
Also, parallel processing across a multi-core CPU would be an awesome enhancement! :-)
Guillaume Lemaitre (https://github.com/glemaitre) committed code for sklearn.utils.parallel. He is a core developer of scikit-learn. He may be a good contact on how best to implement parallel processing in Python in 2023.
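For reference, scikit-learn now ships thin wrappers around joblib in sklearn.utils.parallel that also propagate scikit-learn's global config to the workers (public since scikit-learn 1.3, if I read the changelog right); a minimal sketch with a toy worker function:

from sklearn.utils.parallel import Parallel, delayed  # requires scikit-learn >= 1.3

def square(x):
    return x * x

results = Parallel(n_jobs=-2)(delayed(square)(i) for i in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]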