jkmackie opened this issue 1 year ago
Great suggestion. I looked into it, but it is quite a bit of work to integrate such an approach. At the moment, all functions (figures, predict, etc.) are designed for single-column univariate input, not multi-column univariate input. I will put this on my never-ending, always-getting-longer todo list.
Thank you for the reply!
In the meantime, here is starter code to run a distfit exploratory data analysis across multiple CPU cores. Pandas DataFrames are used for readability. (The code can easily be tweaked to use NumPy instead.)
The illustration below uses a numeric-only dataset called Company Bankruptcy Prediction. It has 6819 rows and 96 columns.
Note: error handling is required to run distfit on this dataset; certain columns will error out, with or without parallel processing.
import collections
import re

import numpy as np
import pandas as pd
from distfit import distfit
from joblib import Parallel, delayed
from IPython.display import display  # display() assumes a Jupyter/IPython environment

pd.options.display.max_columns = 100

# Numeric-only data from here (sign-in required):
# https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction/download?datasetVersionNumber=2

#----------------------------------------------------------------------------------
# Clean up column names and lower memory.
#----------------------------------------------------------------------------------
df = pd.read_csv("./data.csv")

for c in df.columns:  # strip leading/trailing spaces; collapse inner whitespace to underscores
    no_beg_end_spaces = c.strip()
    result = re.sub(r"\s+", "_", no_beg_end_spaces)
    df.rename(columns={c: result}, inplace=True)

print('df shape:', df.shape)
display(df.tail(3))

for c in df.columns:  # downcast to float32 to lower memory usage
    df[c] = pd.to_numeric(df[c], downcast='float')
#----------------------------------------------------------------------------------
# Use joblib to run distfit on CPU cores in parallel.
#----------------------------------------------------------------------------------
chunks = np.array_split(df, len(df.columns), axis=1)  # one-column chunks due to the univariate constraint
display(chunks[0].head())
display(chunks[1].head())

def get_distfit(chunk):
    """Fit candidate distributions to one column; return the best-fitting name."""
    try:
        dfit = distfit()  # fresh instance per call; avoids sharing one fitted model across workers
        result = dfit.fit_transform(chunk.values.ravel(), verbose=30)  # ravel (n, 1) -> (n,)
        return result['model']['name']
    except Exception:  # some columns fail to fit, with or without parallelism
        return 'ERROR'

with Parallel(n_jobs=-2, prefer="processes") as parallel:  # n_jobs=-2: all cores except one
    results = parallel(delayed(get_distfit)(chunk) for chunk in chunks)

display(list(zip(df.columns, results))[0:5])  # show best distribution by column
display(sorted(collections.Counter(results).items(), key=lambda x: x[1], reverse=True))
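If the full candidate set is too slow, recent distfit versions let you restrict the search via the `distr` argument of the constructor (it accepts 'popular', 'full', or an explicit list of scipy distribution names, as far as I know); the list below is just an illustration:

def get_distfit_fast(chunk):
    # Same as get_distfit, but only tries a short, illustrative list of candidates.
    try:
        dfit = distfit(distr=['norm', 'expon', 'lognorm', 'gamma', 'beta'])
        return dfit.fit_transform(chunk.values.ravel(), verbose=30)['model']['name']
    except Exception:
        return 'ERROR'

with Parallel(n_jobs=-2, prefer="processes") as parallel:
    fast_results = parallel(delayed(get_distfit_fast)(chunk) for chunk in chunks)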
#----------------------------------------------------------------------------------
# Get best distribution one column at a time (slower than parallel run).
#----------------------------------------------------------------------------------
sequential_outputs = []
for chunk in chunks:
    sequential_outputs.append(get_distfit(chunk))

display(list(zip(df.columns, sequential_outputs))[0:5])  # show best distribution by column
display(sorted(collections.Counter(sequential_outputs).items(), key=lambda x: x[1], reverse=True))
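To check that the parallel run actually pays off on your machine, here is a simple wall-clock comparison using only the standard library:

import time

start = time.perf_counter()
with Parallel(n_jobs=-2, prefer="processes") as parallel:
    _ = parallel(delayed(get_distfit)(chunk) for chunk in chunks)
print(f'parallel:   {time.perf_counter() - start:.1f} s')

start = time.perf_counter()
_ = [get_distfit(chunk) for chunk in chunks]
print(f'sequential: {time.perf_counter() - start:.1f} s')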
I recommend that dfit.fit_transform(X) be extended to accept multiple variables, with each variable fitted individually:

matrix rows = samples; matrix columns = features (variables)

The proposed functionality mirrors the popular scikit-learn API. Here is an example of that API: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
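To make the proposal concrete, here is a rough sketch of what such a multi-column wrapper could look like. MultiDistfit, best_names, and the models_ dict are hypothetical names for illustration, not part of the distfit API:

import numpy as np
from distfit import distfit

class MultiDistfit:
    # Hypothetical wrapper: one univariate distfit model per column,
    # mirroring the fit/transform style of scikit-learn estimators.
    def __init__(self, **distfit_kwargs):
        self.distfit_kwargs = distfit_kwargs
        self.models_ = {}

    def fit_transform(self, X):
        X = np.asarray(X)  # rows = samples, columns = features (variables)
        for j in range(X.shape[1]):
            dfit = distfit(**self.distfit_kwargs)
            dfit.fit_transform(X[:, j], verbose=30)
            self.models_[j] = dfit
        return self

    def best_names(self):
        # best-fitting distribution name per column index
        return {j: m.model['name'] for j, m in self.models_.items()}

# usage: names = MultiDistfit().fit_transform(df.values).best_names()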
Also, parallel processing across a multi-core CPU would be an awesome enhancement! :-)
Guillaume Lemaitre (https://github.com/glemaitre) committed code for sklearn.utils.parallel. He is a core developer of scikit-learn. He may be a good contact on how best to implement parallel processing in Python in 2023.
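For reference, scikit-learn now ships thin wrappers around joblib in sklearn.utils.parallel that also propagate scikit-learn's global config to the workers (public since scikit-learn 1.3, if I read the changelog right); a minimal sketch with a toy worker function:

from sklearn.utils.parallel import Parallel, delayed  # requires scikit-learn >= 1.3

def square(x):
    return x * x

results = Parallel(n_jobs=-2)(delayed(square)(i) for i in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]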