[BUG] `Issue with custom metric in scikit-learn when using multiple processes`

SimplifytradingAI commented 2 months ago

When I define a function for a custom metric in sklearn and use GridSearchCV with multiprocessing, the program simply stalls: no errors, no output. If I use one of the standard metrics instead of a custom one, it works. It also works if I use a custom metric but run on a single core.

I believe the issue is related to joblib (which sklearn uses for multiprocessing). I once saw a joblib message saying "Could not pickle the task to send it to the workers.", but I can't reproduce it.
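Since that message is hard to reproduce inside the grid search, one way to narrow it down is to try serializing the scorer directly, outside of any worker pool. The helper below is only a minimal sketch: `check_roundtrip` is a hypothetical name, and it uses the standard `pickle` module by default, which may behave differently from the serializer joblib actually uses for its workers.

```python
import pickle


def check_roundtrip(obj, dumps=pickle.dumps, loads=pickle.loads):
    """Try to serialize and restore obj; return (ok, error_message)."""
    try:
        loads(dumps(obj))
    except Exception as exc:  # pickling failures surface as several exception types
        return False, f"{type(exc).__name__}: {exc}"
    return True, None


# A builtin is pickled by reference and round-trips without problems.
ok_builtin, _ = check_roundtrip(len)

# A lambda has no importable name, so the standard pickler rejects it.
ok_lambda, err = check_roundtrip(lambda a, b: a == b)

print(ok_builtin, ok_lambda, err)
```

In the script from this report, the object to test would be `make_scorer(my_f)`. To get closer to what the workers see, the `dumps` and `loads` arguments could be swapped for the ones from the cloudpickle copy that joblib ships (`joblib.externals.cloudpickle`), on the assumption that the default loky backend relies on it.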

Steps to replicate the issue:

python == 3.11.9
scikit-learn == 1.5.0
pandas == 2.2.2
pyarmor == 8.5.11

pyarmor command used: pyarmor gen dummy.py

In the code below, simply change "n_jobs=4" to "n_jobs=1" to make it work.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, accuracy_score

def my_f(y_true, y_pred):
    N = y_true.shape[0]
    return (y_true == y_pred).sum() / N

X, y = make_classification(n_samples=1000, n_features=10_000, n_classes=2, random_state=123)
X = pd.DataFrame(data=X, columns=[f"F{i}" for i in range(10_000)])
y = pd.Series(data=y, name="Labels")

params = {
    "n_estimators": (8, 11, 15, 20, 25),
    "criterion": ["gini"],
    "max_depth": (3, 4, 5),
    "min_samples_split": (2, 4),
    "min_samples_leaf": (1, 3),
    "min_weight_fraction_leaf": [0.0],
    "max_features": ["sqrt"],
    "max_leaf_nodes": [None],
    "min_impurity_decrease": [0.0],
    "bootstrap": [True],
    "oob_score": [False],
    "n_jobs": [1],
    "random_state": [123],
    "verbose": [0],
    "warm_start": [False],
    "class_weight": ("balanced", None),
    "ccp_alpha": [0.0],
    "max_samples": [None],
}

grid = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=params,
    scoring=make_scorer(my_f),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=123),
    refit=True,
    verbose=10,
    error_score="raise",
    return_train_score=True,
    n_jobs=4,
)
grid.fit(X=X, y=y, sample_weight=None)
print(grid.best_score_)

SimplifytradingAI commented 2 months ago

I have just found a way to make it work (although I'm not sure why it does). If I move the function "my_f" into a new file and import it in the other file with "from functions import my_f", it works.

Then I run "pyarmor gen src", where src is the name of the folder containing the two files.

Any idea why this makes it work?

jondy commented 2 months ago

No idea. Maybe it's related to the sys._getframe issue; please refer to https://pyarmor.readthedocs.io/en/latest/how-to/third-party.html
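One more angle that might be worth checking, offered as an assumption rather than a confirmed diagnosis: a function that lives in an importable module is pickled by reference (module name plus function name), so a worker only has to import the module again instead of receiving the function's code. The snippet below illustrates that mechanism with the standard `pickle` module and a stdlib function; it does not involve pyarmor or joblib, and whether it explains the behaviour in this report is untested.

```python
import json
import pickle

# Pickling an importable function stores only where to find it,
# not its bytecode.
payload = pickle.dumps(json.dumps)

# The payload carries the module name and the function name.
print(b"json" in payload, b"dumps" in payload)

# Unpickling re-imports the module and looks the function up by name,
# so the restored object is the very same function.
restored = pickle.loads(payload)
print(restored is json.dumps)
```

If the same holds for "my_f" once it sits in its own module, each worker would import the obfuscated "functions" module itself. A function defined in the main script is handled differently by cloudpickle, which serializes it by value, and that path may not cope with obfuscated code.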

dashingsoft / pyarmor

[BUG] `Issue with custom metric in scikit-learn when using multiple processes` #1921