EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

TPOT Config, Terminals are required to have a unique name. #1238

Open mattiacampana opened 2 years ago

mattiacampana commented 2 years ago

Hi, I'm trying to use TPOT with a custom config dictionary that specifies the classifiers and pre-processing techniques to evaluate. However, when I call fit(), I get the following error:

AssertionError: Terminals are required to have a unique name. Consider using the argument 'name' to rename your second PCA__svd_solver=l terminal.

This is my first time with TPOT... where am I going wrong?

Context of the issue

This is the code I'm using:

import itertools

import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from tpot import TPOTClassifier

# Semicolon-separated index strings for the three feature sets
MANUAL_features_indices = ";".join([str(i) for i in range(477)])
VGGISH_features_indices = ";".join([str(i) for i in range(477, 477+256)])
L3_features_indices = ";".join([str(i) for i in range(477+256, 477+256+1024)])

FEATURES_LIST = [MANUAL_features_indices, VGGISH_features_indices, L3_features_indices]

FEATURES_SUBSETS = []
for L in range(0, len(FEATURES_LIST)+1):
    for subset in itertools.combinations(FEATURES_LIST, L):
        if len(subset) != 0:
            FEATURES_SUBSETS.append(list(subset))

tpot_config = {

    "tpot.builtins.FeatureSetSelector": {
        "subset_list": FEATURES_LIST,
        "sel_subset": FEATURES_SUBSETS
    },

    "sklearn.preprocessing.StandardScaler": { },

    "sklearn.preprocessing.Normalizer": {
        "norm": ["l1", "l2", "max"]
    },

    "sklearn.decomposition.PCA": {
        "n_components": [.7, .8, .9, .95, .99],
        "svd_solver": "full"
    },

    "sklearn.svm.SVC": {
        "C": [.01, .1, 1, 10, 100],
        "gamma": [100, 10, 1, .1, .01, .001, "scale", "auto"],
        "kernel": ["rbf", "poly", "sigmoid"],
        "degree": [2, 3, 4, 5, 6],
        "probability": [True],
        "class_weight": ["balanced", None],
     },

    "sklearn.ensemble.RandomForestClassifier": {
        "n_estimators": [10, 20, 50, 100, 200, 500, 1000],
        "min_samples_split": [2, 6, 8, 10, 12, 20],
        "max_depth": [10, 20, 30, 50, 100, 150, 200, None],
        "criterion": ["entropy", "gini"],
        "max_features": ["auto", "sqrt", "log2"],
        "class_weight": ["balanced", None]
     },

    "sklearn.linear_model.LogisticRegression": {
        "penalty": ["none", "l2"],
        "solver": ["newton-cg", "sag", "saga", "lbfgs"],
        "C": np.logspace(-3, 3, 100),
        "max_iter": [300000]
    },

    "sklearn.ensemble.AdaBoostClassifier": {
        "n_estimators": [10, 20, 50, 100, 200, 500, 1000],
        "base_estimator": [SVC(probability = True), LogisticRegression(), None],
        "learning_rate": [10, 5, 1, .5, .1, .05, .01, .001],
        "algorithm": ["SAMME", "SAMME.R"]
    },

    "sklearn.neural_network.MLPClassifier": {
        "activation": ["relu", "tanh", "logistic", "identity"],
        "solver": ["lbfgs", "sgd", "adam"],
        "alpha": [1e-6, 1e-5, 1e-4, 1e-3],
        "batch_size": [16, 32, 64],
        "shuffle": [True],
        "learning_rate": ["constant", "invscaling", "adaptive"],
        "max_iter": [10000],
        "early_stopping": [True],
        "random_state": [42],
        "validation_fraction": [.1, .2]
    }
}

pipeline_optimizer = TPOTClassifier(
    random_state=23,
    generations=5,
    population_size=100,
    scoring="roc_auc",
    cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1),
    subsample=0.1,
    n_jobs=-1,
    verbosity=3,
    periodic_checkpoint_folder="tpot_kcl.txt",
    config_dict=tpot_config
)

pipeline_optimizer.fit(X_train, y_train)

As you can see, I have 3 main sets of features (manual, VGGISH, and L3) and I would like to test different combinations of them. Then I would like to apply PCA with a different number of components, and finally test 5 classifiers: SVM, Random Forest, Logistic Regression, AdaBoost, and MLP.

spenceforce commented 2 years ago

Just ran into this myself. The problem is that one of your config values produces duplicate terminals. TPOT iterates over the values of each config option and creates a terminal per value, so the bare string "full" for PCA__svd_solver is iterated character by character, and the character "l" appears twice. Putting "full" in a list fixes the problem:

  "sklearn.decomposition.PCA": {
      "n_components": [.7, .8, .9, .95, .99],
-     "svd_solver": "full"
+     "svd_solver": ["full"]
  },
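
To see why a bare string trips the assertion, here is a minimal sketch in plain Python (not TPOT internals): TPOT walks over each parameter's values to build terminals, and iterating a string yields its characters, so "full" produces a duplicate "l".

# Iterating a bare string yields its characters, so two terminals would
# both end up named PCA__svd_solver=l.
buggy_values = "full"
print(list(buggy_values))   # ['f', 'u', 'l', 'l'] -> duplicate 'l'

# Wrapped in a list, there is a single value and a single terminal name.
fixed_values = ["full"]
print(list(fixed_values))   # ['full']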