alteryx / evalml

EvalML is an AutoML library written in Python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License

ComponentGraph `describe()` does not differentiate between duplicate components by name #2735

Open angela97lin opened 3 years ago

angela97lin commented 3 years ago

When there are duplicate components, it is difficult to understand which component is being referenced by describe().

from evalml.pipelines import ComponentGraph

# Using a more involved component graph with more complex edges.
# Each key is the name of a node in the graph; the first list element is the
# component to use, and the remaining elements are the node's inputs.
component_dict = {
    "Imputer": ["Imputer", "X", "y"],
    "Target Imputer": ["Target Imputer", "X", "y"],
    "OneHot_RandomForest": ["One Hot Encoder", "Imputer.x", "Target Imputer.y"],
    "OneHot_ElasticNet": ["One Hot Encoder", "Imputer.x", "y"],
    "Random Forest": ["Random Forest Classifier", "OneHot_RandomForest.x", "y"],
    "Elastic Net": ["Elastic Net Classifier", "OneHot_ElasticNet.x", "Target Imputer.y"],
    "Logistic Regression": [
        "Logistic Regression Classifier",
        "Random Forest.x",
        "Elastic Net.x",
        "y",
    ],
}
cg_with_estimators = ComponentGraph(component_dict)
cg_with_estimators.instantiate({})
cg_with_estimators.describe()

prints:

1. Imputer
     * categorical_impute_strategy : most_frequent
     * numeric_impute_strategy : mean
     * categorical_fill_value : None
     * numeric_fill_value : None
2. Target Imputer
     * impute_strategy : most_frequent
     * fill_value : None
3. One Hot Encoder
     * top_n : 10
     * features_to_encode : None
     * categories : None
     * drop : if_binary
     * handle_unknown : ignore
     * handle_missing : error
4. One Hot Encoder
     * top_n : 10
     * features_to_encode : None
     * categories : None
     * drop : if_binary
     * handle_unknown : ignore
     * handle_missing : error
5. Random Forest Classifier
     * n_estimators : 100
     * max_depth : 6
     * n_jobs : -1
6. Elastic Net Classifier
     * penalty : elasticnet
     * C : 1.0
     * l1_ratio : 0.15
     * n_jobs : -1
     * multi_class : auto
     * solver : saga
7. Logistic Regression Classifier
     * penalty : l2
     * C : 1.0
     * n_jobs : -1
     * multi_class : auto
     * solver : lbfgs

There are two One Hot Encoders in the graph (entries 3 and 4), but because describe() prints the official component name, we cannot tell which entry is OneHot_RandomForest and which is OneHot_ElasticNet. We should consider using, or appending, the name the component was given in the ComponentGraph.
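
To make the proposal concrete, here is a rough sketch of the kind of heading describe() could print. It is plain Python over the dictionary above and does not touch EvalML itself: `describe_with_graph_names` is a hypothetical helper, not an existing EvalML function, and it assumes dictionary order matches the order describe() uses (which happens to hold for this example).

```python
# Hypothetical helper, not part of EvalML: builds one heading per graph node,
# appending the official component name when it differs from the node name.
def describe_with_graph_names(component_dict):
    lines = []
    for index, (graph_name, spec) in enumerate(component_dict.items(), start=1):
        component_name = spec[0]  # first element is the component; the rest are inputs
        if graph_name == component_name:
            lines.append(f"{index}. {component_name}")
        else:
            lines.append(f"{index}. {graph_name} ({component_name})")
    return lines


component_dict = {
    "Imputer": ["Imputer", "X", "y"],
    "Target Imputer": ["Target Imputer", "X", "y"],
    "OneHot_RandomForest": ["One Hot Encoder", "Imputer.x", "Target Imputer.y"],
    "OneHot_ElasticNet": ["One Hot Encoder", "Imputer.x", "y"],
    "Random Forest": ["Random Forest Classifier", "OneHot_RandomForest.x", "y"],
    "Elastic Net": ["Elastic Net Classifier", "OneHot_ElasticNet.x", "Target Imputer.y"],
    "Logistic Regression": [
        "Logistic Regression Classifier",
        "Random Forest.x",
        "Elastic Net.x",
        "y",
    ],
}

labels = describe_with_graph_names(component_dict)
for label in labels:
    print(label)
# 1. Imputer
# 2. Target Imputer
# 3. OneHot_RandomForest (One Hot Encoder)
# 4. OneHot_ElasticNet (One Hot Encoder)
# 5. Random Forest (Random Forest Classifier)
# 6. Elastic Net (Elastic Net Classifier)
# 7. Logistic Regression (Logistic Regression Classifier)
```

With headings like these, entries 3 and 4 are no longer ambiguous, and nodes whose name already equals the component name read the same as before.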

dsherry commented 3 years ago

Yeah, not ideal. Almost makes me wonder if we should deprecate describe(). Users can get details from pipeline parameters and component graph / graphviz visualization. Thoughts?