alteryx / evalml

EvalML is an AutoML library written in Python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License

feature request: Improve the description of steps of pipelines #2491

Open naveen-marthala opened 3 years ago

naveen-marthala commented 3 years ago

Please improve the description of the pipeline.

from this:

>>> from evalml import AutoMLSearch
>>> automl = AutoMLSearch(X_train=Xy_3.drop(columns='target'), y_train=Xy_3.loc[:, 'target'],
...                       problem_type='regression', objective='root mean squared error', n_jobs=-1)
>>> automl.search()
>>> print(automl.describe_pipeline(automl.rankings.iloc[0]["id"]))
*********************************************************************************
* XGBoost Regressor w/ Imputer + Text Featurization Component + One Hot Encoder *
*********************************************************************************

Problem Type: regression
Model Family: XGBoost

Pipeline Steps
==============
1. Imputer
     * categorical_impute_strategy : most_frequent
     * numeric_impute_strategy : median
     * categorical_fill_value : None
     * numeric_fill_value : None
2. Text Featurization Component
3. One Hot Encoder
     * top_n : 10
     * features_to_encode : None
     * categories : None
     * drop : if_binary
     * handle_unknown : ignore
     * handle_missing : error
4. XGBoost Regressor
     * eta : 0.053613563977506225
     * max_depth : 20
     * min_child_weight : 6.470576304373694
     * n_estimators : 688
     * n_jobs : -1

to something like the following:

For Steps 1 and 2:

1. Imputer
     * categorical_impute_strategy : most_frequent [categorical columns imputed: 'column_1', 'column_2', 'column_3', 'column_4']
     * numeric_impute_strategy : median [numeric columns imputed: 'column_5', 'column_6', 'column_7', 'column_8']
     * categorical_fill_value : None [same kind of column list as above]
     * numeric_fill_value : None [same kind of column list as above]
2. Text Featurization Component
     * Polarity_Score: [columns polarity was calculated for: 'column_9', 'column_10', 'column_11', 'column_12']
     * Diversity_Score: [same kind of column list as above]
     * LSA: [same kind of column list as above]
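
To make the requested format concrete, here is a minimal sketch of how a step could be rendered with its affected columns appended. The function `format_step` and the `columns_by_param` mapping are hypothetical names used only for illustration; they are not part of EvalML.

```python
def format_step(index, name, params, columns_by_param=None):
    """Render one pipeline step, appending affected columns where known.

    `columns_by_param` is a hypothetical mapping from a parameter name to
    the columns that parameter was applied to.
    """
    columns_by_param = columns_by_param or {}
    lines = [f"{index}. {name}"]
    for key, value in params.items():
        line = f"     * {key} : {value}"
        cols = columns_by_param.get(key)
        if cols:
            # Quote each column name the same way the issue text does.
            quoted = ", ".join(repr(c) for c in cols)
            line += f" [columns: {quoted}]"
        lines.append(line)
    return "\n".join(lines)


example = format_step(
    1,
    "Imputer",
    {"categorical_impute_strategy": "most_frequent",
     "numeric_impute_strategy": "median"},
    {"categorical_impute_strategy": ["column_1", "column_2"],
     "numeric_impute_strategy": ["column_5"]},
)
print(example)
```

Parameters without an entry in the mapping are printed exactly as they are today, so the extra detail would be purely additive.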

Step 3: From what I have noticed so far, not all categories are encoded, even in columns with fewer than 10 unique categories and even though the description says top_n=10. Certain categories have been dropped from the one-hot-encoded columns, for example when a category occurs in less than 0.5% of the data or in only 10 rows. I also don't know whether this happens inside the pipeline search: are all the categories of a column that don't end up with a separate column pushed into a column called something like Column_others?
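
For reference, here is a small sketch of one common top-N encoding scheme, in which only the N most frequent categories get their own column and every other category is left as all zeros, with no catch-all column. This is an illustration only, written with the standard library; the helper names are hypothetical and it is not taken from EvalML's source. Whether EvalML behaves this way is exactly the question above.

```python
from collections import Counter


def top_n_categories(values, top_n=10):
    """Return the categories that would receive their own column."""
    counts = Counter(v for v in values if v is not None)
    if top_n is None:
        return sorted(counts)  # keep every category
    return [cat for cat, _ in counts.most_common(top_n)]


def one_hot(values, categories):
    """Encode each value against the kept categories only."""
    return [{f"col_{cat}": int(v == cat) for cat in categories} for v in values]


values = ["a", "a", "a", "b", "b", "c"]
kept = top_n_categories(values, top_n=2)
rows = one_hot(values, kept)

print(kept)      # the two most frequent categories
print(rows[-1])  # the row for "c" has no column of its own
```

Under this scheme a rare category is indistinguishable from any other dropped category after encoding, which is why it would help if the description listed which categories were kept for each column.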

Step 4: This information about the model sometimes isn't enough to reproduce the results. So, please display all the parameters and not just the ones you set. Almost all models from popular packages nowadays have a get_params() method.
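
As a rough illustration of the gap, the sketch below uses a hypothetical stand-in class, `ToyRegressor`, with a scikit-learn-style `get_params()` that reads the constructor signature. It is not EvalML or XGBoost code, and the parameter names and defaults are made up; it only shows how parameters left at their defaults are missing from a listing of explicitly set values.

```python
import inspect


class ToyRegressor:
    """Hypothetical estimator; parameter names and defaults are illustrative."""

    def __init__(self, eta=0.1, max_depth=6, n_estimators=100, n_jobs=1,
                 subsample=1.0):
        self.eta = eta
        self.max_depth = max_depth
        self.n_estimators = n_estimators
        self.n_jobs = n_jobs
        self.subsample = subsample

    def get_params(self):
        """Return every constructor parameter with its current value."""
        names = inspect.signature(type(self).__init__).parameters
        return {n: getattr(self, n) for n in names if n != "self"}


# Values a search procedure chose explicitly.
explicit = {"eta": 0.05, "max_depth": 20}
model = ToyRegressor(**explicit)

full = model.get_params()
# Parameters that a description based only on `explicit` would omit.
hidden = {k: v for k, v in full.items() if k not in explicit}
print(full)
print(hidden)
```

Printing the full dictionary rather than only the explicitly set values would capture everything needed to rebuild the same model.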


chukarsten commented 3 years ago

Thanks @naveen-marthala, we'll take a look at this and talk about it.