alteryx / evalml

EvalML is an AutoML library written in Python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License

Add documentation in EvalML to promote using Featuretools before EvalML and show how it will be used as part of the AutoML algorithm #2924

Open jeremyliweishih opened 2 years ago

jeremyliweishih commented 2 years ago

This issue tracks adding documentation on using Featuretools before EvalML and explaining how Featuretools is used within EvalML as part of the AutoML algorithm. This should be done at the conclusion of the feature engineering project.

chukarsten commented 2 years ago

@jeremyliweishih not sure about the intent behind this story. Our understanding is that DefaultAlgorithm will be using Featuretools internally, so if the intent is to document what is happening within the DA, sure, makes sense. But where does the "promote using FT before EvalML" come from? Do we have a gap in documentation with the IterativeAlgorithm?

jeremyliweishih commented 2 years ago

@chukarsten #2919 tracks accepting engineered features into AutoMLSearch and then appending a DFSTransformer component to our pipelines. The "promote using Featuretools before EvalML" part of this issue covers documenting how to use FT to engineer features before running AutoMLSearch, and explaining what our process does when a user goes down this path.

MarselScheer commented 2 weeks ago

I recently started with EvalML and tried to use the "features" parameter in AutoMLSearch, but so far I have not gotten it to work. The only current documentation is the parameter description (copied from https://evalml.alteryx.com/en/stable/autoapi/evalml/automl/index.html):

features (list) – List of features to run DFS on AutoML pipelines. Defaults to None. Features will only be computed if the columns used by the feature exist in the search input and if the feature itself is not in search input. If features is an empty list, the DFS Transformer will not be included in pipelines.

To me the documentation sounds like I have to pass a list of column names that I want to run DFS on. But I searched your issue list and found the PR https://github.com/alteryx/evalml/pull/3309, where this parameter was added, and going through the code changes I think one has to apply Featuretools first. This is from the unit tests:

_, features = ft.dfs(
    entityset=es, target_dataframe_name="X", trans_primitives=["absolute"]
)

In that case, the documentation should explain how to use the parameter, ideally with an example.

Anyway, even if it is correct that ft.dfs() needs to be applied up front, I am still a bit confused, because AutoMLSearch expects a pandas DataFrame while DFS can work with an EntitySet, which is much richer. So my current impression is that if I want to use the "features" parameter of AutoMLSearch, I have to apply Featuretools to my data up front, but only to the dataset that I pass to the "X_train" parameter of AutoMLSearch.
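
One way to read the parameter description quoted above is as a filter over feature definitions: a feature is computed only if all of its input columns are in the search input and its own output column is not already there. The sketch below is plain Python and only illustrates that reading. `FeatureSpec` and `features_to_compute` are hypothetical names, not EvalML or Featuretools APIs, and the `ABSOLUTE(...)` column names are illustrative. It does not show how the DFS Transformer is actually implemented.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical stand-in for a feature definition returned by DFS."""

    name: str                # name of the engineered column
    inputs: Tuple[str, ...]  # columns the feature is computed from


def features_to_compute(
    features: Iterable[FeatureSpec], input_columns: Iterable[str]
) -> List[FeatureSpec]:
    """Apply the rule from the quoted parameter description.

    Keep a feature only if every column it needs is in the search input
    and the feature's own column is not already in the search input.
    """
    cols = set(input_columns)
    return [
        f for f in features
        if set(f.inputs) <= cols and f.name not in cols
    ]


features = [
    FeatureSpec("ABSOLUTE(a)", ("a",)),
    FeatureSpec("ABSOLUTE(b)", ("b",)),
    FeatureSpec("ABSOLUTE(c)", ("c",)),
]

# "ABSOLUTE(b)" is already present in the search input and "c" is missing,
# so under this reading only "ABSOLUTE(a)" would still be computed.
search_columns = ["a", "b", "ABSOLUTE(b)"]
selected = [f.name for f in features_to_compute(features, search_columns)]
print(selected)
```

Under this reading the check is purely column-based, which would be consistent with the impression that the features have to be defined on the same flat table that is passed as "X_train". Whether features built from a multi-table EntitySet can be used this way is not answered by the description.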