Intelligent Feature Engineering based on column name

Many use cases rely on data that is common across a domain, such as debt and income levels in financial services, or trial phases and patients recruited in biotech. It would help to be able to automatically identify and engineer features based on their data type and/or column name.

For example:

A "Name" column identified in a dataset would automatically be parsed for titles such as "Mr.", "Miss", and "Mrs.", which would be used to generate new features such as gender.
An "Income" column (as opposed to "Expenditures") would be expected to hold values of 0 or above, so values below 0 would automatically be dropped or capped at 0.
An "Education" column would be ordinally encoded, assigning numerical values to expected education levels, such as 0 for drop out, 1 for associate's degree, 2 for bachelor's degree, etc.
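
As a rough illustration of the three examples above, here is a minimal sketch using pandas. Everything in it is an assumption made for illustration and not an existing evalml API: the `RULES` registry, the `engineer_*` helpers, the title-to-gender mapping, the education scale, and the choice to cap negative incomes at 0 instead of dropping them.

```python
import pandas as pd

# Hypothetical lookup tables; a design document would define the real defaults.
TITLE_TO_GENDER = {"Mr": "male", "Mrs": "female", "Miss": "female", "Ms": "female"}
EDUCATION_LEVELS = {"drop out": 0, "associate's degree": 1, "bachelor's degree": 2}


def engineer_name(col: pd.Series) -> pd.DataFrame:
    """Extract an honorific from each name and derive a gender feature from it."""
    title = col.str.extract(r"\b(Mrs|Miss|Mr|Ms)\b", expand=False)
    return pd.DataFrame(
        {f"{col.name}_title": title, f"{col.name}_gender": title.map(TITLE_TO_GENDER)}
    )


def engineer_income(col: pd.Series) -> pd.DataFrame:
    """Cap negative incomes at 0 (dropping those rows is the other option)."""
    return col.clip(lower=0).to_frame()


def engineer_education(col: pd.Series) -> pd.DataFrame:
    """Ordinally encode education levels; unknown levels become NaN."""
    return col.str.lower().map(EDUCATION_LEVELS).to_frame()


# Hypothetical registry: lower-cased column name -> rule to apply.
RULES = {
    "name": engineer_name,
    "income": engineer_income,
    "education": engineer_education,
}


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the matching rule to every column whose name is in RULES."""
    out = df.copy()
    for column in df.columns:
        rule = RULES.get(str(column).strip().lower())
        if rule is None:
            continue  # no rule for this column name, leave it untouched
        result = rule(df[column])
        for new_column in result.columns:
            out[new_column] = result[new_column]
    return out


df = pd.DataFrame(
    {
        "Name": ["Mr. John Smith", "Mrs. Jane Doe", "Miss Ann Lee"],
        "Income": [50000, -200, 0],
        "Education": ["Bachelor's degree", "Drop out", "Associate's degree"],
    }
)
engineered = engineer_features(df)
```

Keying the rules on a normalized column name keeps the matching step separate from the transformations, so new column types could be supported by adding an entry to the registry.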

Additionally, aggregating alternative names for common features could help: for example, mapping "Line of Credit", "LOC", and "Credit Line" to a single default name that can then be put through the above process.
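
The alias idea could be sketched as a small lookup that runs before any rule is matched. The `ALIASES` table, the `canonical_name` helper, and the choice of "Line of Credit" as the default name are assumptions for illustration only.

```python
import re

# Hypothetical alias table: default name -> alternative names seen in datasets.
ALIASES = {
    "Line of Credit": ["Line of Credit", "LOC", "Credit Line"],
}


def _key(name: str) -> str:
    """Normalize a column name: lower-case it and collapse punctuation and spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


# Reverse lookup from normalized alias to default name.
_LOOKUP = {
    _key(alias): default
    for default, aliases in ALIASES.items()
    for alias in aliases
}


def canonical_name(column: str) -> str:
    """Return the default name for a known alias, or the column name unchanged."""
    return _LOOKUP.get(_key(column), column)


columns = ["LOC", "credit_line", "Income"]
renamed = [canonical_name(c) for c in columns]
```

Normalizing case and separators before the lookup lets variants such as `credit_line` match without listing every spelling in the table.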

This would need a design document to determine the scope and the default use cases to build for initially. A follow-up issue could be filed for integrating it into AutoML, providing automated feature engineering for datasets whose column names match the ones we're looking for.

alteryx / evalml #2010