alteryx / evalml

EvalML is an AutoML library written in python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License

Change `_schema_is_equal` check to `_schema_is_compatible` and use training schema for predict data #4133

Open tamargrey opened 1 year ago

tamargrey commented 1 year ago

Currently, in ComponentGraph._transform_features, when the graph is not already fit, we check whether X's woodwork schema is equal to the ComponentGraph.input_types. If the types do not match, we raise a PipelineError. We do this because having different types at train vs predict can cause unpredictable and confusing errors in our components.

However, this way of checking for and handling unequal schemas can be problematic. First, the error message, `Input X data types are different from the input types the pipeline was fitted on.`, isn't very detailed: the only details it includes are Woodwork.TableSchema.types, which doesn't contain information like feature origins or the woodwork metadata, so the error is difficult to debug. Second, checking for schema equality is too restrictive. There are cases where slightly different woodwork types are inferred for the data, but the data is still inherently compatible with the original types, so we shouldn't need to raise an error. For example, null values can cause data that was originally Integer to be inferred as IntegerNullable, or a column that was Categorical can be inferred as Unknown when the dataset at predict is much smaller.

We should change this logic to be more permissive of these kinds of changes, as long as the data is still compatible with the original types, and improve the description of what differs between the schemas.
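To make the idea of "compatible but not equal" concrete, here is a minimal sketch that works on plain dicts of column name to logical type name rather than on real woodwork schemas. The function `schema_is_compatible` and the `COMPATIBLE_DRIFT` table are hypothetical and are not evalml's or woodwork's API; the table only encodes the two examples mentioned above plus the Boolean analogue.

```python
# Hypothetical table of acceptable drifts: type seen at fit -> types also
# accepted at predict. The names are woodwork logical type names, but the
# table itself is an assumption made for illustration.
COMPATIBLE_DRIFT = {
    "Integer": {"IntegerNullable"},
    "Boolean": {"BooleanNullable"},
    "Categorical": {"Unknown"},
}


def schema_is_compatible(train_types, predict_types):
    """Return True if every fitted column is present with an equal or compatible type.

    Extra columns in predict_types are ignored in this sketch.
    """
    for column, train_type in train_types.items():
        if column not in predict_types:
            return False
        predict_type = predict_types[column]
        allowed = COMPATIBLE_DRIFT.get(train_type, set())
        if predict_type != train_type and predict_type not in allowed:
            return False
    return True


train = {"age": "Integer", "city": "Categorical"}
print(schema_is_compatible(train, {"age": "IntegerNullable", "city": "Unknown"}))
print(schema_is_compatible(train, {"age": "Double", "city": "Categorical"}))
```

The first call prints `True` and the second prints `False`, because `Double` is not listed as an accepted drift for `Integer` in this table.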

To do this, we can:

  1. Check for woodwork schema equality and warn if there are any differences, whether in the columns present, the logical types, the woodwork metadata, or anything else. To do this, we need a better way to describe the difference between woodwork schemas.
  2. If the schemas are not equal, attempt to initialize X with the ComponentGraph.input_types via X.ww.init(schema=self._input_types). As long as the new data is compatible with the original schema, this will work. If some columns have been lost or logical types are incompatible with the data, a woodwork error will be raised. We can then catch that to raise our own error if we'd like.
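The two steps above could be sketched roughly as follows. This is not the evalml implementation: schemas are reduced to plain dicts of column name to logical type name, `PipelineError` is a local stand-in class, `describe_schema_diff` and `align_to_training_schema` are hypothetical helpers, and the `reinit` callable stands in for `X.ww.init(schema=self._input_types)`. Which exception types woodwork raises is also an assumption here.

```python
import warnings


class PipelineError(Exception):
    """Local stand-in for evalml's PipelineError."""


def describe_schema_diff(train_types, predict_types):
    """List human-readable differences between two {column: type name} dicts."""
    messages = []
    for col in sorted(set(train_types) - set(predict_types)):
        messages.append(f"column '{col}' is missing from X")
    for col in sorted(set(predict_types) - set(train_types)):
        messages.append(f"column '{col}' was not present at fit")
    for col in sorted(set(train_types) & set(predict_types)):
        if train_types[col] != predict_types[col]:
            messages.append(
                f"column '{col}': fitted as {train_types[col]}, got {predict_types[col]}"
            )
    return messages


def align_to_training_schema(train_types, predict_types, reinit):
    """Warn about differences, then try to re-initialize with the fitted schema."""
    diff = describe_schema_diff(train_types, predict_types)
    if not diff:
        return predict_types  # step 1: schemas are equal, nothing to do
    warnings.warn("Input schema differs from the fitted schema: " + "; ".join(diff))
    try:
        # step 2: `reinit` plays the role of X.ww.init(schema=self._input_types)
        return reinit(train_types)
    except (TypeError, ValueError) as err:  # assumed exception types
        raise PipelineError(
            "X is not compatible with the fitted schema: " + "; ".join(diff)
        ) from err


# Usage: a reinit callable that always succeeds returns the fitted types.
aligned = align_to_training_schema(
    {"age": "Integer"}, {"age": "IntegerNullable"}, lambda schema: dict(schema)
)
```

Building the warning and the error from the same diff means the user sees which columns or types changed in both cases, which is the gap the current error message leaves open.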

Note 1: There is some logic that relates to the dfs transformer at this step: if it is present in the graph, we only check the equality of the non-engineered features. This logic will still be needed, and improving the descriptions of the difference between two woodwork schemas will make bugs around this logic easier to understand (for example, if we don't maintain feature origins, causing there to be different sets of columns in the schemas, we can see it!).

Note 2: We should complete https://github.com/alteryx/evalml/issues/4077 as part of this implementation. It will require that we override the logical types from self._input_types with the nullable types if they're present in X and not in the corresponding column in self._input_types.
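One way to picture the override described in Note 2, again using plain dicts of column name to logical type name instead of real woodwork schemas. The helper `merge_nullable_types` and the `NULLABLE_VARIANTS` table are hypothetical; the table lists woodwork logical types that have a nullable counterpart.

```python
# Assumed mapping from a non-nullable woodwork logical type to its nullable variant.
NULLABLE_VARIANTS = {
    "Integer": "IntegerNullable",
    "Boolean": "BooleanNullable",
    "Age": "AgeNullable",
}


def merge_nullable_types(train_types, predict_types):
    """Start from the fitted types, but keep a nullable type where X needs one."""
    merged = dict(train_types)
    for column, train_type in train_types.items():
        nullable = NULLABLE_VARIANTS.get(train_type)
        # Override only when X holds exactly the nullable variant of the fitted type.
        if nullable is not None and predict_types.get(column) == nullable:
            merged[column] = nullable
    return merged


fitted = {"age": "Integer", "is_member": "Boolean", "city": "Categorical"}
at_predict = {"age": "IntegerNullable", "is_member": "Boolean", "city": "Unknown"}
print(merge_nullable_types(fitted, at_predict))
```

In this sketch only `age` changes to `IntegerNullable`; `city` keeps its fitted `Categorical` type because `Unknown` is not a nullable variant.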

tamargrey commented 1 year ago

Opened https://github.com/alteryx/woodwork/issues/1670 which will be necessary to properly display the differences in woodwork typing info to users.