alteryx / woodwork

Woodwork is a Python library that provides robust methods for managing and communicating data typing information.
https://woodwork.alteryx.com
BSD 3-Clause "New" or "Revised" License

Auto-initialize Woodwork #1095

Open jeff-hernandez opened 3 years ago

jeff-hernandez commented 3 years ago

As a user, I think it would be helpful to auto-initialize Woodwork when using the accessor. DataFrames contain enough information to create an initial schema based on the data types. There are several methods to update the schema after initializing.
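To illustrate the point about dtypes carrying enough information for a first schema, here is a minimal sketch. The mapping is copied from the dtype and logical type pairs in the accessor output shown later in this thread. `DTYPE_TO_LOGICAL`, `infer_schema`, and the "Unknown" fallback are hypothetical names for illustration only, not Woodwork's API or its actual inference logic.

```python
# Hypothetical mapping from pandas dtype names to logical types and semantic tags,
# taken from the dtype/logical-type pairs shown in this thread.
DTYPE_TO_LOGICAL = {
    "category": ("Categorical", ["category"]),
    "int64": ("Integer", ["numeric"]),
    "float64": ("Double", ["numeric"]),
    "datetime64[ns]": ("Datetime", []),
    "bool": ("Boolean", []),
}


def infer_schema(dtypes):
    """Build a default schema from a {column name: dtype name} mapping.

    With pandas, such a mapping could come from `df.dtypes.astype(str).to_dict()`.
    """
    schema = {}
    for column, dtype in dtypes.items():
        # Assumption of this sketch: unrecognized dtypes get a placeholder type.
        logical_type, tags = DTYPE_TO_LOGICAL.get(dtype, ("Unknown", []))
        schema[column] = {"logical_type": logical_type, "semantic_tags": list(tags)}
    return schema


schema = infer_schema({"quantity": "int64", "order_date": "datetime64[ns]"})
print(schema["quantity"])
```

The schema produced this way is only a starting point; the user would still adjust it afterward with the existing update methods.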

Additionally, I think this is similar to pandas, which does not require data type information to initialize a DataFrame; you can modify the data types afterward. One possible drawback is losing the option to provide typing information before initializing, but making initialization a requirement for the user might be more of a pain point.

As a developer, I think it would help simplify the code since it wouldn't require adding an initialization check to most (if not all?) of the methods in the Woodwork accessor. There may be some points I've overlooked here, but I think this is something worth considering.
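The difference for the accessor code can be sketched in plain Python, without pandas or Woodwork. Every name here (`NotInitError`, `CheckedAccessor`, `AutoInitAccessor`, `_default_schema`) is hypothetical, and the "schema" is just the Python type name of each column's first value; this shows the pattern only, not how Woodwork is implemented.

```python
class NotInitError(Exception):
    """Stand-in for the error raised when no schema has been created."""


def _default_schema(columns):
    # Toy inference: use the Python type name of each column's first value.
    return {
        name: type(values[0]).__name__ if values else "unknown"
        for name, values in columns.items()
    }


class CheckedAccessor:
    """Explicit-init style: each public method guards against a missing schema."""

    def __init__(self, columns):
        self._columns = columns
        self._schema = None

    def init(self):
        self._schema = _default_schema(self._columns)

    def _check_init(self):
        if self._schema is None:
            raise NotInitError("Schema not initialized. Call init() first.")

    def logical_types(self):
        self._check_init()  # this guard is repeated in every public method
        return dict(self._schema)


class AutoInitAccessor:
    """Auto-init style: the schema is built lazily on first access."""

    def __init__(self, columns):
        self._columns = columns
        self._schema = None

    @property
    def schema(self):
        if self._schema is None:
            self._schema = _default_schema(self._columns)
        return self._schema

    def logical_types(self):
        return dict(self.schema)  # no guard needed


columns = {"quantity": [1, 2], "country": ["UK", "FR"]}
print(AutoInitAccessor(columns).logical_types())
```

In the second class the initialization check lives in one property instead of being repeated per method, which is the simplification described above.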

gsheni commented 3 years ago

@jeff-hernandez What would a code example of this look like?

import pandas as pd
import woodwork as ww

df = pd.read_csv("https://api.featurelabs.com/datasets/online-retail-logs-2018-08-28.csv")
# dataframe gets auto-initialized without calling `init`?

jeff-hernandez commented 3 years ago

Entering the accessor for the first time would initialize Woodwork automatically instead of raising an error.

import pandas as pd
import woodwork as ww

df = pd.read_csv("https://api.featurelabs.com/datasets/online-retail-logs-2018-08-28.csv")  # not initialized
df.ww  # auto-initialized
                Physical Type Logical Type Semantic Tag(s)
Column                                                    
order_id             category  Categorical    ['category']
product_id           category  Categorical    ['category']
description          category  Categorical    ['category']
quantity                int64      Integer     ['numeric']
order_date     datetime64[ns]     Datetime              []
unit_price            float64       Double     ['numeric']
customer_name        category  Categorical    ['category']
country              category  Categorical    ['category']
total                 float64       Double     ['numeric']
cancelled                bool      Boolean              []

In contrast, with the current behavior, entering the accessor before initialization raises an error.

import pandas as pd
import woodwork as ww

df = pd.read_csv("https://api.featurelabs.com/datasets/online-retail-logs-2018-08-28.csv") 
df.ww
WoodworkNotInitError: Woodwork not initialized for this DataFrame. Initialize by calling DataFrame.ww.init