alteryx / woodwork

Woodwork is a Python library that provides robust methods for managing and communicating data typing information.
https://woodwork.alteryx.com
BSD 3-Clause "New" or "Revised" License
144 stars, 20 forks

[Feature Request] Smarter variable type inference #1141

Open rohan-gt opened 4 years ago

rohan-gt commented 4 years ago

Is it possible to add smarter variable type inference that detects all the different variable types Featuretools supports, like PhoneNumber, ZipCode, etc., using regex or other rules?

kmax12 commented 4 years ago

this is something we've been thinking about. i think it breaks into 2 categories, both of which would be valuable

  1. validation - does the data match the variable type
  2. inference - can we automatically determine and parse the correct type without user input

i think validation might be a bit easier to start with and would give us a starting point to think about how to do inference
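
To make the two categories concrete, here is a rough sketch using only the standard library. The regex patterns, the `match_rate` and `infer_type` helpers, and the 0.95 threshold are all assumptions made up for illustration; they are not Woodwork's or Featuretools' actual rules or API. Validation asks how well a column matches one given type, and inference reuses that check across all candidate types.

```python
import re

# Hypothetical patterns for illustration only (US-style formats);
# these are not the rules Woodwork or Featuretools actually use.
PATTERNS = {
    "ZipCode": re.compile(r"\d{5}(-\d{4})?"),
    "PhoneNumber": re.compile(
        r"(\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"
    ),
}


def match_rate(values, type_name):
    """Validation: fraction of non-null values that fully match the type's pattern."""
    pattern = PATTERNS[type_name]
    non_null = [str(v) for v in values if v is not None]
    if not non_null:
        return 0.0
    hits = sum(1 for v in non_null if pattern.fullmatch(v))
    return hits / len(non_null)


def infer_type(values, threshold=0.95):
    """Inference: return the best-matching type name, or None if nothing clears the threshold."""
    rates = {name: match_rate(values, name) for name in PATTERNS}
    best = max(rates, key=rates.get)
    return best if rates[best] >= threshold else None


zips = ["02139", "94105-1234", "10001"]
phones = ["(555) 123-4567", "555-123-4567", "+1 555 123 4567"]
print(match_rate(zips, "ZipCode"))
print(infer_type(phones))
print(infer_type(["abc", "def"]))
```

Building inference on top of the validation check is one reason starting with validation seems reasonable: the per-type check is needed either way.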

rohan-gt commented 4 years ago

Hmm, validation could be done on a random sample or, to be more thorough, by scanning all values and applying a confidence threshold, although the latter might be computationally expensive.
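
A minimal sketch of that trade-off, assuming a hypothetical `validate` helper and a made-up ZIP code regex (neither is part of Woodwork): passing `sample_size` checks only a random subset, while leaving it as `None` scans every value. In both cases the column is accepted when the match rate meets the confidence threshold.

```python
import random
import re

# Hypothetical rule for illustration; not an actual Woodwork pattern.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def validate(values, pattern, threshold=0.95, sample_size=None, seed=0):
    """Return (is_valid, match_rate) for a column of values.

    If sample_size is set and smaller than the number of non-null values,
    only a random sample is checked; otherwise every value is scanned.
    """
    values = [str(v) for v in values if v is not None]
    if not values:
        return False, 0.0
    if sample_size is not None and sample_size < len(values):
        # Seeded generator so the sampled result is reproducible.
        values = random.Random(seed).sample(values, sample_size)
    hits = sum(1 for v in values if pattern.fullmatch(v))
    rate = hits / len(values)
    return rate >= threshold, rate


column = ["02139"] * 98 + ["n/a", "unknown"]

# Full scan: exact match rate, cost grows with the column length.
print(validate(column, ZIP_PATTERN))

# Sampled check: cheaper, but the rate is only an estimate.
print(validate(column, ZIP_PATTERN, sample_size=10))
```

The sampled rate can land on either side of the threshold when the true rate is close to it, so a full scan (or a larger sample) is the safer choice for borderline columns.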