alteryx / woodwork

Woodwork is a Python library that provides robust methods for managing and communicating data typing information.
https://woodwork.alteryx.com
BSD 3-Clause "New" or "Revised" License
140 stars, 19 forks

Add a logical type Currency #769

Open dsherry opened 3 years ago

dsherry commented 3 years ago

Currently woodwork will detect data of the form

$1.234
$5.678
...

as "Natural Language". It would be helpful if we created a currency type so that this sort of data gets picked up as numeric by default.
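A minimal sketch of what such a conversion could look like for a US-only format, using only the standard library. The pattern and the helper name `parse_usd` are illustrative assumptions, not part of woodwork's API.

```python
import re
from typing import Optional

# Hypothetical pattern: optional minus sign, "$", digits with optional
# thousands commas, optional decimal part. US-style formatting only.
_USD_PATTERN = re.compile(r"^\s*(-?)\$\s*(\d{1,3}(?:,\d{3})*|\d+)(\.\d+)?\s*$")


def parse_usd(text: str) -> Optional[float]:
    """Return the numeric value of a US-formatted dollar string, or None."""
    match = _USD_PATTERN.match(text)
    if match is None:
        return None
    sign, whole, fraction = match.groups()
    # Drop thousands separators, then reattach the decimal part if present.
    number = whole.replace(",", "") + (fraction or "")
    return float(sign + number)


values = [parse_usd(s) for s in ["$1.234", "$5.678", "$1,000", "hello"]]
```

Strings that do not match the pattern come back as `None`, so a column could be treated as currency only when most of its values parse.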

BoopBoopBeepBoop commented 3 years ago

Which currencies? Currency math typically operates using decimals. Would these types use fixed point storage?
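To illustrate the concern, here is a small standard-library sketch comparing binary floats with `decimal.Decimal` and with integer minor units (cents). It only shows the representation issue and is not a proposal for woodwork's storage.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
float_total = 0.10 + 0.20
float_is_exact = float_total == 0.30

# Decimal values built from strings keep exact decimal digits.
decimal_total = Decimal("0.10") + Decimal("0.20")
decimal_is_exact = decimal_total == Decimal("0.30")

# Fixed-point alternative: store integer minor units (cents).
cents_total = 10 + 20
```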

thehomebrewnerd commented 3 years ago

Building on the questions raised by @BoopBoopBeepBoop, we would also need to think about the different currency symbols used globally, as well as the different conventions for commas and decimal points around the world, unless we wanted this to be a US-only inference.
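A small sketch of why the separators matter: the same digit string reads as two very different amounts depending on the convention. The helper `parse_amount` and its parameters are hypothetical, for illustration only.

```python
def parse_amount(text: str, decimal_sep: str = ".", group_sep: str = ",") -> float:
    """Parse a digit string given explicit separators (hypothetical helper)."""
    # Remove grouping separators first, then normalize the decimal separator.
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


# US convention: "." is the decimal point.
us_reading = parse_amount("1.234", decimal_sep=".", group_sep=",")
# Convention used in much of continental Europe: "." groups thousands.
eu_reading = parse_amount("1.234", decimal_sep=",", group_sep=".")
```

Here `us_reading` is a bit more than one unit while `eu_reading` is over a thousand, so inference would need a locale hint or some heuristic to choose between them.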

dsherry commented 3 years ago

@BoopBoopBeepBoop yeah good questions! I think the physical type for logical type "Currency" should be a float.

@thehomebrewnerd could something like this library help with parsing and with internationalization?

I do think there's value to be had in starting with one localization (e.g. USD), to get the feature working, and to then add support for other localizations.

It occurs to me, it would be cool if woodwork tagged currency columns with the name of the currency they use!

gsheni commented 3 years ago

If a user passes a column with datetime values [12/12/12, 1/1/1, 2/2/2], we will infer and convert these values to datetimes and add Hour, Minute, and Second. So accepting a column with currency symbols, converting it to a Float, and storing the currency information elsewhere is not radically different behavior.

willsmithorg commented 2 years ago

In the finance industry it's rare to have a numerical value and currency stored in a single feature.

Instead, you'd normally have a pair of columns, something like Dividend_currency (ISO 4217 three-letter code) and Dividend_amount (numeric).

Dozens of countries use "dollar" as their unit so "$300" is very ambiguous about which currency the user means.

There would be value in 1) creating a woodwork logical type to store and detect currency codes (ISO 4217 codes) and 2) making a feature primitive in Featuretools that detects column pairs like the above and automatically converts them to USD using a recent exchange rate. That would permit rows in different currencies to be compared more fairly based on value. For example, 14000 IDR is around 1 USD, so unless we convert to USD the IDR value looks like something "high value".
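A rough sketch of the conversion idea in (2), in plain Python rather than as a Featuretools primitive. The rate table is illustrative only (the IDR figure follows the approximate example above), and `to_usd` is a hypothetical helper.

```python
# Illustrative rates only: units of each currency per 1 USD.
UNITS_PER_USD = {"USD": 1, "IDR": 14000}


def to_usd(amount, currency_code, rates=UNITS_PER_USD):
    """Convert an amount to USD using a fixed, made-up rate table."""
    return amount / rates[currency_code]


# (currency code, amount) pairs, as in a Dividend_currency / Dividend_amount layout.
rows = [("IDR", 14000), ("USD", 300), ("IDR", 700000)]
usd_values = [to_usd(amount, code) for code, amount in rows]
```

After conversion the 14000 IDR row is the smallest value rather than looking large next to the 300 USD row.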

I am happy to work on (1). I'll open a new feature for this.

gsheni commented 2 years ago

@willsmithorg You would have to tag the numeric column, perhaps with a currency tag, to specify that it holds currency values. This semantic tag (currency) could work with Double or Integer. Once you have this, the Featuretools primitive could be defined like this:

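from featuretools.primitives import TransformPrimitive
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import CurrencyCode  # the proposed currency-code logical type

class CurrencyToUSD(TransformPrimitive):
    name = "currency_to_usd"
    input_types = [ColumnSchema(semantic_tags={'currency', 'numeric'}), ColumnSchema(logical_type=CurrencyCode)]
    return_type = ColumnSchema(semantic_tags={'numeric'})

gsheni commented 2 years ago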

Another way to tackle this problem is to turn currency into a tuple of values, similar to LatLong. Each value could then have a corresponding currency symbol/code.
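As a rough illustration of the tuple idea, here is a standard-library sketch. The `Money` name and its fields are hypothetical and are not an existing woodwork type.

```python
from decimal import Decimal
from typing import NamedTuple


class Money(NamedTuple):
    """Hypothetical (amount, currency code) pair, analogous to a LatLong tuple."""
    amount: Decimal
    currency_code: str  # e.g. an ISO 4217 three-letter code


column = [Money(Decimal("300"), "USD"), Money(Decimal("14000"), "IDR")]

# Each value carries its own code, so a mixed-currency column stays unambiguous.
codes = {m.currency_code for m in column}
```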

thehomebrewnerd commented 1 year ago

A related article that might be useful when working on a solution for this: https://cs-syd.eu/posts/2022-08-22-how-to-deal-with-money-in-software