dylan-profiler / visions

Type System for Data Analysis in Python
https://dylan-profiler.github.io/visions/visions/getting_started/usage/types.html

Is it possible to add country, currency, company as custom datatypes? #143

Open seshurajup opened 3 years ago

seshurajup commented 3 years ago

How can I extend the string datatype to subdomains with a finite set of values, such as country, country code, and currency?

sbrugman commented 3 years ago

@seshurajup extending the string data type to country, country code and currency is a great application of visions. (In the future they may even be included by default).

Here is an example for country codes (standardized as ISO 3166-1 alpha-2 codes). There are multiple ways of defining your type; the example implementation below tests whether all values fall within a discrete set of country codes:

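# Import paths as commonly used with visions; they may differ between versions
from visions.relations import IdentityRelation
from visions.types import String
from visions.types.type import VisionsBaseType

def is_len_2(series):
    # True only if every value has length 2 and no value is missing
    return (series.str.len() == 2).all() and not series.hasnans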

def is_alpha2(series):
    # Country codes to test membership against
    iso_3166_alpha_2_codes = [
        'AF', 'AX', 'AL', 'DZ', 'AS', 'AD', 'AO', 'AI', 'AQ', 'AG', 'AR',
        'AM', 'AW', 'AU', 'AT', 'AZ', 'BS', 'BH', 'BD', 'BB', 'BY', 'BE',
        'BZ', 'BJ', 'BM', 'BT', 'BO', 'BQ', 'BA', 'BW', 'BV', 'BR', 'IO',
        'BN', 'BG', 'BF', 'BI', 'CV', 'KH', 'CM', 'CA', 'KY', 'CF', 'TD',
        'CL', 'CN', 'CX', 'CC', 'CO', 'KM', 'CG', 'CD', 'CK', 'CR', 'CI',
        'HR', 'CU', 'CW', 'CY', 'CZ', 'DK', 'DJ', 'DM', 'DO', 'EC', 'EG',
        'SV', 'GQ', 'ER', 'EE', 'SZ', 'ET', 'FK', 'FO', 'FJ', 'FI', 'FR',
        'GF', 'PF', 'TF', 'GA', 'GM', 'GE', 'DE', 'GH', 'GI', 'GR', 'GL',
        'GD', 'GP', 'GU', 'GT', 'GG', 'GN', 'GW', 'GY', 'HT', 'HM', 'VA',
        'HN', 'HK', 'HU', 'IS', 'IN', 'ID', 'IR', 'IQ', 'IE', 'IM', 'IL',
        'IT', 'JM', 'JP', 'JE', 'JO', 'KZ', 'KE', 'KI', 'KP', 'KR', 'KW',
        'KG', 'LA', 'LV', 'LB', 'LS', 'LR', 'LY', 'LI', 'LT', 'LU', 'MO',
        'MG', 'MW', 'MY', 'MV', 'ML', 'MT', 'MH', 'MQ', 'MR', 'MU', 'YT',
        'MX', 'FM', 'MD', 'MC', 'MN', 'ME', 'MS', 'MA', 'MZ', 'MM', 'NA',
        'NR', 'NP', 'NL', 'NC', 'NZ', 'NI', 'NE', 'NG', 'NU', 'NF', 'MK',
        'MP', 'NO', 'OM', 'PK', 'PW', 'PS', 'PA', 'PG', 'PY', 'PE', 'PH',
        'PN', 'PL', 'PT', 'PR', 'QA', 'RE', 'RO', 'RU', 'RW', 'BL', 'SH',
        'KN', 'LC', 'MF', 'PM', 'VC', 'WS', 'SM', 'ST', 'SA', 'SN', 'RS',
        'SC', 'SL', 'SG', 'SX', 'SK', 'SI', 'SB', 'SO', 'ZA', 'GS', 'SS',
        'ES', 'LK', 'SD', 'SR', 'SJ', 'SE', 'CH', 'SY', 'TW', 'TJ', 'TZ',
        'TH', 'TL', 'TG', 'TK', 'TO', 'TT', 'TN', 'TR', 'TM', 'TC', 'TV',
        'UG', 'UA', 'AE', 'GB', 'US', 'UM', 'UY', 'UZ', 'VU', 'VE', 'VN',
        'VG', 'VI', 'WF', 'EH', 'YE', 'ZM', 'ZW',
    ]
    # True only if every value in the series is one of the codes above
    return series.isin(iso_3166_alpha_2_codes).all()

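class CountryCode(VisionsBaseType):
    @classmethod
    def get_relations(cls):
        # CountryCode is related to String: every country code series is also a string series
        return [IdentityRelation(cls, String)]

    @classmethod
    def contains_op(cls, series, state):
        # A series belongs to CountryCode if all values are two-character, valid alpha-2 codes
        return is_len_2(series) and is_alpha2(series)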
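
The same finite-domain test could probably be reused for the other domains mentioned in the question, such as currency. Below is a minimal sketch under some assumptions: `make_finite_domain_check`, `CURRENCY_CODES` and `is_currency_code` are hypothetical names that are not part of visions, and the currency list is a small illustrative subset of ISO 4217 rather than the full standard. The sketch depends only on pandas.

```python
import pandas as pd

# Hypothetical, deliberately incomplete subset of ISO 4217 currency codes.
CURRENCY_CODES = ["USD", "EUR", "GBP", "JPY", "INR", "CHF"]


def make_finite_domain_check(allowed):
    """Return a predicate that tests whether every value of a Series is in `allowed`."""
    allowed = list(allowed)

    def check(series: pd.Series) -> bool:
        # isin() yields False for missing values, so a Series containing
        # None/NaN fails the check without a separate hasnans test.
        return bool(series.isin(allowed).all())

    return check


# Build a currency predicate from the illustrative list above.
is_currency_code = make_finite_domain_check(CURRENCY_CODES)

print(is_currency_code(pd.Series(["USD", "EUR"])))   # all values allowed
print(is_currency_code(pd.Series(["USD", "XXX"])))   # "XXX" is not in the list
print(is_currency_code(pd.Series(["USD", None])))    # missing value present
```

A predicate built this way could be called from a `contains_op` like the one in `CountryCode`, so each new finite domain would only need its own list of allowed values. Note that, like the example above, it returns True for an empty series, which may or may not be what you want.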

More information is available in the documentation (or comment below).

seshurajup commented 3 years ago

How can I contribute to visions to incorporate year, country, country code, and other finite lists such as file types as extensions of string, as you described? The page https://dylan-profiler.github.io/visions/visions/creator/contributing is broken.