Canadensys / narwhal-processor

Basic data processing library aiming to normalize similar values in a known format.
MIT License

Processor should be case, accent and hyphen agnostic #2

Closed (peterdesmet closed this issue 11 years ago)

peterdesmet commented 11 years ago

Looking at this dictionary file, it seems the processor is case-insensitive, but not accent-agnostic (¨ ^ ´ etc.) or hyphen-agnostic (- –).

I think it should be. Otherwise a lot of time will be invested in creating dictionaries that cover every variant, when the processor could handle this easily.

This is sensible:

REGION DE BRUXELLES CAPITALE    BRU
REGION DE BRUXELLES CAPITAL     BRU
REGION BRUXELLES CAPITALE       BRU
REGION BRUXELLES CAPITAL        BRU
BRUXELLES                       BRU
BRUXELLE                        BRU

This is not:

RÉGION DE BRUXELLES-CAPITALE    BRU
REGION DE BRUXELLES-CAPITALE    BRU
RÉGION DE BRUXELLES CAPITALE    BRU
REGION DE BRUXELLES CAPITALE    BRU
RÉGION BRUXELLES-CAPITALE       BRU
REGION BRUXELLES-CAPITALE       BRU
RÉGION BRUXELLES CAPITALE       BRU
REGION BRUXELLES CAPITALE       BRU
etc.

If the narwhal can already deal with this: great! But then we should update the documentation and only use uppercase, no accents, and no hyphens in our dictionaries.
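To illustrate what "case, accent and hyphen agnostic" could mean in practice, here is a minimal Python sketch. It is not narwhal-processor code: the `fold` helper and the `DASHES` constant are hypothetical names introduced only for this example, and the actual library may normalize differently.

```python
import unicodedata

# Hypothetical set of dash characters to treat as separators (assumption).
DASHES = "-\u2010\u2011\u2012\u2013\u2014"

def fold(value: str) -> str:
    """Return an uppercase, accent-free, dash-free form of value."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat every dash variant as a word separator.
    for dash in DASHES:
        no_accents = no_accents.replace(dash, " ")
    # Collapse runs of whitespace and compare in uppercase.
    return " ".join(no_accents.split()).upper()

variants = [
    "RÉGION DE BRUXELLES-CAPITALE",
    "Region de Bruxelles\u2013Capitale",  # en dash
    "région de bruxelles capitale",
]
# All variants fold to the same single dictionary key.
assert {fold(v) for v in variants} == {"REGION DE BRUXELLES CAPITALE"}
```

With folding of this kind applied to incoming values, one dictionary row per spelling would be enough, instead of one row per combination of accent and hyphen.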

cgendreau commented 11 years ago

Narwhal should take care of this. We should use uppercase, no accents, and no hyphens in our dictionaries. I will validate the behavior and update the documentation.

peterdesmet commented 11 years ago

I would also ignore single quotes (as in Province d'Anvers), as they can be written in different ways: '‘’‹›. I would suggest replacing single quotes with nothing, as opposed to hyphens, which I would replace with a space:

PROVINCE D'ANVERS -> PROVINCE DANVERS
BRUSSELS-CAPITAL REGION -> BRUSSELS CAPITAL REGION

:snowflake:
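The two rules suggested in this comment (drop single quotes, turn hyphens into spaces) can be sketched as a character translation table. This only illustrates the suggestion as written; `RULES` and `apply_punctuation_rules` are hypothetical names, and it is not a description of what the processor actually does.

```python
# Quote variants listed in the comment: ' ‘ ’ ‹ ›
QUOTES = "'\u2018\u2019\u2039\u203a"
# Hyphen variants mentioned earlier in the thread: - and en dash.
HYPHENS = "-\u2013"

# Hypothetical rule table: quotes are deleted, hyphens become a space.
RULES = {ord(q): None for q in QUOTES}
RULES.update({ord(h): " " for h in HYPHENS})

def apply_punctuation_rules(value: str) -> str:
    """Apply the suggested quote and hyphen rules to value."""
    return value.translate(RULES)

assert apply_punctuation_rules("PROVINCE D'ANVERS") == "PROVINCE DANVERS"
assert apply_punctuation_rules("PROVINCE D\u2019ANVERS") == "PROVINCE DANVERS"
assert apply_punctuation_rules("BRUSSELS-CAPITAL REGION") == "BRUSSELS CAPITAL REGION"
```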

cgendreau commented 11 years ago

The rules are now up to date: https://github.com/Canadensys/narwhal-processor/wiki/How-to-contribute The only difference from the suggestion is that hyphens are removed like all other punctuation, to avoid character-specific exceptions. The dictionary looks like this:

REGION DE BRUXELLES-CAPITALE    BRU
REGION DE BRUXELLES CAPITALE    BRU

You have the punctuated version and the version with a space. The punctuated version will accept all punctuation.

dshorthouse commented 11 years ago

Christian - could you clarify the relationship between the dictionary and the eventual display of data that are aggregated using these dictionaries? In other words, what will a user see for StateProvince if the source data contains variants of e.g. Québec? Will these be presented according to a standard (if one exists)? Would that be some sort of "master" dictionary of controlled terms that is mapped to "QC"?

cgendreau commented 11 years ago

The goal of the processor is to 'understand' a word, in this case a province of Canada. So the dictionary is used to know that 'Québec' means 'QC', where 'QC' is a controlled word from an enum. What is then displayed is up to the caller. You could use the English name as defined in the enum, or simply create your own mapping for your own language.
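The separation described here, where the dictionary resolves a raw value to a controlled enum member and the caller picks the display label, could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the `Province` enum, the `DICTIONARY` and `FRENCH_NAMES` mappings, and the `understand` function are hypothetical and do not reflect narwhal-processor's actual API.

```python
import unicodedata
from enum import Enum
from typing import Optional

class Province(Enum):
    # Hypothetical controlled vocabulary; the value holds an English name.
    QC = "Quebec"
    ON = "Ontario"

# Hypothetical dictionary: normalized variant -> controlled enum member.
DICTIONARY = {"QUEBEC": Province.QC, "QC": Province.QC, "ONTARIO": Province.ON}

# Caller-side mapping for another display language (also hypothetical).
FRENCH_NAMES = {Province.QC: "Québec", Province.ON: "Ontario"}

def understand(raw: str) -> Optional[Province]:
    """Resolve a raw value to a controlled member, or None if unknown."""
    decomposed = unicodedata.normalize("NFD", raw.strip())
    key = "".join(c for c in decomposed if not unicodedata.combining(c)).upper()
    return DICTIONARY.get(key)

match = understand("Québec")
assert match is Province.QC
assert match.name == "QC"               # controlled code
assert match.value == "Quebec"          # English name from the enum
assert FRENCH_NAMES[match] == "Québec"  # caller's own mapping
assert understand("Atlantis") is None   # unknown values are not guessed
```

Keeping the display mapping outside the dictionary means the dictionary only has to answer "which controlled term is this?", and each caller stays free to choose its own labels.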