Closed by peterdesmet 11 years ago
Narwhal should take care of this. We should use uppercase, no accents, and no hyphens in our dictionaries. I will validate the behavior and update the documentation.
I would also ignore single quotes (as in Province d'Anvers), as they can be written in different ways: '‘’‹›. I would suggest replacing single quotes with nothing, as opposed to hyphens, which I would replace with a space:
PROVINCE D'ANVERS -> PROVINCE DANVERS
BRUSSELS-CAPITAL REGION -> BRUSSELS CAPITAL REGION
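The two replacements suggested above can be sketched in a few lines of Python. This is only an illustration of the proposal, not Narwhal code: the function name `normalize_punctuation` and the two character sets are assumptions taken from the characters mentioned in this thread.

```python
# Hypothetical helper illustrating the suggestion; not part of narwhal-processor.
QUOTE_CHARS = "'‘’‹›"   # single-quote variants listed in the comment above
HYPHEN_CHARS = "-–"     # hyphen and en dash


def normalize_punctuation(name: str) -> str:
    """Drop single-quote variants and turn hyphens into spaces."""
    table = {ord(c): None for c in QUOTE_CHARS}        # None deletes the character
    table.update({ord(c): " " for c in HYPHEN_CHARS})  # hyphens become a space
    # Collapse any runs of whitespace left behind by the replacements.
    return " ".join(name.translate(table).split())


print(normalize_punctuation("PROVINCE D'ANVERS"))        # PROVINCE DANVERS
print(normalize_punctuation("BRUSSELS-CAPITAL REGION"))  # BRUSSELS CAPITAL REGION
```

Deleting quotes rather than replacing them with a space keeps D'ANVERS as a single token, whichever quote character the source used.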
:snowflake:
The rules are up to date: https://github.com/Canadensys/narwhal-processor/wiki/How-to-contribute The only difference from the suggestion is that hyphens are removed like all other punctuation, to avoid character-specific exceptions. The dictionary looks like this:
REGION DE BRUXELLES-CAPITALE BRU
REGION DE BRUXELLES CAPITALE BRU
You have the punctuated version and the version with a space. The punctuated version will accept all punctuation.
Christian, could you clarify the relationship between the dictionary and the eventual display of data that are aggregated using these dictionaries? In other words, what will a user see for StateProvince if the source data contains variants of, e.g., Québec? Will these be presented according to a standard (if one exists)? Would that be some sort of "master" dictionary of controlled terms that is mapped to "QC"?
The goal of the processor is to 'understand' a word, in this case a province of Canada. So the dictionary is used to know that 'Québec' means 'QC', where 'QC' is a controlled word from an enum. What is displayed is then up to the caller: you could use the English name as defined in the enum, or simply create your own mapping for your own language.
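To make that split between parsing and display concrete, here is a rough Python sketch. It does not reproduce Narwhal's actual API: the `Province` enum, the `DICTIONARY` entries, `parse_province`, and `FRENCH_NAMES` are all hypothetical names invented for the illustration.

```python
from enum import Enum
from typing import Optional


class Province(Enum):
    # Hypothetical controlled vocabulary; each value is an English display name.
    QC = "Quebec"
    ON = "Ontario"


# Hypothetical dictionary: spelling variants mapped to the controlled term.
DICTIONARY = {
    "QUEBEC": Province.QC,
    "QUÉBEC": Province.QC,
    "QC": Province.QC,
    "ONTARIO": Province.ON,
}


def parse_province(raw: str) -> Optional[Province]:
    """Return the controlled term for a raw value, or None if it is unknown."""
    return DICTIONARY.get(raw.strip().upper())


# The caller decides what to display, for example with its own French mapping.
FRENCH_NAMES = {Province.QC: "Québec", Province.ON: "Ontario"}

term = parse_province("Québec")
print(term.name)           # QC
print(term.value)          # Quebec
print(FRENCH_NAMES[term])  # Québec
```

The processor's job ends at the controlled term (`QC` here); both the English name and the French mapping are choices made on the caller's side.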
Looking at this dictionary file, it seems the processor is case-insensitive but not accent-agnostic (¨ ^ ´ etc.) or hyphen-agnostic (- –). I think it should be; otherwise a lot of time will be invested in creating dictionaries that can handle all cases, while the processor can do this easily.
This is sensible:
This is not:
If the narwhal can already deal with this: great! But then we should update the documentation and only use uppercase, no accents, and no hyphens in our dictionaries.
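As a rough illustration of what accent- and hyphen-agnostic matching could look like on the processor side, here is a minimal Python sketch. It assumes Unicode decomposition is an acceptable way to strip accents and that hyphens become spaces. The function name `fold` is hypothetical, and this is not how Narwhal is actually implemented.

```python
import unicodedata


def fold(text: str) -> str:
    """Uppercase the text, strip accents, and replace hyphens with spaces."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the non-combining characters, which drops the accents.
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_hyphens = no_accents.replace("-", " ").replace("–", " ")
    # Uppercase and collapse repeated whitespace.
    return " ".join(no_hyphens.upper().split())


print(fold("Région de Bruxelles-Capitale"))  # REGION DE BRUXELLES CAPITALE
print(fold("Québec"))                        # QUEBEC
```

If the input is folded this way before lookup, the dictionaries only need the uppercase, unaccented, unhyphenated form of each name.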