Closed cmutel closed 6 months ago
Probably not necessary but we might also do some unicode normalization as well.
@cmutel I think you are right that matches should be case insensitive. How about we convert while we initialize each Flow.name
so that we don't need extra match functions. Something like:
>>> import unicodedata
>>> def normalize_name(s):
... return unicodedata.normalize('NFC', s).lower()
...
>>> names = ["\u0075\u0308ber", "\u0055\u0308ber", "\u00FCber", "\u00DCber"]
>>> set(names)
{'Über', 'Über', 'über', 'über'}
>>> [name == 'über' for name in names]
[False, False, True, False]
>>> [normalize_name(name) == 'über' for name in names]
[True, True, True, True]
(My thumbs up is agreement, in case that wasn't clear)
@cmutel although the matching should be case-insensitive the return values for Flowmap.to_randonneur
and Flowmap.to_glad
should be the original name for the target flows, correct? For the source flow I'm assuming this is the case per our discussion in https://github.com/fjuniorr/flowmapper/issues/27
For example, should I return (1):
{
"source": {
"name": "1,4-Butanediol",
"categories": ["Air", "(unspecified)"],
"unit": "kg"
},
"target": {
"uuid": "09db39be-d9a6-4fc3-8d25-1f80b23e9131",
"name": "1,4-Butanediol", # upper case
"context": "air/unspecified",
"unit": "kg",
},
"conversion_factor": 1,
"comment": "Identical names",
}
or (2):
{
"source": {
"name": "1,4-Butanediol",
"categories": ["Air", "(unspecified)"],
"unit": "kg"
},
"target": {
"uuid": "09db39be-d9a6-4fc3-8d25-1f80b23e9131",
"name": "1,4-butanediol", # lower case
"context": "air/unspecified",
"unit": "kg",
},
"conversion_factor": 1,
"comment": "Identical names",
}
In some sense I could ask the same question about unit
when https://github.com/fjuniorr/flowmapper/issues/49 is done (eg. if the unit of the target flow was normalized, should I return the original or the normalized) and context
(which we could convert to lowercase instead of add case variations to CONTEXT_MAPPING
).
My feeling is that we want (2) for all the properties.
It might even be the case that for context we should return a list instead of '/'.join
(except for Flowmap.to_glad
which is a tabular representation).
We got 2 extra matches in agribalyse-3.1.1-biosphere
:
92681c17-2514-4151-aabf-baee3fd8e5f5
)fd950a9b-db9f-41c9-a8b0-d61f152f6ef2
) via the synonym "Dipropylthiocarbamic Acid S-Ethyl Ester"I've used the new FlowProperty
for handling Flow.name
and Flow.synonyms
.
It seems like name matching is case sensitive - is this really necessary? Maybe we can put case-insensitivity at the bottom of the list of functions to make sure it comes last?