fjuniorr / flowmapper

Mappings between elementary flows
MIT License
0 stars 1 forks source link

Add `name` field with case-insensitive matching? #46

Closed cmutel closed 6 months ago

cmutel commented 6 months ago

It seems like name matching is case sensitive - is this really necessary? Maybe we can put case-insensitivity at the bottom of the list of functions to make sure it comes last?

cmutel commented 6 months ago

Probably not necessary but we might also do some unicode normalization as well.

fjuniorr commented 6 months ago

@cmutel I think you are right that matches should be case insensitive. How about we convert while we initialize each Flow.name so that we don't need extra match functions. Something like:

>>> import unicodedata
>>> def normalize_name(s):
...     return unicodedata.normalize('NFC', s).lower()
... 
>>> names = ["\u0075\u0308ber", "\u0055\u0308ber", "\u00FCber", "\u00DCber"]
>>> set(names)
{'Über', 'Über', 'über', 'über'}
>>> [name == 'über' for name in names]
[False, False, True, False]
>>> [normalize_name(name) == 'über' for name in names]
[True, True, True, True]
cmutel commented 6 months ago

(My thumbs up is agreement, in case that wasn't clear)

fjuniorr commented 6 months ago

@cmutel although the matching should be case-insensitive the return values for Flowmap.to_randonneur and Flowmap.to_glad should be the original name for the target flows, correct? For the source flow I'm assuming this is the case per our discussion in https://github.com/fjuniorr/flowmapper/issues/27

For example, should I return (1):

{
    "source": {
        "name": "1,4-Butanediol",
        "categories": ["Air", "(unspecified)"],
        "unit": "kg"
    },
    "target": {
        "uuid": "09db39be-d9a6-4fc3-8d25-1f80b23e9131",
        "name": "1,4-Butanediol", # upper case
        "context": "air/unspecified",
        "unit": "kg",
    },
    "conversion_factor": 1,
    "comment": "Identical names",
}

or (2):

{
    "source": {
        "name": "1,4-Butanediol",
        "categories": ["Air", "(unspecified)"],
        "unit": "kg"
    },
    "target": {
        "uuid": "09db39be-d9a6-4fc3-8d25-1f80b23e9131",
        "name": "1,4-butanediol", # lower case
        "context": "air/unspecified",
        "unit": "kg",
    },
    "conversion_factor": 1,
    "comment": "Identical names",
}

In some sense I could ask the same question about unit when https://github.com/fjuniorr/flowmapper/issues/49 is done (eg. if the unit of the target flow was normalized, should I return the original or the normalized) and context (which we could convert to lowercase instead of add case variations to CONTEXT_MAPPING ).

My feeling is that we want (2) for all the properties.

It might even be the case that for context we should return a list instead of '/'.join (except for Flowmap.to_glad which is a tabular representation).

fjuniorr commented 6 months ago

We got 2 extra matches in agribalyse-3.1.1-biosphere:

I've used the new FlowProperty for handling Flow.name and Flow.synonyms.