Add matching against synonyms

cmutel commented 6 months ago

In some cases a source or target list has synonyms, and these can really help improve the matching percentages. They are currently not used, and some work is needed (alterations to the config, figuring out how to deal with lists of synonym dicts in glom, etc.).

cmutel commented 6 months ago

@fjuniorr I hacked together something and it boosted the matching by quite a lot. I am hoping you can figure out how to deal with glom (I tried https://stackoverflow.com/questions/67518774/get-first-item-from-nested-list-in-glom but just skipped glom in my test code).

fjuniorr commented 6 months ago

> @fjuniorr I hacked together something and it boosted the matching by quite a lot.

@cmutel did you push this to GitHub? I didn't find it in your fork.

cmutel commented 6 months ago

> @cmutel did you push this to GitHub? I didn't find it in your fork.

No, it was a dirty hack which didn't refer to the config. The synonym data in ecoinvent looks like:

    "synonym": [
      {
        "@xml:lang": "en",
        "#text": "perfluoromethane, tetrafluoromethane"
      },
      {
        "@xml:lang": "en",
        "#text": "R-14"
      },
      {
        "@xml:lang": "en",
        "#text": "fc-14"
      },
      {
        "@xml:lang": "en",
        "#text": "Methane, tetrafluoro-"
      }
    ],
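
A minimal sketch of pulling the synonym strings out of this structure in plain Python, roughly the kind of glom-free extraction a quick test could use. The helper name `synonym_texts` is hypothetical and not part of flowmapper, and the handling of a single bare dict is an assumption about how the data may be serialized:

```python
# Hypothetical helper; not part of flowmapper.
def synonym_texts(flow):
    """Return the '#text' value of each synonym entry, or [] if there are none."""
    synonyms = flow.get("synonym", [])
    # Assumption: a lone synonym might be stored as a dict rather than a list.
    if isinstance(synonyms, dict):
        synonyms = [synonyms]
    return [entry["#text"] for entry in synonyms if "#text" in entry]


flow = {
    "synonym": [
        {"@xml:lang": "en", "#text": "perfluoromethane, tetrafluoromethane"},
        {"@xml:lang": "en", "#text": "R-14"},
    ]
}
print(synonym_texts(flow))  # ['perfluoromethane, tetrafluoromethane', 'R-14']
print(synonym_texts({}))    # []
```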

fjuniorr commented 6 months ago

Were you thinking about:

1. evaluating every match rule that uses the flow name with each flow synonym, or
2. creating a single match rule for equality between the original flow name and each flow synonym?
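
To make the difference between the two approaches concrete, here is a rough sketch. Every name in it (`NAME_RULES`, `option_1`, `option_2`, and the two stand-in rules) is hypothetical and does not reflect how flowmapper's rules are actually written:

```python
# All names here are hypothetical; none come from flowmapper.

def identical_names(a, b):
    return a == b


def names_equal_ignoring_case(a, b):
    return a.strip().lower() == b.strip().lower()


# Stand-ins for the existing match rules that compare flow names.
NAME_RULES = [identical_names, names_equal_ignoring_case]


def option_1(source_name, target_name, target_synonyms):
    """Evaluate every name-based rule against the target name and each synonym."""
    candidates = [target_name, *target_synonyms]
    return any(rule(source_name, c) for rule in NAME_RULES for c in candidates)


def option_2(source_name, target_synonyms):
    """A single extra rule: the source name equals one of the target synonyms."""
    return any(source_name == synonym for synonym in target_synonyms)


synonyms = ["R-14", "fc-14"]
print(option_1("r-14", "Tetrafluoromethane", synonyms))  # True
print(option_2("r-14", synonyms))                        # False
print(option_2("R-14", synonyms))                        # True
```

In this sketch the first approach multiplies the number of comparisons (rules times synonyms) and lets looser rules fire on synonyms, while the second keeps synonym matching as one strict, separate rule.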

Regarding glom, we can extract nested data using a (path, [subpath]) pattern. For example:

    from glom import glom

    flow = {
        "synonym": [
            {"@xml:lang": "en", "#text": "perfluoromethane, tetrafluoromethane"},
            {"@xml:lang": "en", "#text": "R-14"},
            {"@xml:lang": "en", "#text": "fc-14"},
            {"@xml:lang": "en", "#text": "Methane, tetrafluoro-"}
        ]
    }
    glom(flow, ('synonym', ['#text']))
    # ['perfluoromethane, tetrafluoromethane', 'R-14', 'fc-14', 'Methane, tetrafluoro-']

This spec is awkward to store as a string in a TOML config file. Some options would be:

1. Accept only the top-level property synonym in the config and assume a specific schema for the ecoinvent data
2. Switch from TOML to regular .py files for storing the field mapping config
3. Use something like ast.literal_eval("('synonym', ['#text'])") and keep the existing TOML config
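
For reference, a small sketch of what the ast.literal_eval route would involve. The variable `raw_spec` stands in for a string value read from the config, and `is_safe_literal` is a hypothetical helper, not something in flowmapper:

```python
import ast

# The spec as it might be stored, as a plain string, in the config file.
raw_spec = "('synonym', ['#text'])"

# literal_eval only accepts Python literals (strings, numbers, tuples, lists,
# dicts, ...), so it rebuilds the tuple without executing arbitrary code.
spec = ast.literal_eval(raw_spec)
print(spec)  # ('synonym', ['#text'])


def is_safe_literal(text):
    """Return True if text parses as a plain Python literal."""
    try:
        ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return False
    return True


print(is_safe_literal(raw_spec))                     # True
print(is_safe_literal("__import__('os').getcwd()"))  # False
```

The resulting tuple has the same shape as the spec passed to glom in the example above, so the config could stay in its current format at the cost of embedding Python syntax inside a string.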

Overall, my suggestion would be the pair 2-2 (the second option in each list), but it would be nice to hear what you think before coding starts.

cmutel commented 6 months ago

> Were you thinking about: evaluating every match rule that uses the flow name with each flow synonym? or creating a single match rule for equality between the original flow name and each flow synonym?

This is an excellent question - I actually really like the way you have set things up with custom classes for certain attributes where one can easily test for equality. I hope we can extend this to unit conversions, I guess you are thinking the same thing.

But the fact that we have to ask this question is bothering me a bit. Basically, I think we should have one generic match rule, and then some exceptions that handle specific cases. The generic rule should handle category/context matching (I think this is good now?) and unit conversions. The special cases are then not really relevant, as they are mostly related to naming, and synonyms are another special case, one that sits in parallel with the other special cases but does not interact with them.

A long way of saying I agree with you on number 2 :)

For the second question I don't have a strong preference (except that the third option is a bit too creative), so 2-2 is fine with me.

fjuniorr commented 6 months ago

> This is an excellent question - I actually really like the way you have set things up with custom classes for certain attributes where one can easily test for equality. I hope we can extend this to unit conversions, I guess you are thinking the same thing.

I will track the Unit class in https://github.com/fjuniorr/flowmapper/issues/49

> @fjuniorr I hacked together something and it boosted the matching by quite a lot.

I took a stab at 2-2 in https://github.com/fjuniorr/flowmapper/pull/48; however, there were only 8 new mappings. Were you using agribalyse-3.1.1-biosphere and industry-2.0-biosphere, or other datasets?

| Source | SourceFlowName | SourceFlowContext | TargetFlowName | TargetFlowUUID | TargetFlowContext |
| --- | --- | --- | --- | --- | --- |
| agribalyse | Carfentrazone-ethyl | Emissions to soil/agricultural | Carfentrazone ethyl ester | d07867e3-66a8-4454-babd-78dc7f9a21f8 | soil/agricultural |
| agribalyse | Sulfuric acid | Emissions to water/ | Ammonium sulfate | 8570c45a-8c78-4709-9b8f-fb88314d9e9d | water/unspecified |
| agribalyse | Triallate | Emissions to soil/agricultural | Tri-allate | c5c25aa6-d630-40bd-bed7-4e718c877ef4 | soil/agricultural |
| agribalyse | Calcium sulfate | Resources/ | Anhydrite, in ground | 6df9ea09-115a-4678-9f30-d92c877a46ec | natural resource/in ground |
| industry | Triallate | Emissions to soil/agricultural | Tri-allate | c5c25aa6-d630-40bd-bed7-4e718c877ef4 | soil/agricultural |
| industry | Carfentrazone-ethyl | Emissions to soil/agricultural | Carfentrazone ethyl ester | d07867e3-66a8-4454-babd-78dc7f9a21f8 | soil/agricultural |
| industry | Calcium sulfate | Resources/in ground | Anhydrite, in ground | 6df9ea09-115a-4678-9f30-d92c877a46ec | natural resource/in ground |
| industry | Calcium sulfate | Resources/ | Anhydrite, in ground | 6df9ea09-115a-4678-9f30-d92c877a46ec | natural resource/in ground |

fjuniorr / flowmapper

Add matching against synonyms #45