fjuniorr / flowmapper

Mappings between elementary flows
MIT License

Add matching against synonyms #45

Closed: cmutel closed this issue 6 months ago

cmutel commented 6 months ago

In some cases a source or target list has synonyms, and these can really help improve the matching percentages. They are currently not used, and some work is needed (alterations to the config, figuring out how to deal with lists of synonym dicts in glom, etc.)

cmutel commented 6 months ago

@fjuniorr I hacked together something and it boosted the matching by quite a lot. I am hoping you can figure out how to deal with glom (I tried https://stackoverflow.com/questions/67518774/get-first-item-from-nested-list-in-glom but just skipped glom in my test code).

fjuniorr commented 6 months ago

> @fjuniorr I hacked together something and it boosted the matching by quite a lot.

@cmutel did you push this to GitHub? I didn't find it in your fork.

cmutel commented 6 months ago

> @cmutel did you push this to GitHub? I didn't find it in your fork.

No, it was a dirty hack which didn't refer to the config. The synonym data in ecoinvent looks like:

    "synonym": [
      {
        "@xml:lang": "en",
        "#text": "perfluoromethane, tetrafluoromethane"
      },
      {
        "@xml:lang": "en",
        "#text": "R-14"
      },
      {
        "@xml:lang": "en",
        "#text": "fc-14"
      },
      {
        "@xml:lang": "en",
        "#text": "Methane, tetrafluoro-"
      }
    ],
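
For reference, a minimal sketch of pulling the synonym strings out of this structure with plain Python, i.e. without glom. The `flow` and `synonyms` names are illustrative only, and `flow` simply wraps the fragment above in a dict.

```python
# Wrap the ecoinvent fragment above in a dict; `flow` is an illustrative name.
flow = {
    "synonym": [
        {"@xml:lang": "en", "#text": "perfluoromethane, tetrafluoromethane"},
        {"@xml:lang": "en", "#text": "R-14"},
        {"@xml:lang": "en", "#text": "fc-14"},
        {"@xml:lang": "en", "#text": "Methane, tetrafluoro-"},
    ]
}

# Collect the "#text" value of every synonym entry. A flow without a
# "synonym" key yields an empty list instead of raising KeyError.
synonyms = [entry["#text"] for entry in flow.get("synonym", [])]
print(synonyms)
```

Note that the first entry appears to hold two names separated by a comma, while the last one uses a comma inside a single name, so splitting entries on commas would probably not be safe.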

fjuniorr commented 6 months ago

Were you thinking about:

  1. evaluating every match rule that uses the flow name with each flow synonym? or
  2. creating a single match rule for equality between the original flow name and each flow synonym?
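
To make the second option concrete, here is a rough sketch of a single rule that tests the source flow name for equality against the target name and each target synonym. Everything in it is an assumption for illustration: the `normalize` and `match_name_or_synonym` helpers, the flat `name`/`synonyms` dict layout, and the case-insensitive comparison are not taken from the flowmapper code base.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization, not flowmapper's)."""
    return " ".join(name.lower().split())


def match_name_or_synonym(source: dict, target: dict) -> bool:
    """Return True if the source name equals the target name or any target synonym."""
    source_name = normalize(source["name"])
    candidates = {normalize(target["name"])}
    candidates.update(normalize(s) for s in target.get("synonyms", []))
    return source_name in candidates


# Hypothetical flows; the synonyms are the ones quoted earlier in this thread,
# the target name is a placeholder.
source = {"name": "r-14"}
target = {
    "name": "placeholder target name",
    "synonyms": [
        "perfluoromethane, tetrafluoromethane",
        "R-14",
        "fc-14",
        "Methane, tetrafluoro-",
    ],
}

print(match_name_or_synonym(source, target))
print(match_name_or_synonym({"name": "R-22"}, target))
```

With this shape the synonym check stays a separate rule and does not need to be threaded through the other name-based rules, which is the main difference from option 1.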

Regarding glom we can extract nested data using a (path, [subpath]) pattern. For example:

    from glom import glom

    flow = {
        "synonym": [
            {"@xml:lang": "en", "#text": "perfluoromethane, tetrafluoromethane"},
            {"@xml:lang": "en", "#text": "R-14"},
            {"@xml:lang": "en", "#text": "fc-14"},
            {"@xml:lang": "en", "#text": "Methane, tetrafluoro-"}
        ]
    }
    glom(flow, ('synonym', ['#text']))
    # ['perfluoromethane, tetrafluoromethane', 'R-14', 'fc-14', 'Methane, tetrafluoro-']

This spec is awkward to store as a string in a TOML config file. Some options would be:

  1. Receive only the top-level property synonym in the config and assume a specific schema for the ecoinvent data
  2. Switch from TOML to regular Python files for storing the field mapping config
  3. Use something like ast.literal_eval("('synonym', ['#text'])") and keep the existing TOML config
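
As a rough illustration of what option 3 would involve, the sketch below turns a spec stored as a string back into the tuple that glom expects. The `raw_spec` variable stands in for a value read from the TOML config; how it would actually be named and loaded is an assumption.

```python
import ast

# Hypothetical: the string as it might be stored in a TOML field mapping.
raw_spec = "('synonym', ['#text'])"

# literal_eval only accepts Python literals (strings, numbers, tuples,
# lists, dicts, ...), so the string becomes a tuple without running code.
spec = ast.literal_eval(raw_spec)
print(spec)

# Anything that is not a plain literal, such as a function call, is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError as error:
    print("rejected:", error)
```

The resulting tuple could be passed to glom as in the example above. The limitation is that only literal specs survive this round trip; specs built from objects rather than literals could not be written in the config this way.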

Overall, my suggestion would be option 2 for both questions (2-2), but it would be nice to hear what you think before coding starts.

cmutel commented 6 months ago

> Were you thinking about:
>
>   1. evaluating every match rule that uses the flow name with each flow synonym? or
>   2. creating a single match rule for equality between the original flow name and each flow synonym?

This is an excellent question - I actually really like the way you have set things up with custom classes for certain attributes where one can easily test for equality. I hope we can extend this to unit conversions, I guess you are thinking the same thing.

But the fact that we have to ask this question is bothering me a bit. Basically, I think we should have one generic match rule, and then some exceptions that handle specific cases. The generic rule should handle category/context matching (I think this is good now?) and unit conversions. The special cases are then not really relevant here, as they are mostly related to naming, and synonyms are another special case, one that sits in parallel to the other special cases but does not interact with them.

A long way of saying I agree with you on number 2 :)

For the second question I don't have a strong preference (except that the third option is a bit too creative), so 2-2 is fine with me.

fjuniorr commented 6 months ago

> This is an excellent question - I actually really like the way you have set things up with custom classes for certain attributes where one can easily test for equality. I hope we can extend this to unit conversions, I guess you are thinking the same thing.

I will track down the Unit class in https://github.com/fjuniorr/flowmapper/issues/49

> @fjuniorr I hacked together something and it boosted the matching by quite a lot.

I took a stab at 2-2 in https://github.com/fjuniorr/flowmapper/pull/48, but there were only 8 new mappings. Were you using agribalyse-3.1.1-biosphere and industry-2.0-biosphere, or other datasets?

| Source | SourceFlowName | SourceFlowContext | TargetFlowName | TargetFlowUUID | TargetFlowContext |
|---|---|---|---|---|---|
| agribalyse | Carfentrazone-ethyl | Emissions to soil/agricultural | Carfentrazone ethyl ester | d07867e3-66a8-4454-babd-78dc7f9a21f8 | soil/agricultural |
| agribalyse | Sulfuric acid | Emissions to water/ | Ammonium sulfate | 8570c45a-8c78-4709-9b8f-fb88314d9e9d | water/unspecified |
| agribalyse | Triallate | Emissions to soil/agricultural | Tri-allate | c5c25aa6-d630-40bd-bed7-4e718c877ef4 | soil/agricultural |
| agribalyse | Calcium sulfate | Resources/ | Anhydrite, in ground | 6df9ea09-115a-4678-9f30-d92c877a46ec | natural resource/in ground |
| industry | Triallate | Emissions to soil/agricultural | Tri-allate | c5c25aa6-d630-40bd-bed7-4e718c877ef4 | soil/agricultural |
| industry | Carfentrazone-ethyl | Emissions to soil/agricultural | Carfentrazone ethyl ester | d07867e3-66a8-4454-babd-78dc7f9a21f8 | soil/agricultural |
| industry | Calcium sulfate | Resources/in ground | Anhydrite, in ground | 6df9ea09-115a-4678-9f30-d92c877a46ec | natural resource/in ground |
| industry | Calcium sulfate | Resources/ | Anhydrite, in ground | 6df9ea09-115a-4678-9f30-d92c877a46ec | natural resource/in ground |