Closed cmutel closed 6 months ago
@fjuniorr I hacked together something and it boosted the matching by quite a lot. I am hoping you can figure out how to deal with `glom` (I tried https://stackoverflow.com/questions/67518774/get-first-item-from-nested-list-in-glom but just skipped `glom` in my test code).
> @fjuniorr I hacked together something and it boosted the matching by quite a lot.
@cmutel did you push this to GitHub? I didn't find it in your fork.
> @cmutel did you push this to GitHub? I didn't find it in your fork.
No, it was a dirty hack which didn't refer to the config. The synonym data in ecoinvent looks like:
"synonym": [
    {
        "@xml:lang": "en",
        "#text": "perfluoromethane, tetrafluoromethane"
    },
    {
        "@xml:lang": "en",
        "#text": "R-14"
    },
    {
        "@xml:lang": "en",
        "#text": "fc-14"
    },
    {
        "@xml:lang": "en",
        "#text": "Methane, tetrafluoro-"
    }
],
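Were you thinking about:
1. evaluating every match rule that uses the flow name with each flow synonym? or
2. creating a single match rule for equality between the original flow name and each flow synonym?
Regarding `glom`, we can extract nested data using a `(path, [subpath])` pattern. For example: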
from glom import glom
flow = {
    "synonym": [
        {"@xml:lang": "en", "#text": "perfluoromethane, tetrafluoromethane"},
        {"@xml:lang": "en", "#text": "R-14"},
        {"@xml:lang": "en", "#text": "fc-14"},
        {"@xml:lang": "en", "#text": "Methane, tetrafluoro-"}
    ]
}
glom(flow, ('synonym', ['#text']))
# ['perfluoromethane, tetrafluoromethane', 'R-14', 'fc-14', 'Methane, tetrafluoro-']
This spec is awkward to store as a string in a toml config file. Some options would be:
1. Use only `synonym` in the config and assume a specific schema for the ecoinvent data
2. Switch from `toml` to using regular `py` files for storing field mapping config
3. Use `ast.literal_eval("('synonym', ['#text'])")` and keep the existing toml config
Overall my suggestion would be the pair 2-2, but it would be nice to hear what you think before coding starts.
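To make the trade-off between the last two options concrete, here is a minimal sketch using only the standard library. The names `raw_spec` and `FIELD_MAPPING` and the `"synonyms"` key are hypothetical and not taken from flowmapper's actual config. It shows that a spec stored as a string has to be parsed before `glom` can use it, while a `py` config file can hold the tuple directly.

```python
import ast

# Keeping toml: the spec can only be stored as a string
# (hypothetical value, as it might be read from the config file).
raw_spec = "('synonym', ['#text'])"

# literal_eval accepts only Python literals (tuples, lists, strings, ...),
# so it does not execute arbitrary expressions found in the config.
spec_from_toml = ast.literal_eval(raw_spec)

# Using a regular .py config file: the spec is written natively.
# FIELD_MAPPING and its key are illustrative assumptions.
FIELD_MAPPING = {"synonyms": ("synonym", ["#text"])}

# Both routes end with the same object, which could then be passed to glom.
same_spec = spec_from_toml == FIELD_MAPPING["synonyms"]
```

Either way the resulting tuple is what would be handed to `glom(flow, spec)`; the difference is only where the parsing step lives.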
> Were you thinking about: evaluating every match rule that uses the flow name with each flow synonym? or creating a single match rule for equality between the original flow name and each flow synonym?
This is an excellent question - I actually really like the way you have set things up with custom classes for certain attributes where one can easily test for equality. I hope we can extend this to unit conversions, I guess you are thinking the same thing.
But the fact that we have to ask this question is bothering me a bit. Basically, I think we should have one generic match rule, and then some exceptions that handle specific cases. The generic rule should handle category/context matching (I think this is good now?) and unit conversions. The special cases are then not really relevant, as they are mostly related to naming, and synonyms are kind of another specific exception, one that sits in parallel with the other special cases without interacting with them.
A long way of saying I agree with you on number 2 :)
For the second question I don't have a strong preference (except that the third option is a bit too creative), so 2-2 is fine with me.
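As a rough illustration of the agreed approach (a single rule testing equality between the source flow name and each target synonym), here is a minimal sketch. The function names, the flat `"synonyms"` key, the placeholder target names, and the case-insensitive comparison are all assumptions made for illustration; this is not flowmapper's actual match rule.

```python
def normalize(name):
    """Lowercase and strip whitespace (an assumed, simplistic normalization)."""
    return name.strip().lower()


def match_by_synonym(source_name, target_flows):
    """Return the target flows that list source_name among their synonyms."""
    wanted = normalize(source_name)
    return [
        target
        for target in target_flows
        if any(normalize(syn) == wanted for syn in target.get("synonyms", []))
    ]


# Synonyms taken from the ecoinvent excerpt in this thread; the target
# names are placeholders, not real ecoinvent flow names.
targets = [
    {"name": "hypothetical target A",
     "synonyms": ["R-14", "fc-14", "Methane, tetrafluoro-"]},
    {"name": "hypothetical target B", "synonyms": []},
]

matches = match_by_synonym("FC-14", targets)
matched_names = [target["name"] for target in matches]
```

Because the rule only compares names, it stays independent of the other naming special cases, which fits the "in parallel" framing above.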
> This is an excellent question - I actually really like the way you have set things up with custom classes for certain attributes where one can easily test for equality. I hope we can extend this to unit conversions, I guess you are thinking the same thing.
I will track down the Unit class in https://github.com/fjuniorr/flowmapper/issues/49
> @fjuniorr I hacked together something and it boosted the matching by quite a lot.
I took a stab at 2-2 in https://github.com/fjuniorr/flowmapper/pull/48; however, there were only 8 new mappings. Were you using `agribalyse-3.1.1-biosphere` and `industry-2.0-biosphere` or other datasets?
Source | SourceFlowName | SourceFlowContext | TargetFlowName | TargetFlowUUID | TargetFlowContext |
---|---|---|---|---|---|
agribalyse | Carfentrazone-ethyl | Emissions to soil/agricultural | Carfentrazone ethyl ester | d07867e3-66a8-4454-babd-78dc7f9a21f8 | soil/agricultural |
agribalyse | Sulfuric acid | Emissions to water/ | Ammonium sulfate | 8570c45a-8c78-4709-9b8f-fb88314d9e9d | water/unspecified |
agribalyse | Triallate | Emissions to soil/agricultural | Tri-allate | c5c25aa6-d630-40bd-bed7-4e718c877ef4 | soil/agricultural |
agribalyse | Calcium sulfate | Resources/ | Anhydrite, in ground | 6df9ea09-115a-4678-9f30-d92c877a46ec | natural resource/in ground |
industry | Triallate | Emissions to soil/agricultural | Tri-allate | c5c25aa6-d630-40bd-bed7-4e718c877ef4 | soil/agricultural |
industry | Carfentrazone-ethyl | Emissions to soil/agricultural | Carfentrazone ethyl ester | d07867e3-66a8-4454-babd-78dc7f9a21f8 | soil/agricultural |
industry | Calcium sulfate | Resources/in ground | Anhydrite, in ground | 6df9ea09-115a-4678-9f30-d92c877a46ec | natural resource/in ground |
industry | Calcium sulfate | Resources/ | Anhydrite, in ground | 6df9ea09-115a-4678-9f30-d92c877a46ec | natural resource/in ground |
In some cases a source or target list has synonyms, and these can really help improve the matching percentages. They are currently not used, and some work is needed (alterations to the config, figuring out how to deal with lists of synonym dicts in `glom`, etc.)