fjuniorr opened this issue 1 year ago
@fjuniorr First off, wow! This is an awesome analysis, and it's fantastic to bring in other systems. I feel very lucky that you have agreed to work with us on this.
Using `FlowMap` might be the right choice. But I would have put some of the examples differently; sorry for not being clearer in my documentation or code.
For me, the beauty of `randonneur` is that it doesn't impose a schema - rather, it just says that if you start with a set of matching conditions, based on the "natural" serialization of an object, then you should apply some attribute changes to make that object compatible with the second system. So I would have written your example:
```python
transformation_spec = {
    "update": [
        {
            "source": {
                "sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E",
            },
            "target": {
                "name": "Zinc, in ground",
                "uuid": "be73218b-18af-492e-96e6-addd309d1e32",
                "context": ["natural resource", "in ground"],
            },
        }
    ]
}
```
I would add the `mapper`, `datetime`, and `memo` to a metadata section applicable to a set of transformations.
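As a sketch of what that metadata section could look like (the field layout and values here are illustrative, not a fixed schema):

```python
# Sketch only; the metadata field layout is an assumption, not a fixed schema.
transformation_file = {
    "metadata": {
        "mapper": "flowmapper",                   # tool or person that produced the mappings
        "datetime": "2023-11-17T00:00:00+00:00",  # when the mappings were generated
        "memo": "SimaPro 9.4 -> ecoinvent 3.7 elementary flow mapping",
    },
    "update": [
        # transformation entries as in the example above
    ],
}
```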
The unit conversion is trickier; I need more time to think about that.
We can't rely on SimaPro having unique identifiers; we will need to match to `name`, `context`, and possibly `unit`. Here is a typical line we would need to harmonize from a SimaPro CSV export:
```
Emissions to air
Ethane, 1,1,1-trifluoro-, HFC-143a;;g;0,00043524;Undefined;0;0;0;Refrigerant emissions (R-143a, 52% of R-404a), based on PEF guidance document
```
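As a sketch of how a match might be expressed for that line, assuming `name`, `context`, and `unit` are the matching fields (the structure and values are illustrative):

```python
# Sketch only: matching on name, context, and unit for the SimaPro line above,
# since no unique identifier is available. The target side is left open.
match_condition = {
    "source": {
        "name": "Ethane, 1,1,1-trifluoro-, HFC-143a",
        "context": ["Emissions to air"],
        "unit": "g",
    },
    "target": {
        # attributes of the corresponding flow in the second system
    },
}
```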
Upon some reflection, I think you might be right that there is an implicit schema, but I don't think that is avoidable... Do we have an example of an OpenLCA flow mapping file in the wild?
@fjuniorr I hope the above didn't sound too negative. I am totally open to having a different workflow for elementary flows than we do for inventory datasets - but my presumption would be negative, as the types of elementary flow mappings seem very similar to me to other types of flow mappings, such as products. For example, we might need to map one megajoule of "natural gas" to one standard cubic meter of "CH4". To me, this feels very similar, and so should have a common approach.
> @fjuniorr First off, wow! This is an awesome analysis, and it's fantastic to bring in other systems. I feel very lucky that you have agreed to work with us on this.
Thanks @cmutel! The ramp-up cost of switching domains is real, so it's great to hear that I'm not talking nonsense.
> For me, the beauty of `randonneur` is that it doesn't impose a schema - rather, it just says that if you start with a set of matching conditions, based on the "natural" serialization of an object, then you should apply some attribute changes to make that object compatible with the second system.
The point I was trying to make is that the second system needs to expect to find the information it needs in the fields `name`, `uuid`, and `context`. It also means that if there are name collisions (e.g. in `name`) the second system can't access the information from the source flow (e.g. to display the old name to the user).
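A made-up illustration of the collision problem: two distinct source flows that end up with identical target attributes, so the original names are no longer visible to the consuming system:

```python
# Made-up values, purely to illustrate the collision problem described above.
updates = [
    {
        "source": {"sumi": "simapro-id-1"},
        "target": {"name": "Copper, in ground", "context": ["natural resource", "in ground"]},
    },
    {
        "source": {"sumi": "simapro-id-2"},
        "target": {"name": "Copper, in ground", "context": ["natural resource", "in ground"]},
    },
]
# After applying these updates, both flows carry identical target attributes,
# so the original source names are no longer recoverable by the second system.
```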
> I would add the `mapper`, `datetime`, and `memo` to a metadata section applicable to a set of transformations.
👍
> We can't rely on SimaPro having unique identifiers; we will need to match to `name`, `context`, and possibly `unit`. Here is a typical line we would need to harmonize from a SimaPro CSV export:
Yeah, I saw your note that PRé have one UUID for flows regardless of the subcategory context values.
I was using their id because in SimaProv9.4.csv and SProf94_final_substanceList.xlsx there is no subcategory context, so they are unique.
__I noticed that in database-1.json we have the subcategory context. Should we work with a "merge" of the two[^20231117T184632]? I ask because I also couldn't find the subcategory context in your sample SimaPro CSV export line.__
[^20231117T184632]: Which is itself a mapping!
> Upon some reflection, I think you might be right that there is an implicit schema, but I don't think that is avoidable...
Eventually there will be a schema! IMHO when is the harder question.
> Do we have an example of an OpenLCA flow mapping file in the wild?
Not yet.
Not yet. The `fedelemflowlist` has a notebook to generate the mappings, but it is not working. I will prepare a reprex and submit an issue in their repo to get some help.
> I hope the above didn't sound too negative.
Not at all! :)
> I am totally open to having a different workflow for elementary flows than we do for inventory datasets - but my presumption would be negative, as the types of elementary flow mappings seem very similar to me to other types of flow mappings, such as products. For example, we might need to map one megajoule of "natural gas" to one standard cubic meter of "CH4". To me, this feels very similar, and so should have a common approach.
Makes total sense. I will keep you posted on how the porting of the Jupyter notebooks is going.
I think we are in a good place here. I will leave the final decision to you - I suspect that implementing these ideas will tell you the best way in any case. As you said, the way that client software works with the mappings is up to each software system.
For me, the most important thing is to have a maintainable, transparent, and reproducible system which can be applied to new lists as they are released, and one which can get buy in from the community.
Pinging @WesIngwersen, @tngTUDOR, @tfardet, @ccomb, @msrocka, @thomassonderegger, @seanjpollard, @jsvgoncalves, @johannesecoinvent; you are in the data war trenches, please feel free to come in with your opinions, or to bring others into the conversation! The context of this discussion is that @fjuniorr is working to rewrite and update simapro_ecoinvent_elementary_flows to have a mapping format and software which can be easily applied to inventory databases. He has a lot of experience in data engineering, including working with Frictionless Data.
> I noticed that in database-1.json we have the subcategory context. Should we work with a "merge" of the two? I ask because I also couldn't find the subcategory context in your sample SimaPro CSV export line.
Yes, the `context` field is hierarchical, and can vary from one to four or five levels (see comment from Edelen above). At ecoinvent, we have made the decision to actually unite product and ecosphere[^1] flows as we have some substances that can act in the technosphere and the ecosphere (agricultural chemicals, industrial gases), and storing them separately was leading to inconsistencies. A given substance should be treated as a new object if it occurs in a different context.
You can see the available contexts (in one organization's perspective) here: https://glossary.ecoinvent.org/elementary-exchanges/. Of course, these will also evolve over time.
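For illustration only, context values of different depths can look like this (example values, not an exhaustive list):

```python
# Example context values of varying depth; illustrative only.
contexts = [
    ["air"],
    ["air", "urban air close to ground"],
    ["natural resource", "in ground"],
]
```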
[^1]: We are using ecosphere because for us the biosphere is the part of the ecosphere dealing with living things, but we also characterize abiotic resources like mineral ores, fossil petroleum, etc.
@fjuniorr this looks great, thanks for that! I'm overall in favor of the 2nd proposal in the initial post.
What follows is just me trying to figure out the update proposal by @cmutel in his reply, so feel free to ignore if it turns out to be nonsense.
I expect that we want, whenever possible, to just match unique ids and let the target software work with the equivalent objects with its own entries and values, converting whatever values should be changed in the exchanges only (e.g. if units vary).
So whenever possible:
```python
mapping_spec = {
    "match": [
        {
            "source": {
                "sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E",
            },
            "target": {
                "@id": "be73218b-18af-492e-96e6-addd309d1e32",
            },
            "conversionFactor": 1,
            "MatchCondition": "=",
        },
        ...
    ]
}
```
If there are not unique ids, we supply whatever is necessary to make the match. In the example given by @cmutel, as SimaPro is the one missing the unique ID, I would expect the `source` entry to contain the additional information needed for the match (name and context from what I understood), leading to:
```python
mapping_spec = {
    "match": [
        {
            "source": {
                "sumi": "non-unique-simapro-id",
                "name": ...,
                "context": ...,
            },
            "target": {
                "@id": "unique-ecoinvent-id",
            },
            "conversionFactor": 1,
            "MatchCondition": "=",
        },
        ...
    ]
}
```
but we don't modify any entry so that the software just works with what it has, as expected.
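As a rough sketch of what a consuming application could do with such an entry (the field names follow the `mapping_spec` above; the function itself is illustrative, not an agreed API):

```python
# Sketch only: how a client application might apply one "match" entry to an
# exchange in its own data model. Field names ("sumi", "@id", "conversionFactor")
# follow the mapping_spec above; nothing here is an agreed API.
def apply_match(exchange, match_entries):
    for entry in match_entries:
        if exchange["flow_id"] == entry["source"]["sumi"]:
            exchange["flow_id"] = entry["target"]["@id"]
            exchange["amount"] *= entry["conversionFactor"]
            break
    return exchange  # unmatched exchanges are left untouched
```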
> If there are not unique ids, we supply whatever is necessary to make the match. In the example given by @cmutel, as SimaPro is the one missing the unique ID, I would expect the `source` entry to contain the additional information needed for the match (name and context from what I understood)
@tfardet you are totally correct, this is a poor example as the "sumi" is not enough to uniquely identify an object in their ontology.
Agree 100% with what you said @tfardet!
@cmutel at least for now I don't see a problem in switching the current implementation[^20231119T173838] between any of the formats that we discussed[^20231119T172139], so it might come down to the use cases that we see in the future and alignment with fedelemflowlist and openLCA FlowMap.
[^20231119T172139]: With what I believe are some small gotchas such as https://github.com/fjuniorr/flowmapper/issues/13 and https://github.com/fjuniorr/flowmapper/issues/12
[^20231119T173838]: Which is using nested for loops 🙃
I've also created an example project that makes use of https://github.com/fjuniorr/flowmapper/pull/11 to generate mappings from SimaPro 9.4 and ecoinvent 3.7 using the existing logic of simapro_ecoinvent_elementary_flows.
Interesting discussion; some points regarding the openLCA FlowMap schema:
Thanks for the pointers and PR @msrocka!
> I think we are in a good place here. I will leave the final decision to you - I suspect that implementing these ideas will tell you the best way in any case. As you said, the way that client software works with the mappings is up to each software system.
At least for inspecting the flows that matched, having the usual `randonneur` format is nicer because I don't have to do so much lookup.
@fjuniorr I think I understand this better now - before, I was more focused on some specific details but wasn't engaging with the fundamental question of mapping versus transformations. If I understand correctly, we are now focused on mappings across systems. That's fine, but I would like your opinion on how to handle transitive mappings. For example, today I was mapping a flow from SimaPro:
```json
{
  "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore",
  "categories": [
    "Resources",
    "in ground"
  ],
  "unit": "kg"
}
```
To ecoinvent version 3.10. This didn't match, because there was a change from ecoinvent version 3.9 to version 3.10; the flow is now:
```json
{
  "name": "Copper",
  "categories": [
    "natural resource",
    "in ground"
  ],
  "unit": "kilogram",
  "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
  "CAS number": "007440-50-8",
  "synonyms": []
}
```
How should we get this right? The 3.9 to 3.10, or any set of transformations, should be usable in our generation of mapping files. To me this is blurring the line between transformations and mapping...
@cmutel first off, I think we definitely need transformations eventually. However, storing only the mappings makes it easier to create specific transformations for different use cases. The methods `Flowmap.to_randonneur` and `Flowmap.to_glad` are good examples of this.
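Roughly, the intended usage looks something like the following (the constructor and method signatures are assumptions; only the method names come from the discussion above):

```python
# Rough usage sketch; constructor and method signatures are assumptions.
from flowmapper import Flowmap

source_flows = [...]  # flows parsed from the SimaPro list
target_flows = [...]  # flows parsed from the ecoinvent list

flowmap = Flowmap(source_flows, target_flows)
randonneur_data = flowmap.to_randonneur()  # randonneur-style transformations
glad_table = flowmap.to_glad()             # GLAD flow mapping format
```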
> before, I was more focused on some specific details but wasn't engaging with the fundamental question of mapping versus transformations. If I understand correctly, we are now focused on mappings across systems.
I think we are concentrating on mappings across different systems mainly because:

- these are `simapro -> ecoinvent` mappings, not something like `ecoinvent 3.7 -> ecoinvent 3.8`;
> That's fine, but I would like your opinion on how to handle transitive mappings. For example [...]
>
> How should we get this right? The 3.9 to 3.10, or any set of transformations, should be usable in our generation of mapping files. To me this is blurring the line between transformations and mapping...
Let's ensure we're on the same page with your example[^20231213T094212].
The flow "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore" from SimaPro agribalyse-3.1.1-biosphere
matched with flow 1aee4aa7-32e0-48e7-a6b5-73d8acf672d3
from ecoinvent-3.7-biosphere
by "Resources with suffix in ground".
However, in `ecoinvent-3.9-biosphere`, the flow `1aee4aa7-32e0-48e7-a6b5-73d8acf672d3` lost its "in ground" suffix but still matched due to identical names. Yet another flow, "Copper, Cu 5.2E-2%, Pt 4.8E-4%, Pd 2.0E-4%, Rh 2.4E-5%, Ni 3.7E-2% in ore", also matched to the same UUID in `ecoinvent-3.7-biosphere` due to minor name differences. Since we didn't update the dict with name differences, it didn't match in `ecoinvent-3.9-biosphere`.
The matches to UUID `a9ac40a0-9bea-4c48-afa7-66aa6eb90624` in both `agribalyse-3.1.1-biosphere-ecoinvent-3.7-biosphere` and `agribalyse-3.1.1-biosphere-ecoinvent-3.9-biosphere` look fine.
Now, how should we address this? I don't think it's a blurring of lines between transformations and mapping. It's more a result of our current approach, where we don't use information from one mapping (like `agribalyse-3.1.1-biosphere-ecoinvent-3.7-biosphere`) to inform others (such as `agribalyse-3.1.1-biosphere-ecoinvent-3.9-biosphere`).
For example, since "Copper, Cu 5.2E-2%, Pt 4.8E-4%, Pd 2.0E-4%, Rh 2.4E-5%, Ni 3.7E-2% in ore" matched UUID `1aee4aa7-32e0-48e7-a6b5-73d8acf672d3` in ecoinvent-3.7, in theory we have enough information to make it match in ecoinvent-3.9.
I'm not sure if the best approach is to save a state of certified mappings across specific systems and versions, or if we should assume the need to update the match rules to accommodate changes in the flow lists.
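One possible way to propagate that information, as a sketch only, would be to compose two mapping lists by joining on the intermediate flow. This assumes an `ecoinvent-3.7 -> ecoinvent-3.9` mapping is available with UUIDs on its source side; the entry structure follows the examples in the footnote below.

```python
# Sketch only: combine a simapro -> ecoinvent-3.7 mapping with an assumed
# ecoinvent-3.7 -> ecoinvent-3.9 mapping by joining on the intermediate UUID.
def compose(simapro_to_37, e37_to_39):
    by_37_uuid = {entry["source"]["uuid"]: entry for entry in e37_to_39}
    composed = []
    for entry in simapro_to_37:
        follow_up = by_37_uuid.get(entry["target"]["uuid"])
        if follow_up is not None:
            composed.append({
                "source": entry["source"],
                "target": follow_up["target"],
                "conversion_factor": entry["conversion_factor"] * follow_up["conversion_factor"],
                "comment": f'{entry["comment"]}; {follow_up["comment"]}',
            })
    return composed
```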
[^20231213T094212]: Here's a more complete breakdown. The original flow from SimaPro `agribalyse-3.1.1-biosphere` that you mentioned is:
```json
{
  "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore",
  "unit": "kg",
  "categories": [
    "Resources",
    "in ground"
  ]
}
```
The match with `ecoinvent-3.7-biosphere` was:
```json
{
  "source": {
    "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore",
    "categories": [
      "Resources",
      "in ground"
    ],
    "unit": "kg"
  },
  "target": {
    "uuid": "1aee4aa7-32e0-48e7-a6b5-73d8acf672d3",
    "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore, in ground",
    "context": "natural resource/in ground",
    "unit": "kg"
  },
  "conversion_factor": 1.0,
  "comment": "Resources with suffix in ground"
}
```
In `ecoinvent-3.9-biosphere` the flow `1aee4aa7-32e0-48e7-a6b5-73d8acf672d3` lost the "in ground" suffix but matched on "Identical names":
```json
{
  "source": {
    "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore",
    "categories": [
      "Resources",
      "in ground"
    ],
    "unit": "kg"
  },
  "target": {
    "uuid": "1aee4aa7-32e0-48e7-a6b5-73d8acf672d3",
    "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore",
    "context": "natural resource/in ground",
    "unit": "kg"
  },
  "conversion_factor": 1.0,
  "comment": "Identical names"
}
```
However "Copper, Cu 5.2E-2%, Pt 4.8E-4%, Pd 2.0E-4%, Rh 2.4E-5%, Ni 3.7E-2% in ore" also matched to `1aee4aa7-32e0-48e7-a6b5-73d8acf672d3` in `ecoinvent-3.7-biosphere` because of the dict with name differences:
```json
{
  "source": {
    "name": "Copper, Cu 5.2E-2%, Pt 4.8E-4%, Pd 2.0E-4%, Rh 2.4E-5%, Ni 3.7E-2% in ore",
    "categories": [
      "Resources",
      "in ground"
    ],
    "unit": "kg"
  },
  "target": {
    "uuid": "1aee4aa7-32e0-48e7-a6b5-73d8acf672d3",
    "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore, in ground",
    "context": "natural resource/in ground",
    "unit": "kg"
  },
  "conversion_factor": 1.0,
  "comment": "Minor random name differences"
}
```
Since `1aee4aa7-32e0-48e7-a6b5-73d8acf672d3` lost the "in ground" suffix in `ecoinvent-3.9-biosphere` and we didn't update the dict with name differences, it didn't match.
The matches to `a9ac40a0-9bea-4c48-afa7-66aa6eb90624` look fine:
### agribalyse-3.1.1-biosphere-ecoinvent-3.7-biosphere
```json
[
  {
    "source": {
      "name": "Copper",
      "categories": [
        "Resources",
        "in ground"
      ],
      "unit": "kg"
    },
    "target": {
      "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
      "name": "Copper, in ground",
      "context": "natural resource/in ground",
      "unit": "kg"
    },
    "conversion_factor": 1.0,
    "comment": "Resources with suffix in ground"
  },
  {
    "source": {
      "name": "Copper",
      "categories": [
        "Resources",
        ""
      ],
      "unit": "kg"
    },
    "target": {
      "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
      "name": "Copper, in ground",
      "context": "natural resource/in ground",
      "unit": "kg"
    },
    "conversion_factor": 1.0,
    "comment": "Resources with suffix in ground"
  }
]
```
### agribalyse-3.1.1-biosphere-ecoinvent-3.9-biosphere
```json
[
  {
    "source": {
      "name": "Copper",
      "categories": [
        "Resources",
        "in ground"
      ],
      "unit": "kg"
    },
    "target": {
      "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
      "name": "Copper",
      "context": "natural resource/in ground",
      "unit": "kg"
    },
    "conversion_factor": 1.0,
    "comment": "Identical names"
  },
  {
    "source": {
      "name": "Copper",
      "categories": [
        "Resources",
        ""
      ],
      "unit": "kg"
    },
    "target": {
      "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
      "name": "Copper",
      "context": "natural resource/in ground",
      "unit": "kg"
    },
    "conversion_factor": 1.0,
    "comment": "Identical names"
  }
]
```
OK with the decision that this library is focused on mappings, and that transformations are needed but can be generated from the mappings and are a separate unit of work.
> if we should assume the need to update the match rules to accommodate changes in the flow lists.
I don't see any way around this. The lists per system change, and we can't rely on things like `uuid` to be stable or map 1-to-1. For example, the flow "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore", and all copper resource flows, are now just "Copper" in version 3.10 (with a different `uuid`).
So I think we need to plan for generic transformations, and a config parameter to load specific mapping data based on the input/output combinations.
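As a sketch of what that config parameter could look like (the registry and file names below are illustrative):

```python
# Sketch only: load extra, combination-specific mapping data on top of the
# generic transformations. The registry and file names are illustrative.
import json

MAPPING_OVERRIDES = {
    ("SimaPro-9.4", "ecoinvent-3.9"): "data/simapro-9.4-ecoinvent-3.9.json",
    ("SimaPro-9.4", "ecoinvent-3.10"): "data/simapro-9.4-ecoinvent-3.10.json",
}

def load_overrides(source_list, target_list):
    path = MAPPING_OVERRIDES.get((source_list, target_list))
    if path is None:
        return []  # no combination-specific data; rely on the generic rules only
    with open(path) as fp:
        return json.load(fp)
```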
> So I think we need to plan for generic transformations, and a config parameter to load specific mapping data based on the input/output combinations.
It would be great if we could keep the match rules valid and generic for every input/output combination, but I agree that this can get messy or downright impossible.
The `randonneur` data migration verbs are focused on specifying the changes that need to be made in order to convert a source object into a target object. In our case the focus is converting one source flow to a target flow that may differ in some characteristics but otherwise represents the same flow[^20231117T094934].
[^20231117T094934]: From (Edelen et al., 2017)
Take for example this one-to-one mapping[^20231117T090313] (suffix in ground):
[^20231117T090313]: Using this SimaPro excel sheet represented as a dict and this ecoinvent xml represented as a dict using `xmltodict`.

One way to express the conversion using the `randonneur` `update` verb is sketched below; applying it would generate the converted flow.
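An illustrative sketch of such an `update` entry, with made-up source attributes and target attributes reusing the zinc flow from the example above (this is not the original example, just a sketch of the structure):

```python
# Illustrative sketch only (not the original example): a randonneur-style "update"
# entry whose source attributes are made up and whose target attributes reuse the
# zinc flow shown in the example above.
update_spec = {
    "update": [
        {
            "source": {
                "name": "Zinc",  # assumed SimaPro name and categories
                "categories": ["Resources", "in ground"],
                "unit": "kg",
            },
            "target": {
                "name": "Zinc, in ground",
                "uuid": "be73218b-18af-492e-96e6-addd309d1e32",
                "context": ["natural resource", "in ground"],
                "unit": "kg",
            },
        }
    ]
}
```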
I think there are two main downsides to this approach.
Firstly, because we are specifying actual conversion rules, we need to impose a schema on the data. In this example I'm using the GLAD flow mapping format, but the problem persists even if the target flow list is always the same (e.g. ecoinvent), because the problem will appear when a client application (i.e. LCA software) expects to consume data in a different format.
Secondly, a conversion that keeps all the metadata from the target flow is verbose because all the fields need to be specified.
A more flexible approach, so that we don't need to impose a schema on the target flow, is to encode matching information on the target node rather than transformations. Extra metadata for the transformations needed (such as unit conversions) should also be added, but the client application should do the actual conversion. For example[^20231117T093941]:
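A minimal sketch of such an entry, again reusing the zinc flow (field names are illustrative, not a fixed schema; see also the footnote on reusing the openLCA FlowMap schema):

```python
# Sketch only: the entry encodes matching information plus conversion metadata;
# the client application performs the actual transformation.
mapping_entry = {
    "source": {
        "name": "Zinc",  # assumed SimaPro name and context
        "context": ["Resources", "in ground"],
        "unit": "kg",
    },
    "target": {
        "uuid": "be73218b-18af-492e-96e6-addd309d1e32",  # ecoinvent "Zinc, in ground"
        "unit": "kg",
    },
    "conversion_factor": 1.0,
}
```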
[^20231117T093941]: It should be noted that it probably makes more sense to reuse the openLCA schema FlowMap than to create a new format. This means that the input and output flow lists need to be standardized.
In the `randonneur` examples, its model of making changes to data is not so problematic because both source and target data already share the same schema.
The challenge remains of how to generate the mapping information in a way that is reproducible and inspectable for individuals who don't code.