fjuniorr / flowmapper

Mappings between elementary flows

Data transformations vs data mapping #1

Open fjuniorr opened 7 months ago

fjuniorr commented 7 months ago

The randonneur data migration verbs are focused on specifying the changes that need to be made in order to convert a source object into a target object.

In our case the focus is converting one source flow to a target flow that may differ in some characteristics but otherwise represents the same flow[^20231117T094934].

[^20231117T094934]: From Edelen et al. (2017):

> Elementary flows generally need to have a minimum of three components to identify them, but may have more:
> 
> 1. The name of the material, energy, or space (e.g., “Carbon dioxide” or “freshwater”) that will enter or leave the technosphere. This is commonly called “substance” but this term is too limited and the term flowable from the ECO LCA ontology (McBride and Norris 2010) is used by the authors.
> 2. The flow context, which are a set of categories typically describing an environmental context of the flow origin or destination (e.g., “to air”). The name compartment or category is often used for this component, but we used context to provide a broader meaning that includes the flow directionality (e.g. “resource” or “emission”). The categories can be tiered in one or sometimes up to four or five levels.
> 3. A flow unit and its associated flow property (e.g., kg/mass). Flow units may be associated with conversion factors that can be used to convert between different units within a flow property (e.g., kg to lbs.) or even between flow properties (e.g., kg to m3).
>
> Each of these individual flow components may be associated with more information, or metadata, in part dependent on what type of flow they are. For instance: flowables, if chemicals, may have a Chemical Abstracts Service number (CAS No.) and be associated with various other intrinsic properties. Other types of flows, like land occupation or raw energy inputs may not have this additional information. Flows at a minimum should have a flowable, context and unit, and the unique combination of these components may be considered a unique flow, but whether or not it is unique is ultimately determined by the system in which it is used (e.g., LCA software).
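
As a rough illustration of that minimal identity (flowable, context, unit), not tied to any particular schema or library, just how I picture it:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Flow:
    """Minimal flow identity as described by Edelen et al.: flowable + context + unit."""
    flowable: str                 # e.g. "Carbon dioxide"
    context: tuple[str, ...]      # tiered categories, e.g. ("emission", "to air")
    unit: str                     # e.g. "kg"
    cas: Optional[str] = None     # optional metadata, only meaningful for chemicals

co2 = Flow("Carbon dioxide", ("emission", "to air"), "kg", cas="124-38-9")
```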

Take for example this one-to-one mapping[^20231117T090313] (suffix in ground):

[^20231117T090313]: Using this SimaPro Excel sheet represented as a dict and this ecoinvent XML represented as a dict using xmltodict.

```python
# simapro
sp = {
    "baseName": "Zinc",
    "cas": "007440-66-6",
    "description": "Formula: Zn\x7f",
    "geo": None,
    "mainCompartment": "Raw materials",
    "name": "Zinc",
    "sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E",
    "unit": "kg",
}
# ecoinvent
ei = {
    "@casNumber": "7440-66-6",
    "@formula": "Zn",
    "@id": "be73218b-18af-492e-96e6-addd309d1e32",
    "@unitId": "487df68b-4994-4027-8fdc-a4dc298257b7",
    "compartment": {
        "@subcompartmentId": "6a098164-9f04-4f65-8104-ffab7f2677f3",
        "compartment": {"#text": "natural resource", "@xml:lang": "en"},
        "subcompartment": {"#text": "in ground", "@xml:lang": "en"},
    },
    "name": {"#text": "Zinc, in ground", "@xml:lang": "en"},
    "productInformation": None,
    "unitName": {"#text": "kg", "@xml:lang": "en"},
}
```

One way to express the conversion using the randonneur update verb is:

```python
transformation_spec = {
  "update": [
      {
          "source": {
              "sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E",
          },
          "target": {
              "TargetFlowName": "Zinc, in ground",
              "TargetFlowUUID": "be73218b-18af-492e-96e6-addd309d1e32",
              "TargetFlowContext": "natural resource, in ground",
              "TargetUnit": "kg",
              "ConversionFactor": 1,
              "MatchCondition": "=",
              "Mapper": "Chris Mutel",
              "LastUpdated": "2023-01-09 11:52:24",
              "Memo": "Automated match. Notebook: Match - Resources with suffix in ground"
          },
      }
  ]
}
```

Which would generate

```python
{
    "TargetFlowName": "Zinc, in ground",
    "TargetFlowUUID": "be73218b-18af-492e-96e6-addd309d1e32",
    "TargetFlowContext": "natural resource, in ground",
    "TargetUnit": "kg",
    "ConversionFactor": 1,
    "MatchCondition": "=",
    "Mapper": "Chris Mutel",
    "LastUpdated": "2023-01-09 11:52:24",
    "Memo": "Automated match. Notebook: Match - Resources with suffix in ground",
    "baseName": "Zinc",
    "cas": "007440-66-6",
    "description": "Formula: Zn\x7f",
    "geo": None,
    "mainCompartment": "Raw materials",
    "name": "Zinc",
    "sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E",
    "unit": "kg",
}
```
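
To make the semantics explicit, here is a minimal sketch of how such an update spec could be applied (an illustration only, not the actual randonneur implementation, which has richer matching logic):

```python
def apply_update(obj: dict, spec: dict) -> dict:
    """Return a copy of `obj` with the first matching `update` rule applied.

    A rule matches when every key/value pair in its `source` section is present
    in `obj`; the rule's `target` fields are then merged on top. Sketch only.
    """
    for rule in spec.get("update", []):
        if all(obj.get(k) == v for k, v in rule["source"].items()):
            return {**obj, **rule["target"]}
    return obj

mapped = apply_update(sp, transformation_spec)  # produces the merged dict above
```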

I think there are two main downsides to this approach.

Firstly, because we are specifying actual conversion rules, we need to impose a schema on the data. In this example I'm using the GLAD flow mapping format, but the problem persists even if the target flow list is always the same (e.g. ecoinvent), because it will reappear when a client application (i.e. LCA software) expects to consume data in a different format.

Secondly, a conversion that keeps all the metadata from the target flow is verbose because all the fields need to be specified.

A more flexible approach, which avoids imposing a schema on the target flow, is to encode matching information for the target node rather than transformations. Extra metadata for the transformations needed (such as unit conversions) should also be included, but the client application should perform the actual conversion. For example[^20231117T093941]:

[^20231117T093941]: It should be noted that it probably makes more sense to reuse the openLCA schema FlowMap than to create a new format. This means that the input and output flow lists need to be standardized.

```python
mapping_spec = {
  "match": [
    {
      "source": {
          "sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E",
      },
      "target": {
          "@id": "be73218b-18af-492e-96e6-addd309d1e32",
      },
      "conversionFactor": 1,
      "MatchCondition": "=",
      "Mapper": "Chris Mutel",
      "LastUpdated": "2023-01-09 11:52:24",
      "Memo": "Automated match. Notebook: Match - Resources with suffix in ground"      
    }
  ]
}
```
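
Under this approach the client application resolves the target flow itself and decides how to apply the conversion; a rough sketch of what that lookup could look like (function and index names are hypothetical):

```python
def resolve_match(source_flow: dict, spec: dict, target_flows_by_id: dict) -> dict | None:
    """Find the match entry for `source_flow` and return the referenced target flow.

    `target_flows_by_id` is assumed to be the target flow list indexed by "@id".
    Sketch only; the client keeps both objects and applies `conversionFactor` itself.
    """
    for entry in spec["match"]:
        if all(source_flow.get(k) == v for k, v in entry["source"].items()):
            return {
                "source": source_flow,
                "target": target_flows_by_id[entry["target"]["@id"]],
                "conversion_factor": entry.get("conversionFactor", 1),
            }
    return None

resolved = resolve_match(sp, mapping_spec, {ei["@id"]: ei})
```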

In the randonneur examples, its model of making changes to data is not so problematic because the source and target data already share the same schema.

The challenge remains of how to generate the mapping information in a way that is reproducible and inspectable for individuals who don't code.

cmutel commented 7 months ago

@fjuniorr First off, wow! This is an awesome analysis, and it's fantastic to bring in other systems. I feel very lucky that you have agreed to work with us on this.

Using FlowMap might be the right choice. But I would have put some of the examples differently, sorry for not being clearer in my documentation or code.

For me, the beauty of randonneur is that it doesn't impose a schema - rather, it just says that if you start with a set of matching conditions, based on the "natural" serialization of an object, then you should apply some attribute changes to make that object compatible with the second system. So I would have written your example as:

```python
transformation_spec = {
  "update": [
      {
          "source": {
              "sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E",
          },
          "target": {
              "name": "Zinc, in ground",
              "uuid": "be73218b-18af-492e-96e6-addd309d1e32",
              "context": ["natural resource", "in ground"],
          },
      }
  ]
}
```

I would add the mapper, datetime, and memo to a metadata section applicable to a set of transformations.
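
For example, something along these lines (field names are only a suggestion):

```python
transformation_file = {
    "metadata": {
        "mapper": "Chris Mutel",
        "created": "2023-01-09 11:52:24",
        "memo": "Automated match. Notebook: Match - Resources with suffix in ground",
    },
    "update": [
        {
            "source": {"sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E"},
            "target": {
                "name": "Zinc, in ground",
                "uuid": "be73218b-18af-492e-96e6-addd309d1e32",
                "context": ["natural resource", "in ground"],
            },
        }
    ],
}
```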

The unit conversion is trickier; I need more time to think about that.

We can't rely on SimaPro having unique identifiers; we will need to match on name, context, and possibly unit. Here is a typical line we would need to harmonize from a SimaPro CSV export:

```
Emissions to air
Ethane, 1,1,1-trifluoro-, HFC-143a;;g;0,00043524;Undefined;0;0;0;Refrigerant emissions (R-143a, 52% of R-404a), based on PEF guidance document
```
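
For reference, the fields are semicolon-separated and use a decimal comma; a naive split of that line (purely illustrative, and I am assuming the second field is the subcompartment, empty here) gives us the pieces we would match on:

```python
line = ("Ethane, 1,1,1-trifluoro-, HFC-143a;;g;0,00043524;Undefined;0;0;0;"
        "Refrigerant emissions (R-143a, 52% of R-404a), based on PEF guidance document")
name, subcompartment, unit, amount, *rest = line.split(";")
amount = float(amount.replace(",", "."))  # SimaPro exports use a decimal comma
# The top-level context ("Emissions to air") comes from the section header, not the line itself.
```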

Upon some reflection, I think you might be right that there is an implicit schema, but I don't think that is avoidable... Do we have an example of an OpenLCA flow mapping file in the wild?

cmutel commented 7 months ago

@fjuniorr I hope the above didn't sound too negative. I am totally open to having a different workflow for elementary flows than we do for inventory datasets - but my presumption would be against it, as elementary flow mappings seem very similar to me to other types of flow mappings, such as product mappings. For example, we might need to map one megajoule of "natural gas" to one standard cubic meter of "CH4". To me, this feels very similar, and so should have a common approach.
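
In the mapping format above that might look something like this; the conversion factor is purely illustrative, since it depends on the heating value assumed for natural gas (very roughly 36 MJ per standard cubic meter, i.e. about 0.028 m3 per MJ):

```python
product_match = {
    "source": {"name": "natural gas", "unit": "MJ"},
    "target": {"name": "CH4", "unit": "standard cubic meter"},
    "conversionFactor": 0.028,  # m3 per MJ; illustrative value only
}
```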

fjuniorr commented 7 months ago

> @fjuniorr First off, wow! This is an awesome analysis, and it's fantastic to bring in other systems. I feel very lucky that you have agreed to work with us on this.

Thanks @cmutel! The ramp-up costs of switching domains are real, so it's great to hear that I'm not talking nonsense.

> For me, the beauty of randonneur is that it doesn't impose a schema - rather, it just says that if you start with a set of matching conditions, based on the "natural" serialization of an object, then you should apply some attribute changes to make that object compatible with the second system.

The point I was trying to make is that the second system needs to expect to find the information it needs in the fields name, uuid, and context. It also means that if there are key collisions (e.g. in name), the second system can't access the information from the source flow (e.g. to display the old name to the user).
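
To make the collision concrete with the Zinc example (a sketch, reusing only fields from the dicts above):

```python
# The target "name" overwrites the source "name" on update, so the client
# can no longer show the original SimaPro name to the user afterwards.
sp_before = {"name": "Zinc", "sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E", "unit": "kg"}
target = {"name": "Zinc, in ground", "uuid": "be73218b-18af-492e-96e6-addd309d1e32"}
sp_after = {**sp_before, **target}
assert sp_after["name"] == "Zinc, in ground"  # the original "Zinc" is lost
```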

> I would add the mapper, datetime, and memo to a metadata section applicable to a set of transformations.

👍

> We can't rely on SimaPro having unique identifiers; we will need to match on name, context, and possibly unit. Here is a typical line we would need to harmonize from a SimaPro CSV export:

Yeah, I saw your note that PRé have one UUID for flows regardless of the subcategory context values.

I was using their IDs because in SimaProv9.4.csv and SProf94_final_substanceList.xlsx, since there is no subcategory context, they are unique.

__I noticed that in database-1.json we have the subcategory context. Should we work with a "merge" of the two[^20231117T184632]? I ask because I also couldn't find the subcategory context in your sample SimaPro CSV export line.__

[^20231117T184632]: Which is itself a mapping!

> Upon some reflection, I think you might be right that there is an implicit schema, but I don't think that is avoidable...

Eventually there will be a schema! IMHO when is the harder question.

> Do we have an example of an OpenLCA flow mapping file in the wild?

Not yet.

The fedelemflowlist has a notebook to generate the mappings, but it is not working. I will prepare a reprex and submit an issue in their repo to get some help.

> I hope the above didn't sound too negative.

Not at all! :)

> I am totally open to having a different workflow for elementary flows than we do for inventory datasets - but my presumption would be against it, as elementary flow mappings seem very similar to me to other types of flow mappings, such as product mappings. For example, we might need to map one megajoule of "natural gas" to one standard cubic meter of "CH4". To me, this feels very similar, and so should have a common approach.

Makes total sense. I will keep you posted on how the porting of the Jupyter notebooks is going.

cmutel commented 7 months ago

I think we are in a good place here. I will leave the final decision to you - I suspect that implementing these ideas will tell you the best way in any case. As you said, the way that client software works with the mappings is up to each software system.

For me, the most important thing is to have a maintainable, transparent, and reproducible system which can be applied to new lists as they are released, and one which can get buy in from the community.

Pinging @WesIngwersen, @tngTUDOR, @tfardet, @ccomb, @msrocka, @thomassonderegger, @seanjpollard, @jsvgoncalves, @johannesecoinvent; you are in the data war trenches, please feel free to come in with your opinions, or to bring others into the conversation! The context of this discussion is that @fjuniorr is working to rewrite and update simapro_ecoinvent_elementary_flows to have a mapping format and software which can be easily applied to inventory databases. He has a lot of experience in data engineering, including working with Frictionless Data.

> I noticed that in database-1.json we have the subcategory context. Should we work with a "merge" of the two? I ask because I also couldn't find the subcategory context in your sample SimaPro CSV export line.

Yes, the context field is hierarchical, and can vary from one to four or five levels (see comment from Edelen above). At ecoinvent, we have made the decision to actually unite product and ecosphere[^1] flows as we have some substances that can act in the technosphere and the ecosphere (agricultural chemicals, industrial gases), and storing them separately was leading to inconsistencies. A given substance should be treated as a new object if it occurs in a different context.

You can see the available contexts (in one organization's perspective) here: https://glossary.ecoinvent.org/elementary-exchanges/. Of course, these will also evolve over time.

[^1]: We are using ecosphere because for us the biosphere is the part of the ecosphere dealing with living things, but we also characterize abiotic resources like mineral ores, fossil petroleum, etc.

tfardet commented 7 months ago

@fjuniorr this looks great, thanks for that! I'm overall in favor of the 2nd proposal in the initial post.

What follows is just me trying to figure out the update proposal by @cmutel in his reply, so feel free to ignore if it turns out to be nonsense.

I expect that we want, whenever possible, to just match unique IDs and let the target software work with the equivalent objects using its own entries and values, converting only whatever values need to change in the exchanges (e.g. if units vary).

So, whenever possible:

```python
mapping_spec = {
  "match": [
    {
      "source": {
          "sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E",
      },
      "target": {
          "@id": "be73218b-18af-492e-96e6-addd309d1e32",
      },
      "conversionFactor": 1,
      "MatchCondition": "=",
    ...
  ]
}
```

If there are no unique IDs, we supply whatever is necessary to make the match. In the example given by @cmutel, as SimaPro is the one missing the unique ID, I would expect the source entry to contain the additional information needed for the match (name and context, from what I understood), leading to:

```python
mapping_spec = {
  "match": [
    {
      "source": {
          "sumi": "non-unique-simapro-id",
          "name": ... ,
          "context": ... ,
      },
      "target": {
          "@id": "unique-ecoinvent-id",
      },
      "conversionFactor": 1,
      "MatchCondition": "=",
    ...
  ]
}
```

but we don't modify any entry so that the software just works with what it has, as expected.
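
In code, what I have in mind is roughly this (field names made up, just to spell out "convert in the exchanges only"):

```python
def convert_exchange(exchange: dict, match: dict) -> dict:
    """Swap the flow reference and rescale the amount; flow objects stay untouched."""
    return {
        **exchange,
        "flow": match["target"]["@id"],
        "amount": exchange["amount"] * match.get("conversionFactor", 1),
    }

match = {
    "source": {"sumi": "non-unique-simapro-id", "name": "...", "context": "..."},
    "target": {"@id": "unique-ecoinvent-id"},
    "conversionFactor": 1,
}
convert_exchange({"flow": "non-unique-simapro-id", "amount": 0.5}, match)
```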

cmutel commented 7 months ago

> If there are no unique IDs, we supply whatever is necessary to make the match. In the example given by @cmutel, as SimaPro is the one missing the unique ID, I would expect the source entry to contain the additional information needed for the match (name and context, from what I understood)

@tfardet you are totally correct, this is a poor example as the "sumi" is not enough to uniquely identify an object in their ontology.

fjuniorr commented 7 months ago

Agree 100% with what you said @tfardet!

@cmutel at least for now I don't see a problem in switching the current implementation[^20231119T173838] between any of the formats that we discussed[^20231119T172139], so it might come down to the use cases that we see in the future and alignment with fedelemflowlist and openLCA FlowMap.

[^20231119T172139]: With what I believe are some small gotchas such as https://github.com/fjuniorr/flowmapper/issues/13 and https://github.com/fjuniorr/flowmapper/issues/12

[^20231119T173838]: Which is using nested for loops 🙃

I've also created an example project that makes use of https://github.com/fjuniorr/flowmapper/pull/11 to generate mappings from SimaPro 9.4 and ecoinvent 3.7 using the existing logic of simapro_ecoinvent_elementary_flows.

msrocka commented 7 months ago

Interesting discussion; some points regarding the openLCA FlowMap schema:

fjuniorr commented 7 months ago

Thanks for the pointers and PR @msrocka!

fjuniorr commented 7 months ago

> I think we are in a good place here. I will leave the final decision to you - I suspect that implementing these ideas will tell you the best way in any case. As you said, the way that client software works with the mappings is up to each software system.

At least for inspecting the flows that matched, having the usual randonneur format is nicer because I don't have to do so much lookup.

cmutel commented 6 months ago

@fjuniorr I think I understand this better now - before, I was more focused on some specific details but wasn't engaging with the fundamental question of mapping versus transformations. If I understand correctly, we are now focused on mappings across systems. That's fine, but I would like your opinion on how to handle transitive mappings. For example, today I was mapping a flow from SimaPro:

```json
 {
  "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore",
  "categories": [
   "Resources",
   "in ground"
  ],
  "unit": "kg"
 }
```

To ecoinvent version 3.10. This didn't match, because there was a change from ecoinvent version 3.9 to version 3.10; the flow is now:

```json
  {
    "name": "Copper",
    "categories": [
      "natural resource",
      "in ground"
    ],
    "unit": "kilogram",
    "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
    "CAS number": "007440-50-8",
    "synonyms": []
  }
```

How should we get this right? The 3.9 to 3.10, or any set of transformations, should be usable in our generation of mapping files. To me this is blurring the line between transformations and mapping...

fjuniorr commented 6 months ago

@cmutel first off, I think we definitely need transformations eventually. However storing only the mappings makes it easier to create specific transformations for different use cases. The methods Flowmap.to_randonneur and Flowmap.to_glad are good examples of this.

> before, I was more focused on some specific details but wasn't engaging with the fundamental question of mapping versus transformations. If I understand correctly, we are now focused on mappings across systems.

I think we are concentrating on mappings across different systems mainly because:

> That's fine, but I would like your opinion on how to handle transitive mappings. For example [...]

> How should we get this right? The 3.9 to 3.10, or any set of transformations, should be usable in our generation of mapping files. To me this is blurring the line between transformations and mapping...

Let's ensure we're on the same page with your example[^20231213T094212].

The flow "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore" from SimaPro agribalyse-3.1.1-biosphere matched with flow 1aee4aa7-32e0-48e7-a6b5-73d8acf672d3 from ecoinvent-3.7-biosphere by "Resources with suffix in ground".

However, in ecoinvent-3.9-biosphere, the flow 1aee4aa7-32e0-48e7-a6b5-73d8acf672d3 lost its "in ground" suffix but still matched due to identical names. Yet, another flow, "Copper, Cu 5.2E-2%, Pt 4.8E-4%, Pd 2.0E-4%, Rh 2.4E-5%, Ni 3.7E-2% in ore", also matched to the same UUID in ecoinvent-3.7-biosphere due to minor name differences. Since we didn't update the dict with name differences, it didn't match in ecoinvent-3.9-biosphere.

The matches to UUID a9ac40a0-9bea-4c48-afa7-66aa6eb90624 in both agribalyse-3.1.1-biosphere-ecoinvent-3.7-biosphere and agribalyse-3.1.1-biosphere-ecoinvent-3.9-biosphere look fine.

Now, how should we address this? I don't think it's a blurring of lines between transformations and mapping. It's more a result of our current approach, where we don't use information from one mapping (like agribalyse-3.1.1-biosphere-ecoinvent-3.7-biosphere) to inform others (such as agribalyse-3.1.1-biosphere-ecoinvent-3.9-biosphere).

For example, since "Copper, Cu 5.2E-2%, Pt 4.8E-4%, Pd 2.0E-4%, Rh 2.4E-5%, Ni 3.7E-2% in ore" matched UUID 1aee4aa7-32e0-48e7-a6b5-73d8acf672d3 in ecoinvent-3.7, in theory, we have enough information to make it match in ecoinvent-3.9.
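
In code, the kind of reuse I have in mind would look roughly like this (schematic only, using the field names from the footnote below):

```python
def compose(simapro_to_ei37: list[dict], ei37_to_ei39: dict[str, dict]) -> list[dict]:
    """Chain a SimaPro -> ecoinvent 3.7 mapping with a 3.7 -> 3.9 uuid migration.

    `ei37_to_ei39` maps 3.7 uuids to the corresponding 3.9 flow dicts.
    Schematic: entries whose uuid has no 3.9 counterpart are dropped silently.
    """
    composed = []
    for entry in simapro_to_ei37:
        new_target = ei37_to_ei39.get(entry["target"]["uuid"])
        if new_target is not None:
            composed.append({**entry, "target": new_target})
    return composed
```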

I'm not sure if the best approach is to save a state of certified mappings across specific systems and versions, or if we should assume the need to update the match rules to accommodate changes in the flow lists.

[^20231213T094212]: Here's a more complete breakdown. The original flow from SimaPro (agribalyse-3.1.1-biosphere) that you mentioned is:

```json
  {
    "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore",
    "unit": "kg",
    "categories": [
      "Resources",
      "in ground"
    ]
  }
```

The match with `ecoinvent-3.7-biosphere` was:

```json
  {
    "source": {
      "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore",
      "categories": [
        "Resources",
        "in ground"
      ],
      "unit": "kg"
    },
    "target": {
      "uuid": "1aee4aa7-32e0-48e7-a6b5-73d8acf672d3",
      "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore, in ground",
      "context": "natural resource/in ground",
      "unit": "kg"
    },
    "conversion_factor": 1.0,
    "comment": "Resources with suffix in ground"
  }
```

In `ecoinvent-3.9-biosphere` the flow `1aee4aa7-32e0-48e7-a6b5-73d8acf672d3` lost the "in ground" suffix but matched on "Identical names":

```json
  {
    "source": {
      "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore",
      "categories": [
        "Resources",
        "in ground"
      ],
      "unit": "kg"
    },
    "target": {
      "uuid": "1aee4aa7-32e0-48e7-a6b5-73d8acf672d3",
      "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore",
      "context": "natural resource/in ground",
      "unit": "kg"
    },
    "conversion_factor": 1.0,
    "comment": "Identical names"
  }
```

However "Copper, Cu 5.2E-2%, Pt 4.8E-4%, Pd 2.0E-4%, Rh 2.4E-5%, Ni 3.7E-2% in ore" also matched to `1aee4aa7-32e0-48e7-a6b5-73d8acf672d3` in `ecoinvent-3.7-biosphere` because of the dict with name differences: 

```json
  {
    "source": {
      "name": "Copper, Cu 5.2E-2%, Pt 4.8E-4%, Pd 2.0E-4%, Rh 2.4E-5%, Ni 3.7E-2% in ore",
      "categories": [
        "Resources",
        "in ground"
      ],
      "unit": "kg"
    },
    "target": {
      "uuid": "1aee4aa7-32e0-48e7-a6b5-73d8acf672d3",
      "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore, in ground",
      "context": "natural resource/in ground",
      "unit": "kg"
    },
    "conversion_factor": 1.0,
    "comment": "Minor random name differences"
  }
```

Since `1aee4aa7-32e0-48e7-a6b5-73d8acf672d3` lost the "in ground" suffix in `ecoinvent-3.9-biosphere` and we didn't update the dict with name differences it didn't match.

The matches to `a9ac40a0-9bea-4c48-afa7-66aa6eb90624` look fine:

### agribalyse-3.1.1-biosphere-ecoinvent-3.7-biosphere

```json
[
  {
    "source": {
      "name": "Copper",
      "categories": [
        "Resources",
        "in ground"
      ],
      "unit": "kg"
    },
    "target": {
      "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
      "name": "Copper, in ground",
      "context": "natural resource/in ground",
      "unit": "kg"
    },
    "conversion_factor": 1.0,
    "comment": "Resources with suffix in ground"
  },
  {
    "source": {
      "name": "Copper",
      "categories": [
        "Resources",
        ""
      ],
      "unit": "kg"
    },
    "target": {
      "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
      "name": "Copper, in ground",
      "context": "natural resource/in ground",
      "unit": "kg"
    },
    "conversion_factor": 1.0,
    "comment": "Resources with suffix in ground"
  }
]
```

### agribalyse-3.1.1-biosphere-ecoinvent-3.9-biosphere

```json
[
  {
    "source": {
      "name": "Copper",
      "categories": [
        "Resources",
        "in ground"
      ],
      "unit": "kg"
    },
    "target": {
      "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
      "name": "Copper",
      "context": "natural resource/in ground",
      "unit": "kg"
    },
    "conversion_factor": 1.0,
    "comment": "Identical names"
  },
  {
    "source": {
      "name": "Copper",
      "categories": [
        "Resources",
        ""
      ],
      "unit": "kg"
    },
    "target": {
      "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
      "name": "Copper",
      "context": "natural resource/in ground",
      "unit": "kg"
    },
    "conversion_factor": 1.0,
    "comment": "Identical names"
  }
]
```

cmutel commented 6 months ago

OK with the decision that this library is focused on mappings, and that transformations are needed but can be generated from the mappings and are a separate unit of work.

cmutel commented 6 months ago

> if we should assume the need to update the match rules to accommodate changes in the flow lists.

I don't see any way around this. The lists per system change, and we can't rely on things like uuid to be stable or map 1-to-1. For example, the flow "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore", and all copper resource flows, are now just "Copper" in version 3.10 (with a different uuid).

So I think we need to plan for generic transformations, and a config parameter to load specific mapping data based on the input/output combinations.
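
Roughly what I have in mind (file layout and names are just placeholders):

```python
import json
from pathlib import Path

def load_mapping(source_list: str, target_list: str, data_dir: Path = Path("data")) -> list[dict]:
    """Load the mapping data for a specific input/output combination.

    Placeholder sketch: assumes one JSON file per combination, e.g.
    data/simapro-9.4-ecoinvent-3.10.json; the real layout is still open.
    """
    return json.loads((data_dir / f"{source_list}-{target_list}.json").read_text())

mapping = load_mapping("simapro-9.4", "ecoinvent-3.10")
```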

fjuniorr commented 6 months ago

> So I think we need to plan for generic transformations, and a config parameter to load specific mapping data based on the input/output combinations.

It would be great if we could keep the match rules valid and generic for every input/output combination, but I agree that this can get messy or downright impossible.