Switch to more flexible generic matching format

cmutel commented 6 months ago

The values given in https://github.com/fjuniorr/flowmapper/blob/main/flowmapper/constants.py are not universal, nor are they provided in a form which could allow for more specific matches (i.e. change name but only in a specific context or with a specific unit). These matches are also specific to going from SimaPro to ecoinvent, but these are not the only two systems we use.

sp-formatted.json

The attached file takes a different approach: its format and data model should already be familiar to existing tooling. I would like to switch from the hard-coded values and systems to an option in the flowmapper-ci config and a data file like the one attached.

I will iterate on this file once it is added. It can currently replace the MINOR_LAND_NAME_DIFFERENCES_MAPPING, and should eventually replace RANDOM_NAME_DIFFERENCES_MAPPING, NAME_DIFFERENCES_WITH_UNIT_CONVERSION_MAPPING, and MISSING_FOSSIL_AND_BIOGENIC_CARBON_MAPPING.

fjuniorr commented 6 months ago

When there is no context in source and target, we still need equal contexts for a match, correct? But when you add a context to either source or target, should I disregard flowmapper's current context mapping and check the values provided?

PS: So far this is the only entry that has a categories key (is this missing a name in target?):

  {
    "target": {
      "categories": [
        "Emissions to air",
        "low. pop., long-term"
      ]
    },
    "source": {
      "name": "Cesium-134",
      "categories": [
        "Emissions to air",
        "low. pop."
      ],
      "unit": "kBq"
    }
  }

cmutel commented 6 months ago

The cesium example is correct.

These are data transformations, i.e. they should be applied to the Flow object in order to get a match.

cmutel commented 6 months ago

The model I had in mind is: as long as every element in the source matches, you apply the transformation in target. Therefore we don't need a context in cases where the transformation should be applied for every possible context.
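To make the rule concrete, here is a minimal sketch in plain Python. It is not randonneur's actual implementation; `matches` and `apply_transformation` are hypothetical helper names, and it assumes each transformation is a dict with `source` and `target` keys like the cesium entry above.

```python
from copy import deepcopy


def matches(flow: dict, source: dict) -> bool:
    """True when every key given in `source` has the same value in `flow`."""
    return all(flow.get(key) == value for key, value in source.items())


def apply_transformation(flow: dict, transformation: dict) -> dict:
    """Return a copy of `flow`, updated with `target` only if `source` matches."""
    result = deepcopy(flow)
    if matches(flow, transformation["source"]):
        result.update(transformation["target"])
    return result


# The cesium entry from this thread: name, categories and unit must all match.
cesium = {
    "source": {
        "name": "Cesium-134",
        "categories": ["Emissions to air", "low. pop."],
        "unit": "kBq",
    },
    "target": {"categories": ["Emissions to air", "low. pop., long-term"]},
}

flow = {
    "name": "Cesium-134",
    "categories": ["Emissions to air", "low. pop."],
    "unit": "kBq",
}
transformed = apply_transformation(flow, cesium)

# Same name but a different unit: the source does not fully match, so no change.
other_unit = dict(flow, unit="Bq")
untouched = apply_transformation(other_unit, cesium)
```

Because only the keys present in `source` are compared, a transformation that lists just a `name` applies in every context, which is why no context is needed in that case.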

fjuniorr commented 6 months ago

> These are data transformations, ie they should be applied to the Flow object in order to get a match.

I had not fully understood this until now, but it does make a lot of sense.

I've added a first working version in https://github.com/fjuniorr/flowmapper/pull/70/commits/c6481f0f24ebfd0e3804754af24ca5c16f75dcae, but there are some changes that need to be made:

I'm changing the source flows with randonneur before they are initialized, which means I lose access to the original flows and can't write the proper output files (I only realized how big a problem this is while writing this comment).
I'm still not sure how to store the conversion_factor from "Ammonia, as N" (but in a way this confirms that the conversion factor belongs to the Flow, not the Unit).
For now I've added this to the flowmapper map CLI command as a flag --transformations. From within Python we need to do the transformations manually[^1].

[^1]: Something like:

```python
from flowmapper.utils import read_field_mapping, read_flowlist, read_migration_file
from flowmapper.flow import Flow
from flowmapper.flowmap import Flowmap
from randonneur import migrate_datasets

# Load the field mapping, the raw source flows, and the transformation file
fields = read_field_mapping('config/simapro-ecoinvent.py')
source_flows = read_flowlist('data/agribalyse-3.1.1-biosphere.json')
migration_spec = read_migration_file('config/sp-formatted.json')

# Apply the transformations to the raw source flows before creating Flow objects
migrate_datasets(migration_spec, source_flows)
source_flows = [Flow.from_dict(flow, fields['source']) for flow in source_flows]

# Target flows are loaded without any transformation
target_flows = [Flow.from_dict(flow, fields['target']) for flow in read_flowlist('data/ecoinvent-3.7-biosphere.json')]

# Match source flows against target flows and report the results
flowmap = Flowmap(source_flows, target_flows)
flowmap.statistics()
```

fjuniorr commented 6 months ago

@cmutel could you confirm my understanding that if we have a source flow such as

  {
    "name": "Transformation, to water courses, artificial",
    "unit": "m2",
    "categories": [
      "Resources",
      "land"
    ]
  }

and the following transformation to be applied:

  {
    "source": {
      "name": "Transformation, to water bodies, artificial"
    },
    "target": {
      "name": "Transformation, to river, artificial"
    }
  }

The output of Flowmap.to_randonneur should still be the following (i.e. the source name continues to be "Transformation, to water courses, artificial" and not "Transformation, to river, artificial"):

  {
    "source": {
      "name": "Transformation, to water courses, artificial",
      "categories": [
        "Resources",
        "land"
      ],
      "unit": "m2"
    },
    "target": {
      "uuid": "090e9aa9-a9a9-4878-9634-3ad0ba7fbc91",
      "name": "Transformation, to river, artificial",
      "context": "natural resource/land",
      "unit": "m2"
    },
    "conversion_factor": 1.0,
    "comment": "Minor land name differences"
  },

I think this will be a somewhat bigger change, which is why I'm making sure.

cmutel commented 6 months ago

Yes, exactly.

We have a Flow object with the original data stored in raw. We apply transformations (not mapping) for normalizing the raw data, and for things which don't fit into our normal mapping functions. For example, we will have cases which need to be mapped manually. But the resulting output is a mapping, not a transformation, from the original source data to the original target data.
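A rough sketch of that separation, in plain Python rather than flowmapper's real classes: `SketchFlow` and `mapping_entry` are hypothetical names, and the transformation here deliberately uses the source flow's own name so that it matches. The point is only that matching runs on the transformed copy while the output is built from `raw`.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class SketchFlow:
    """Hypothetical stand-in for a Flow: keeps `raw` untouched, transforms `data`."""

    raw: dict
    data: dict = field(init=False)

    def __post_init__(self) -> None:
        self.data = deepcopy(self.raw)

    def transform(self, transformation: dict) -> None:
        # Apply `target` to the working copy only when every `source` key matches.
        source = transformation["source"]
        if all(self.data.get(key) == value for key, value in source.items()):
            self.data.update(transformation["target"])


def mapping_entry(source: SketchFlow, target: SketchFlow, comment: str) -> dict:
    """Build the output mapping from the original (raw) data of both flows."""
    return {
        "source": source.raw,
        "target": target.raw,
        "conversion_factor": 1.0,
        "comment": comment,
    }


source = SketchFlow({
    "name": "Transformation, to water courses, artificial",
    "unit": "m2",
    "categories": ["Resources", "land"],
})
target = SketchFlow({
    "uuid": "090e9aa9-a9a9-4878-9634-3ad0ba7fbc91",
    "name": "Transformation, to river, artificial",
    "context": "natural resource/land",
    "unit": "m2",
})

# Illustrative transformation (assumption: its source name equals the flow's name).
source.transform({
    "source": {"name": "Transformation, to water courses, artificial"},
    "target": {"name": "Transformation, to river, artificial"},
})

# Matching uses the transformed names; the mapping reports the original ones.
names_match = source.data["name"] == target.data["name"]
entry = mapping_entry(source, target, "Minor land name differences")
```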

fjuniorr commented 6 months ago

> I'm changing the source flows with randonneur before they are initialized, which means I lose access to the original flows and can't write the proper output files (only realized how big a problem this is while writing)

I still need to add a couple more tests and clean up the code, but this is working as expected after https://github.com/fjuniorr/flowmapper/pull/70/commits/6b955a29a718e6a6e5128af474e449e9228679e1 in https://github.com/fjuniorr/flowmapper/pull/70. A call from the CLI with multiple data migration files looks like:

flowmapper map data/agribalyse-3.1.1-biosphere.json data/ecoinvent-3.7-biosphere.json \
               --fields config/simapro-ecoinvent.py \
               -t config/transformations.json \
               -t config/sp-formatted.json

Two questions @cmutel:

Can I already remove all the dicts with name differences mappings and we eventually catch up with what is missing in the data migration files?
Can you evaluate how much of a problem the other hard-coded constants (CONTEXT_MAPPING, UNITS_NORMALIZATION and ECOINVENT_UUID_39_310_MAPPING) are? Do you see us moving away from them as well in favor of randonneur data migration files?

cmutel commented 6 months ago

> Can I already remove all the dicts with name differences mappings and we eventually catch up with what is missing in the data migration files?

Sure, I can pick up on the things I missed from source control or the original files I sent you.

> Can you evaluate how much of a problem the other hard-coded constants (CONTEXT_MAPPING, UNITS_NORMALIZATION and ECOINVENT_UUID_39_310_MAPPING) are? Do you see us moving away from them as well in favor of randonneur data migration files?

CONTEXT_MAPPING is specific to SimaPro, so this should be configurable. But I think we can leave this as a builtin.
UNITS_NORMALIZATION is pretty generic. Leave for now.
ECOINVENT_UUID_39_310_MAPPING is very specific - we will need one of these for 3.6, 3.7, 3.8, etc. Should be configurable.

cmutel commented 6 months ago

@fjuniorr Here is a more complete mapping file, constructed manually:

sp-formatted.json

fjuniorr / flowmapper

Switch to more flexible generic matching format #69