Generate unique id for flow class if not present in input flowlist

fjuniorr commented 7 months ago

For reporting purposes (such as #25 and #26) it is very useful to have a unique id for a flow even if one is not present in the original flow list.

My idea is to add a new field id to the Flow class that would be populated either with the provided uuid or with a generated one.
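
A minimal sketch of that idea, assuming a stripped-down stand-in for the Flow class (the `raw` attribute and the `"@id"` lookup key are assumptions for illustration, not the actual flowmapper implementation):

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Flow:
    # Hypothetical minimal stand-in: only the raw input dict and the id are modelled.
    raw: dict
    id: str = field(init=False)

    def __post_init__(self):
        # Prefer an identifier supplied by the flow list (assumed key: "@id").
        provided = self.raw.get("@id")
        if provided:
            self.id = provided
        else:
            # Otherwise derive a reproducible id from the raw flow data.
            flow_str = json.dumps(self.raw, sort_keys=True)
            self.id = hashlib.md5(flow_str.encode("utf-8")).hexdigest()


with_uuid = Flow({"@id": "09db39be-d9a6-4fc3-8d25-1f80b23e9131", "name": "1,4-Butanediol"})
without_uuid = Flow({"name": "1,4-Butanediol", "unit": "kg"})
print(with_uuid.id)     # the provided uuid is kept as-is
print(without_uuid.id)  # a 32-character hex digest
```

Either way, downstream reporting code can rely on `flow.id` always being set.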

There are at least two relevant implementation decisions.

algorithm

My initial idea would be to do something like:

import hashlib
import json

def generate_flow_id(flow: dict) -> str:
    # Serialize with sorted keys so the hash does not depend on key order.
    flow_str = json.dumps(flow, sort_keys=True)
    return hashlib.md5(flow_str.encode('utf-8')).hexdigest()

At least for now I don't see the need to identify flows across flowlists that lack an id, so their properties in principle don't need to be normalized before hashing.
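
To illustrate what skipping normalization means in practice, here is a small check (the example dicts are made up for illustration): key order does not affect the id because of `sort_keys=True`, but any difference in casing or spelling produces a different id.

```python
import hashlib
import json


def generate_flow_id(flow: dict) -> str:
    flow_str = json.dumps(flow, sort_keys=True)
    return hashlib.md5(flow_str.encode("utf-8")).hexdigest()


a = {"name": "1,4-Butanediol", "unit": "kg", "categories": ["Air", "(unspecified)"]}
# Same content as `a`, different key order.
b = {"unit": "kg", "categories": ["Air", "(unspecified)"], "name": "1,4-Butanediol"}
# Same flow conceptually, but with different casing.
c = {"name": "1,4-butanediol", "unit": "kg", "categories": ["air", "(unspecified)"]}

same_despite_key_order = generate_flow_id(a) == generate_flow_id(b)
differs_without_normalization = generate_flow_id(a) != generate_flow_id(c)
print(same_despite_key_order, differs_without_normalization)
```

So the id is stable within one flow list, but it is not meant to identify the same flow across differently formatted lists.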

Some other notable options would be (both do some form of normalization)

generated id in results

Take for example these two matching flows:

sp =  {
    "name": "1,4-Butanediol",
    "categories": [
      "Air",
      "(unspecified)"
    ],
    "unit": "kg",
    "CAS": "000110-63-4"
  }

ei =   {
    "@id": "09db39be-d9a6-4fc3-8d25-1f80b23e9131",
    "@unitId": "487df68b-4994-4027-8fdc-a4dc298257b7",
    "@casNumber": "000110-63-4",
    "name": {
      "@xml:lang": "en",
      "#text": "1,4-Butanediol"
    },
    "unitName": {
      "@xml:lang": "en",
      "#text": "kg"
    },
    "compartment": {
      "@subcompartmentId": "7011f0aa-f5f9-4901-8c10-884ad8296812",
      "compartment": {
        "@xml:lang": "en",
        "#text": "air"
      },
      "subcompartment": {
        "@xml:lang": "en",
        "#text": "unspecified"
      }
    },
    "synonym": {
      "@xml:lang": "en",
      "#text": "Butylene glycol"
    }
  }

Still following the randonneur data migration format, we would not add the generated id to the source, because otherwise randonneur.utils.matcher would not match (i.e. the generated id does not exist in the source dict):

  {
    "source": {
      "name": "1,4-Butanediol",
      "categories": [
        "Air",
        "(unspecified)"
      ]
    },
    "target": {
      "@id": "09db39be-d9a6-4fc3-8d25-1f80b23e9131"
    },
    "conversionFactor": 1,
    "comment": "Identical names"
  }
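
A rough sketch of why an extra generated field would break matching. The `matches` helper below is hypothetical and only mimics the "all provided fields must match" idea; it is not randonneur's actual implementation, and the `"id"` key and its value are placeholders.

```python
def matches(source: dict, flow: dict) -> bool:
    # Hypothetical helper: every field listed in `source` must be present
    # in the flow with an equal value.
    return all(key in flow and flow[key] == value for key, value in source.items())


sp = {
    "name": "1,4-Butanediol",
    "categories": ["Air", "(unspecified)"],
    "unit": "kg",
    "CAS": "000110-63-4",
}

source = {"name": "1,4-Butanediol", "categories": ["Air", "(unspecified)"]}
# Same source fields plus a generated id that the original flow does not carry.
source_with_id = {**source, "id": "generated-id-placeholder"}

print(matches(source, sp))          # True
print(matches(source_with_id, sp))  # False
```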

For reference, in the OpenLCA flow mapping file the FlowMapEntry includes a generated id when a UUID is not provided[^20231126T204229]

[^20231126T204229]: Generated with

```python
import uuid
str(uuid.uuid3(uuid.NAMESPACE_OID, 'flow/air/(unspecified)/1,4-butanediol'))
```

        {
            "from":
            {
                "flow":
                {
                    "@id": "2abe0077-a051-3d5e-b064-c2c6e5f5bf3c",
                    "category": "Air/(unspecified)",
                    "name": "1,4-Butanediol"
                },
                "unit":
                {
                    "@type": "Unit",
                    "@id": "20aadc24-a391-41cf-b340-3e4529f44bde",
                    "name": "kg"
                }
            },
            "to":
            {
                "flow":
                {
                    "@id": "f871bd90-342d-3ccf-8875-b15e396d2488",
                    "category": "emission/air",
                    "name": "1,4-Butanediol"
                },
                "unit":
                {
                    "@type": "Unit",
                    "@id": "20aadc24-a391-41cf-b340-3e4529f44bde",
                    "name": "kg"
                }
            },
            "conversionFactor": 1.0
        }

fjuniorr commented 7 months ago

@cmutel it would be great to hear if you have any preferences in this case.

cmutel commented 7 months ago

For reporting this makes sense and the proposed hash is fine. I really like the current approach without normalization, as it is much clearer and easier to communicate how to use the generated mappings.

I am really reluctant to add a field to the produced mapping which is not in the source data, even if it is reproducible. Everyone using the mapping lists will need to special case the logic for that field, and it is so much easier to just test if all provided fields match. Moreover, I think some people could assume that an id field is definitive and only try to match against that (and not find anything).

So I would prefer to make this a configurable option with a negative default.

fjuniorr commented 7 months ago

I've used generate_flow_id and I agree that the id should be left out of the produced mapping.

I will track the option in https://github.com/fjuniorr/flowmapper/issues/29 and close this to focus on the new match strategies.
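
For the record, the opt-in behaviour discussed above could look roughly like this. It is only a sketch: `mapping_entry`, the `add_generated_id` flag and the `"id"` key are hypothetical names, and the real option is whatever ends up in the tracked issue.

```python
import hashlib
import json


def generate_flow_id(flow: dict) -> str:
    flow_str = json.dumps(flow, sort_keys=True)
    return hashlib.md5(flow_str.encode("utf-8")).hexdigest()


def mapping_entry(source_flow, target_id, comment, conversion_factor=1, add_generated_id=False):
    # Hypothetical builder for one mapping entry in the format shown earlier.
    fields = {"name": source_flow["name"], "categories": source_flow["categories"]}
    if add_generated_id:
        # Off by default, so the source only contains fields present in the input.
        fields["id"] = generate_flow_id(source_flow)
    return {
        "source": fields,
        "target": {"@id": target_id},
        "conversionFactor": conversion_factor,
        "comment": comment,
    }


sp = {"name": "1,4-Butanediol", "categories": ["Air", "(unspecified)"], "unit": "kg"}
default_entry = mapping_entry(sp, "09db39be-d9a6-4fc3-8d25-1f80b23e9131", "Identical names")
opt_in_entry = mapping_entry(
    sp, "09db39be-d9a6-4fc3-8d25-1f80b23e9131", "Identical names", add_generated_id=True
)
print(sorted(default_entry["source"]))  # ['categories', 'name']
print(sorted(opt_in_entry["source"]))   # ['categories', 'id', 'name']
```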

fjuniorr / flowmapper

Generate unique id for flow class if not present in input flowlist #27
