Duplicate CFs for the same flow

cmutel commented 4 years ago

From @romainsacchi. Some flows are matched twice, resulting in too high CFs. For example:

'Chromium VI' (kilogram, None, ('air', 'urban air close to ground')) 142047.56016444156 'Chromium VI' (kilogram, None, ('air', 'urban air close to ground')) 205267.54294881175

romainsacchi commented 4 years ago

After .apply_strategies(), one would expect to see the following for Chromium VI for Terrestrial ecotox (H) in exchanges:

chromium vi ('air', 'urban air close to ground') 142047.56016444156
chromium vi ('air',) 205267.54294881175
chromium vi ('air', 'low population density, long-term') 205267.54294881175
chromium vi ('air', 'non-urban air or from high stacks') 205267.54294881175

but instead we see:

chromium vi ('air', 'urban air close to ground') 142047.56016444156
chromium vi ('air',) 205267.54294881175
chromium vi ('air',) 205267.54294881175
chromium vi ('air',) 205267.54294881175
chromium vi ('air',) 205267.54294881175
chromium vi ('air',) 205267.54294881175
chromium vi ('air', 'low population density, long-term') 205267.54294881175
chromium vi ('air', 'non-urban air or from high stacks') 205267.54294881175

which, once the method is registered, translates into:

'Chromium VI' (kilogram, None, ('air', 'urban air close to ground')) 142047.56016444156
'Chromium VI' (kilogram, None, ('air', 'low population density, long-term')) 205267.54294881175
'Chromium VI' (kilogram, None, ('air', 'lower stratosphere + upper troposphere')) 205267.54294881175
'Chromium VI' (kilogram, None, ('air', 'non-urban air or from high stacks')) 205267.54294881175
'Chromium VI' (kilogram, None, ('air', 'urban air close to ground')) 205267.54294881175
'Chromium VI' (kilogram, None, ('air',)) 205267.54294881175
'Chromium VI' (kilogram, None, ('air', 'low population density, long-term')) 205267.54294881175
'Chromium VI' (kilogram, None, ('air', 'non-urban air or from high stacks')) 205267.54294881175

While there's clearly an issue during .apply_strategies(), it seems that in the presence of the ('air',) compartment, CF for all sub-compartments are created even though we have already existing (and different CF) for them. In this case, even though we have a CF for ('air', 'urban air close to ground'), another CF for ('air', 'urban air close to ground') was created when encountering the ("air",) compartment. That makes me wonder whether we should allow CFs for the general compartment ("air",) or if instead we should detail the sub-compartments.

romainsacchi commented 4 years ago

If instead we change the categories mapping from:

"ruralair": [
            ("air", ),
            ("air", "low population density, long-term"),
            ("air", "non-urban air or from high stacks"),
        ],

to

"ruralair": [
            ("air", "lower stratosphere + upper troposphere"),
            ("air", "low population density, long-term"),
            ("air", "non-urban air or from high stacks"),
        ],

we seem to get a more sensible outcome:

'Chromium VI' (kilogram, None, ('air', 'urban air close to ground')) 142047.56016444156
'Chromium VI' (kilogram, None, ('air', 'lower stratosphere + upper troposphere')) 205267.54294881175
'Chromium VI' (kilogram, None, ('air', 'low population density, long-term')) 205267.54294881175
'Chromium VI' (kilogram, None, ('air', 'non-urban air or from high stacks')) 205267.54294881175

However this cannot be a solution as flows to the ("air",) compartment would not be characterized.

romainsacchi commented 4 years ago

But the issue of duplicates persists.

for meth in methods_mid:
    list_CF=[]
    for m in bw.Method(meth).load():
        code = m[0][1]
        list_CF.append(code)

    if len(list_CF) != len(set(list_CF)):
        print(meth[3], meth[4], len(list_CF), len(set(list_CF)))

returns

Toxicity Carcinogenic 191 145
Toxicity Carcinogenic 276 214
Toxicity Carcinogenic 276 214
Ecotoxicity Terrestrial 802 635
Ecotoxicity Terrestrial 1004 800
Ecotoxicity Terrestrial 1004 800
Ecotoxicity Freshwater 802 635
Ecotoxicity Freshwater 1004 800
Ecotoxicity Freshwater 1004 800
Mineral Resource Scarcity Individualist 119 117
Mineral Resource Scarcity Hierarchist 119 117
Mineral Resource Scarcity Egalitarian 119 117
Toxicity Non-carcinogenic 574 459
Toxicity Non-carcinogenic 574 459
Toxicity Non-carcinogenic 574 459
Ecotoxicity Marine 802 635
Ecotoxicity Marine 1004 800
Ecotoxicity Marine 1004 800

romainsacchi commented 4 years ago

I think the piece of code responsible for the duplicates is this one:

[
                        add_flow(deepcopy(cf), ds)
                        for ds in other_dict[cf["name"].lower()]
                        if category_match(ds, cf)
                    ]

in match_multiple() in matching.py

I tried with category_match_exact() instead, and this solves the duplicates problem, but also leads to some impact categories having no CFs...

romainsacchi commented 4 years ago

I then tried this in matching.py>match_multiple():

if cf["name"].lower() in other_dict:

                flows_to_add = [
                        add_flow(deepcopy(cf), ds)
                        for ds in other_dict[cf["name"].lower()]
                        if category_match(ds, cf)
                    ]
                # make sure we are not adding twice the same flow
                to_add.extend(
                    [f for f in flows_to_add
                     if f["input"] not in [x["input"] for x in to_add]
                     ]
                )

                to_remove.append(cf)

as well as changing the order in:

def set_toxicity_categories(data):
    category_mapping = {
        "agriculturalsoil": [("soil", "agricultural")],
        "freshwater": [("water", "surface water"),
                              ("water", "ground-"),
                              ("water",) # <--------------------- this one placed last
                             ],
        "industrialsoil": [("soil", "industrial")],
        "ruralair": [
            ("air", "lower stratosphere + upper troposphere"),
            ("air", "low population density, long-term"),
            ("air", "non-urban air or from high stacks"),
            ("air",), # <--------------------- this one placed last
        ],
        "seawater": [("water", "ocean")],
        "urbanair": [("air", "urban air close to ground"), ("air", "indoor"),],
    }

to ensure that only flows that were not previously added are effectively added. This seems to improve the result for Terrestrial ecotox. by a lot, as well as carcinogenic human tox. Oddly enough, it worsened the results by a little for Marine ecotox., land use and ozone depletion.

cmutel commented 4 years ago

Edited title to clarify that there are two issues: this one (being discussed by @romainsacchi), and #4.

cmutel commented 4 years ago

The problem here is that match_multiple matches (x,) to (x,) and (x, y), but we are specifying both (x,) and (x, y) as possible categories. Simply need a different matching function - match_multiple should only be used in cases where all air, water, etc. flows should be matched.

romainsacchi commented 4 years ago

Indeed. Maybe (x,) -> (x,y) should be allowed as a last resort, if there was no (x, y) -> (x,y) matches previously found. This is what I tried in my last comment.

cmutel commented 4 years ago

I think this should be fixed with 60525b5802be2008f698d1ecfb1621c55b6a5a29. But #4 is still a problem.

cmutel commented 4 years ago

Should be fixed with recent changes; we now check if multiple instances of the same flow are characterized.

brightway-lca / bw_recipe_2016

Duplicate CFs for the same flow #3