UNEP-Economy-Division / GLAD-ElementaryFlowResources

This is a repository for elementary flow lists, mapping files, and other resources pertinent to the Global LCA Data Access (GLAD) network initiative and data portal. If you want to contribute please contact Jonathas De Mello [jonathas.demello@un.org] and Claudia Giacovelli [claudia.giacovelli@un.org].
https://www.globallcadataaccess.org

Duplicate rows in FEDEFL <--> ecoinvent mapping files #17

Open a-w-beck opened 9 months ago

a-w-beck commented 9 months ago

Both ecoinventEFv3.7-FEDEFLv1.0.3.xlsx and FEDEFLv1.0.3-ecoinventEFv3.7.xlsx contain thousands of duplicate rows, detected using pd.DataFrame.duplicated.
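
For readers unfamiliar with how pd.DataFrame.duplicated counts rows, here is a minimal sketch on a toy table. The column names and values are hypothetical and are not taken from the mapping files. With the default keep='first', the first occurrence of each repeated row is not flagged, so the count is the number of extra copies rather than the number of rows involved.

```python
import pandas as pd

# Toy stand-in for a mapping table; column names and values are hypothetical.
df = pd.DataFrame(
    {
        "SourceFlowName": ["Carbon dioxide", "Carbon dioxide", "Carbon dioxide", "Methane"],
        "TargetFlowName": ["CO2", "CO2", "CO2", "CH4"],
    }
)

# Default keep="first": only the 2nd and 3rd copies are flagged.
n_extra = int(df.duplicated().sum())

# keep=False: every row that belongs to a repeated group is flagged.
n_all = int(df.duplicated(keep=False).sum())

print(n_extra, n_all)
```

The counts reported in this issue use the default behavior, so they are the number of rows that would disappear if each repeated row were reduced to a single copy.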

a-w-beck commented 7 months ago

Made another quick pass at characterizing this issue

Script (stored in repo root dir)

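from pathlib import Path
from pprint import pprint

import pandas as pd

# Repo root (this script is stored there)
path_repo = Path(__file__).parent

# All mapping workbooks in the repo
path_map = path_repo / 'Mapping' / 'Output' / 'Mapped_files'
fpaths_map = list(path_map.glob('*.xlsx'))

# Count rows that exactly repeat an earlier row (first sheet of each file)
dups = {}
for fpath in fpaths_map:
    dups[fpath.name] = int(pd.read_excel(fpath).duplicated().sum())

# A bare `dups` only echoes in an interactive session; print it explicitly
pprint(dups, sort_dicts=False)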

Output

{'ecoinventEFv3.7-FEDEFLv1.0.3.xlsx': 3714,
 'ecoinventEFv3.7-IDEAv2.2.xlsx': 507,
 'ecoinventEFv3.7-IDEAv2.3.xlsx': 607,
 'ecoinventEFv3.7-ILCD-EFv3.0.xlsx': 894,
 'FEDEFLv1.0.3-ecoinventEFv3.7.xlsx': 51140,
 'FEDEFLv1.0.3-IDEAv2.2.xlsx': 8211,
 'FEDEFLv1.0.3-IDEAv2.3.xlsx': 9885,
 'FEDEFLv1.0.3-ILCD-EFv3.0.xlsx': 13799,
 'IDEAv2.2-ecoinventEFv3.7.xlsx': 15,
 'IDEAv2.2-FEDEFLv1.0.3.xlsx': 86,
 'IDEAv2.2-ILCD-EFv3.0.xlsx': 66,
 'IDEAv2.3-ecoinventEFv3.7.xlsx': 82,
 'IDEAv2.3-FEDEFLv1.0.3.xlsx': 182,
 'IDEAv2.3-ILCD-EFv3.0.xlsx': 149,
 'ILCD-EFv3.0-ecoinventEFv3.7.xlsx': 2759,
 'ILCD-EFv3.0-FEDEFLv1.0.3.xlsx': 3680,
 'ILCD-EFv3.0-IDEAv2.2.xlsx': 2609,
 'ILCD-EFv3.0-IDEAv2.3.xlsx': 2871}
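
A possible next step, sketched under assumptions: separating how many rows are repeats from how many distinct rows are being repeated, and previewing a de-duplicated table. The helper name summarize_duplicates and the toy columns are hypothetical. Whether simply dropping the repeats is the right fix for the mapping files is not established here.

```python
import pandas as pd


def summarize_duplicates(df: pd.DataFrame) -> tuple[int, int]:
    """Return (number of repeat rows, number of distinct rows that are repeated)."""
    # Rows that exactly repeat an earlier row (same count as in the script above).
    n_repeats = int(df.duplicated().sum())
    # All rows that belong to a repeated group, reduced to one row per group.
    n_groups = len(df[df.duplicated(keep=False)].drop_duplicates())
    return n_repeats, n_groups


# Toy stand-in for one mapping file; column names are hypothetical.
toy = pd.DataFrame(
    {
        "source": ["a", "a", "a", "b", "c", "c"],
        "target": ["x", "x", "x", "y", "z", "z"],
    }
)

n_repeats, n_groups = summarize_duplicates(toy)

# Keep the first copy of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_repeats, n_groups, len(deduped))
```

For a real file, the same helper could be applied to the result of pd.read_excel(fpath). A high repeat count spread over few groups would suggest a small number of rows copied many times.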

bnjmnmorelli commented 1 month ago

I found 3710 duplicate rows in Excel for the original file mentioned.