a-w-beck opened 9 months ago
Made another quick pass at characterizing this issue
from pathlib import Path
import pandas as pd

# count exact duplicate rows in every mapping file under Mapping/Output/Mapped_files
path_repo = Path(__file__).parent
path_map = path_repo / 'Mapping' / 'Output' / 'Mapped_files'
fpaths_map = list(path_map.glob('*.xlsx'))
dups = dict()
for fpath in fpaths_map:
    # duplicated() marks later repeats of a row, so this counts only the redundant copies
    dups[fpath.name] = sum(pd.read_excel(fpath).duplicated())
dups
{'ecoinventEFv3.7-FEDEFLv1.0.3.xlsx': 3714,
'ecoinventEFv3.7-IDEAv2.2.xlsx': 507,
'ecoinventEFv3.7-IDEAv2.3.xlsx': 607,
'ecoinventEFv3.7-ILCD-EFv3.0.xlsx': 894,
'FEDEFLv1.0.3-ecoinventEFv3.7.xlsx': 51140,
'FEDEFLv1.0.3-IDEAv2.2.xlsx': 8211,
'FEDEFLv1.0.3-IDEAv2.3.xlsx': 9885,
'FEDEFLv1.0.3-ILCD-EFv3.0.xlsx': 13799,
'IDEAv2.2-ecoinventEFv3.7.xlsx': 15,
'IDEAv2.2-FEDEFLv1.0.3.xlsx': 86,
'IDEAv2.2-ILCD-EFv3.0.xlsx': 66,
'IDEAv2.3-ecoinventEFv3.7.xlsx': 82,
'IDEAv2.3-FEDEFLv1.0.3.xlsx': 182,
'IDEAv2.3-ILCD-EFv3.0.xlsx': 149,
'ILCD-EFv3.0-ecoinventEFv3.7.xlsx': 2759,
'ILCD-EFv3.0-FEDEFLv1.0.3.xlsx': 3680,
'ILCD-EFv3.0-IDEAv2.2.xlsx': 2609,
'ILCD-EFv3.0-IDEAv2.3.xlsx': 2871}
Checking the original file mentioned in this issue directly in Excel, I found 3710 duplicate rows.
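In case it helps to cross-check the pandas count against what Excel reports, the repeated rows themselves can be pulled out with keep=False so that every occurrence is shown rather than only the later copies. A minimal sketch; the filename is just the one from this issue and the path assumes the same layout as the snippet above:

import pandas as pd

path = 'Mapping/Output/Mapped_files/ecoinventEFv3.7-FEDEFLv1.0.3.xlsx'
df = pd.read_excel(path)
# keep=False flags every occurrence of a repeated row, not only the later copies
repeated = df[df.duplicated(keep=False)]
# sort by all columns so identical rows end up next to each other for inspection
print(repeated.sort_values(list(df.columns)).head(20))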
Both ecoinventEFv3.7-FEDEFLv1.0.3.xlsx and FEDEFLv1.0.3-ecoinventEFv3.7.xlsx contain thousands of duplicate rows, detected with pd.DataFrame.duplicated.
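If simply dropping the redundant rows is the desired fix, a minimal sketch along these lines should work, assuming the duplicates are exact row-level repeats and that rewriting the mapping files in place is acceptable; whether this belongs here or in whatever script generates the mapping files is a separate question for the maintainers:

from pathlib import Path
import pandas as pd

path_map = Path('Mapping') / 'Output' / 'Mapped_files'
for fpath in path_map.glob('*.xlsx'):
    df = pd.read_excel(fpath)
    n_dups = df.duplicated().sum()
    if n_dups:
        # keep the first occurrence of each repeated row and rewrite the file
        df.drop_duplicates().to_excel(fpath, index=False)
        print(f'{fpath.name}: dropped {n_dups} duplicate rows')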