Boavizta / environmental-footprint-data

💾 Boavizta.org Data repository

Add a script to automatically merge multiple .csv files and deal with duplicates #65

Open ggael opened 2 years ago

ggael commented 2 years ago

We need a dedicated tool to merge multiple .csv files while detecting and merging duplicates.

I've started to implement it through a new static method of DeviceCarbonFootprint:

    @staticmethod
    def merge(device1: 'DeviceCarbonFootprint', device2: 'DeviceCarbonFootprint',
              conflict: Literal['keep2nd','interactive'] = 'keep2nd', verbose: bool = False) -> 'DeviceCarbonFootprint':

and a standalone script, merge_csv.py file1 file2, written on top of the above merge function.

By default, priority is given to device2/file2.

Conflicts are detected only for attributes that are provided for both devices and when they are clearly different. If they are close enough, then merge only prints a warning in verbose mode.
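To make the detection rule concrete, here is a minimal sketch, not the actual implementation. It assumes devices are plain dicts of attributes, and that "close enough" means a relative difference below 5% for numeric values. Both the `find_conflicts` helper and the `rel_tol` threshold are hypothetical.

```python
import math
from typing import Any, Dict, List, Tuple


def find_conflicts(attrs1: Dict[str, Any], attrs2: Dict[str, Any],
                   rel_tol: float = 0.05) -> Tuple[List[str], List[str]]:
    """Return (conflicts, warnings) as sorted lists of attribute names.

    Only attributes provided (not None) for both devices are compared.
    rel_tol is an assumed threshold, not a value taken from the real tool.
    """
    conflicts, warnings = [], []
    for key in attrs1.keys() & attrs2.keys():
        v1, v2 = attrs1[key], attrs2[key]
        if v1 is None or v2 is None or v1 == v2:
            continue  # missing on one side, or identical: nothing to report
        numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                      for v in (v1, v2))
        if numeric and math.isclose(v1, v2, rel_tol=rel_tol):
            warnings.append(key)   # close enough: warning only
        else:
            conflicts.append(key)  # clearly different: real conflict
    return sorted(conflicts), sorted(warnings)


device1 = {"gwp_total": 300.0, "weight": 2.0, "manufacturer": "Dell", "lifetime": None}
device2 = {"gwp_total": 305.0, "weight": 3.5, "manufacturer": "Dell", "screen_size": 14}
conflicts, warnings = find_conflicts(device1, device2)
```

With these made-up values, `weight` is reported as a conflict and `gwp_total` only as a warning.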

Then, there are two modes to resolve the conflicts:

  1. Simply keep device2 (and print the differences in verbose mode)
  2. Ask the user which version should be kept.
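The two modes could look roughly like the sketch below. The `resolve` helper and its `ask` parameter are hypothetical, introduced only to illustrate the 'keep2nd' and 'interactive' values of the `conflict` argument; the real merge function may be organized differently.

```python
from typing import Any, Callable, Literal


def resolve(key: str, v1: Any, v2: Any,
            conflict: Literal['keep2nd', 'interactive'] = 'keep2nd',
            verbose: bool = False,
            ask: Callable[[str], str] = input) -> Any:
    """Pick the value to keep for one conflicting attribute."""
    if conflict == 'keep2nd':
        if verbose:
            print(f"{key}: {v1!r} != {v2!r}, keeping {v2!r}")
        return v2
    if conflict == 'interactive':
        # Any answer other than '1' keeps the second value (the default priority).
        answer = ask(f"{key}: keep [1] {v1!r} or [2] {v2!r}? ").strip()
        return v1 if answer == '1' else v2
    raise ValueError(f"unknown conflict mode: {conflict!r}")


kept_default = resolve("weight", 2.0, 3.5)
kept_first = resolve("weight", 2.0, 3.5, conflict='interactive', ask=lambda prompt: "1")
```

Passing `ask` as a parameter is only there so that the interactive path can be exercised without a terminal.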

TODO:

  1. Add a non-regression mode only testing that device2 is consistent with device1 and that device1 does not contain more information.
  2. Clean up and unify some entries prior to fusion to avoid false negatives (e.g., CN versus China, issue #64)
  3. Find a way to deal with PCF files that report the same model name for devices that are not actually the same (in ecodiag I also extract the model name from the main html files)
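For TODO item 2, one possible approach is a small alias table applied before comparing entries. This is only a sketch: the `COUNTRY_ALIASES` table and the `normalize_country` helper are hypothetical, and the aliases listed are examples rather than values taken from the data.

```python
# Hypothetical alias table: lowercase spelling -> canonical name.
COUNTRY_ALIASES = {
    "cn": "China",
    "china": "China",
    "us": "USA",
    "usa": "USA",
    "united states": "USA",
}


def normalize_country(value: str) -> str:
    """Map known spellings to one canonical name; leave unknown values as-is."""
    stripped = value.strip()
    return COUNTRY_ALIASES.get(stripped.lower(), stripped)
```

With such a step, "CN" and "China" would compare equal and no longer count as clearly different.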

ggael commented 2 years ago

Some updates: merge_csv.py now also prints a summary report like this:

PYTHONPATH=. python tools/merge_csv.py boavizta-data-us.csv dell.csv  -o /dev/null

------------------------------------------------------------
| Summary report                                           |
------------------------------------------------------------
Number of singletons: 1235, 26
Number of self duplicates: 174, 2
Number of clean fusions: 455
Number of mixed fusions: 42
Number of attributes gathered from the oldest data: 122
------------------------------------------------------------

which is handy to quickly see if there are any issues. For instance, this report means that 1235 items of boavizta-data-us.csv are not present in dell.csv, and 26 items are present in dell.csv but not in the current db. The current db contains 174 items having one (or more) duplicates (*). Among the items that are in both files, 455 are fully covered by dell.csv, but for 42 items we found attributes in boavizta-data-us.csv that are not present in dell.csv.

(*) So far duplicates are detected solely based on the model name. This implies some false positives.
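As a rough illustration of how such counters can be derived, here is a sketch under strong assumptions: rows are plain dicts with a hypothetical "name" key, duplicates are matched on that name only, and a fusion counts as "mixed" when the first file has at least one attribute missing from the second. The real merge_csv.py may count things differently.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def _by_name(rows: List[Row]) -> Dict[str, Row]:
    """Group rows by model name, merging self duplicates and dropping empty values."""
    merged: Dict[str, Row] = {}
    for row in rows:
        filled = {k: v for k, v in row.items() if v not in (None, "")}
        merged.setdefault(row["name"], {}).update(filled)
    return merged


def summarize(rows1: List[Row], rows2: List[Row]) -> Dict[str, Any]:
    counts1 = Counter(r["name"] for r in rows1)
    counts2 = Counter(r["name"] for r in rows2)
    db1, db2 = _by_name(rows1), _by_name(rows2)
    common = db1.keys() & db2.keys()
    # Mixed fusion: file1 provides at least one attribute that file2 lacks.
    mixed = sum(1 for name in common if db1[name].keys() - db2[name].keys())
    return {
        "singletons": (len(db1.keys() - db2.keys()), len(db2.keys() - db1.keys())),
        "self_duplicates": (sum(n > 1 for n in counts1.values()),
                            sum(n > 1 for n in counts2.values())),
        "clean_fusions": len(common) - mixed,
        "mixed_fusions": mixed,
    }


rows1 = [
    {"name": "A", "gwp": 100, "weight": 2.0},
    {"name": "A", "gwp": 100},
    {"name": "B", "gwp": 200, "weight": 1.5},
    {"name": "C", "gwp": 300},
]
rows2 = [
    {"name": "B", "gwp": 210},
    {"name": "C", "gwp": 300, "weight": 3.0},
    {"name": "D", "gwp": 50},
]
report = summarize(rows1, rows2)
```

In this toy data, "A" is a singleton and a self duplicate in the first file, "D" is a singleton in the second, "C" is a clean fusion, and "B" is a mixed fusion because its weight is only known from the first file.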