Open ggael opened 2 years ago
Some updates: merge_csv.py now also prints a summary report like this:
PYTHONPATH=. python tools/merge_csv.py boavizta-data-us.csv dell.csv -o /dev/null
------------------------------------------------------------
| Summary report |
------------------------------------------------------------
Number of singletons: 1235, 26
Number of self duplicates: 174, 2
Number of clean fusions: 455
Number of mixed fusions: 42
Number of attributes gathered from the oldest data: 122
------------------------------------------------------------
which is handy to quickly see whether there are any issues. For instance, this report means that 1235 items of boavizta-data-us.csv are not present in dell.csv, and 26 items are present in dell.csv but not in the current db. The current db contains 174 items having one (or more) duplicates (*). Among the items that are in both files, 455 are fully covered by dell.csv, but for 42 items we found attributes in boavizta-data-us.csv that are not present in dell.csv.
(*) So far duplicates are detected solely based on the model name. This implies some false positives.
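To make the report categories concrete, here is a minimal sketch of how such counts could be derived from two lists of rows. It is not the actual merge_csv.py code: the function name summarize, the column names (name, gwp, weight), and the reading of "self duplicates" as "model names appearing more than once in a file" are all assumptions made for illustration.

```python
from collections import Counter

EMPTY = ("", None)  # assumption: these values mean "attribute not provided"


def summarize(current_rows, new_rows, key="name"):
    """Count the report categories for two lists of dict rows.

    Hypothetical helper: rows are matched solely on `key` (the model name),
    which is why false positives are possible.
    """
    current_counts = Counter(row[key] for row in current_rows)
    new_counts = Counter(row[key] for row in new_rows)

    # Keep one row per model name (the last one wins in this sketch).
    current_by_name = {row[key]: row for row in current_rows}
    new_by_name = {row[key]: row for row in new_rows}

    clean = mixed = 0
    for name in current_by_name.keys() & new_by_name.keys():
        old, new = current_by_name[name], new_by_name[name]
        # Attributes that only the older file provides for this item.
        only_in_old = [k for k, v in old.items()
                       if v not in EMPTY and new.get(k) in EMPTY]
        if only_in_old:
            mixed += 1
        else:
            clean += 1

    return {
        "singletons": (len(current_by_name.keys() - new_by_name.keys()),
                       len(new_by_name.keys() - current_by_name.keys())),
        "self_duplicates": (sum(1 for c in current_counts.values() if c > 1),
                            sum(1 for c in new_counts.values() if c > 1)),
        "clean_fusions": clean,
        "mixed_fusions": mixed,
    }
```

With rows loaded through csv.DictReader, summarize(current_rows, new_rows) would return one pair per two-number line of the report and a single count for each fusion line.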
We need a dedicated tool to merge multiple .csv files while detecting and merging duplicates.
I've started to implement it through a new static method of DeviceCarbonFootprint, and a merge_csv.py file1 file2 standalone script written on top of the above merge function.
By default, priority is given to device2/file2.
Conflicts are detected only for attributes that are provided for both devices and when they are clearly different. If they are close enough, then merge only prints a warning in verbose mode.
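The priority and conflict rules described above can be sketched as follows. This is only an illustration on plain dicts, not the DeviceCarbonFootprint method itself: the function name merge_devices, the rel_tol threshold used to decide "close enough", and the returned list of conflicting attributes are assumptions, and the conflict-resolution modes are not implemented here.

```python
import math

EMPTY = ("", None)  # assumption: these values mean "attribute not provided"


def _close_enough(a, b, rel_tol):
    """Return True when both values are numeric and within rel_tol."""
    try:
        return math.isclose(float(a), float(b), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return False  # non-numeric values are never "close" in this sketch


def merge_devices(device1, device2, rel_tol=0.01, verbose=False):
    """Merge two attribute dicts, giving priority to device2.

    Returns (merged, conflicts), where conflicts lists the attributes
    provided by both devices with clearly different values.
    """
    merged, conflicts = {}, []
    # dict.fromkeys keeps a stable attribute order across both inputs.
    for key in dict.fromkeys([*device1, *device2]):
        v1, v2 = device1.get(key), device2.get(key)
        if v1 in EMPTY:
            merged[key] = v2
        elif v2 in EMPTY:
            merged[key] = v1
        else:
            merged[key] = v2  # both provided: device2 wins by default
            if v1 != v2:
                if _close_enough(v1, v2, rel_tol):
                    if verbose:
                        print(f"warning: {key}: {v1!r} vs {v2!r} (close)")
                else:
                    conflicts.append(key)
    return merged, conflicts
```

Returning the conflicts separately leaves the choice of resolution strategy to the caller.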
Then, there are two modes to resolve the conflicts:
TODO: