catalyst-cooperative / rmi-ferc1-eia

A collaboration with RMI to integrate FERC Form 1 and EIA CapEx and OpEx reporting
MIT License

investigate the murky wins #31

Closed cmgosnell closed 4 years ago

cmgosnell commented 4 years ago

With the new relabeling of the false granularities, ~20% of the matches are now "murky wins", meaning the best result from the model is not very distinct from the second-best result. I'd like to determine whether 1 IQR is too high a bar for measuring distinctiveness. Are these matches truly murky, or have the average weighted scores of the matches generally narrowed?


cmgosnell commented 4 years ago

Hoookay. I'm about to check in some changes (mostly fixes for bugs I found through this investigation). I've narrowed in on a distinction ratio, used to determine whether a winning match is murky or not, of about 0.2 of the IQR of the possible matches. With a few bug fixes, the results are looking relatively reasonable.
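For anyone following along, here is a rough sketch of how a murky-win check like this could work. It is not the code in the repo: the `score` column, the `flag_murky_wins` function, and the toy data are all made up for illustration, and it assumes the IQR is taken over the scores of all candidate matches for a given `record_id_ferc`. A win is flagged as murky when the gap between the best and second-best score is smaller than the distinction ratio times that IQR.

```python
import numpy as np
import pandas as pd


def flag_murky_wins(candidates: pd.DataFrame, distinction_ratio: float = 0.2) -> pd.DataFrame:
    """Flag winning matches that are not distinct from the runner-up.

    Hypothetical helper: expects one row per candidate match, with a
    `record_id_ferc` column and an assumed `score` column (higher is better).
    """
    rows = []
    for ferc_id, grp in candidates.groupby("record_id_ferc"):
        scores = grp["score"].sort_values(ascending=False).to_numpy()
        # IQR of all candidate scores for this FERC record
        q75, q25 = np.percentile(scores, [75, 25])
        iqr = q75 - q25
        # Gap between the winner and the second-best candidate
        diff = scores[0] - scores[1] if len(scores) > 1 else np.nan
        # A lone candidate has no runner-up, so it is never murky here
        murky = bool(len(scores) > 1 and diff < distinction_ratio * iqr)
        rows.append(
            {"record_id_ferc": ferc_id, "best_score": scores[0],
             "diff": diff, "iqr": iqr, "murky": murky}
        )
    return pd.DataFrame(rows)


# Toy data (invented): three FERC records with four candidate matches each
candidates = pd.DataFrame({
    "record_id_ferc": ["f1"] * 4 + ["f2"] * 4 + ["f3"] * 4,
    "score": [0.9, 0.5, 0.4, 0.1,
              0.80, 0.78, 0.5, 0.2,
              0.9, 0.7, 0.3, 0.1],
})

murky_at_02 = flag_murky_wins(candidates, distinction_ratio=0.2)
murky_at_10 = flag_murky_wins(candidates, distinction_ratio=1.0)
print(murky_at_02[["record_id_ferc", "diff", "iqr", "murky"]])
print("share murky at 0.2:", murky_at_02["murky"].mean())
print("share murky at 1.0:", murky_at_10["murky"].mean())
```

In the toy data, `f3` is flagged as murky at a ratio of 1.0 but not at 0.2, which is the kind of shift to expect when lowering the bar from 1 IQR.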


cataloging murky wins by record_id_ferc

truly ambiguous plants:

solvable w/ a combinatorial record merge

solvable ambiguous plants:

solvable... w/ better capacity allocation across owners:

checked and looks good.. just low diffs

checked and murk resulted in wrong winner.. barely