catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Replace recordlinkage dependency in FERC to EIA match #2486

Closed katie-lamb closed 7 months ago

katie-lamb commented 1 year ago

The FERC1 to EIA matching module (pudl.analysis.ferc1_eia) uses the recordlinkage package to create feature vectors for comparison. The recordlinkage package hasn't had a release recently, and it seems like it might be less maintained going forward, so it would be good to replace this dependency with something else.

I think this feature creation could be replaced with functionality from splink, sklearn, or a combination of the two. The splink comparison library works best with a splink linker that then does the prediction itself, but it might work to feed the splink comparisons into the sklearn logistic regression model that's currently implemented in the ferc1_eia module. It might just be easier to use sklearn the whole way through.
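To make the "feature vectors for comparison" idea concrete, here is a rough sketch of what a replacement could compute for each candidate FERC/EIA pair. It is not the current ferc1_eia implementation: the column names and helper functions are hypothetical, and the standard library's difflib stands in for whatever string comparison splink or sklearn would end up providing.

```python
from difflib import SequenceMatcher

# Hypothetical column names, for illustration only; not the real PUDL schema.
STRING_COLS = ["plant_name"]
NUMERIC_COLS = ["capacity_mw", "net_generation_mwh"]


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(a: float, b: float) -> float:
    """1.0 for equal values, falling linearly to 0.0 as the relative gap reaches 100%."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / largest)


def feature_vector(ferc_record: dict, eia_record: dict) -> list[float]:
    """Build one comparison vector: string features first, then numeric ones."""
    string_feats = [
        string_similarity(ferc_record[col], eia_record[col]) for col in STRING_COLS
    ]
    numeric_feats = [
        numeric_similarity(ferc_record[col], eia_record[col]) for col in NUMERIC_COLS
    ]
    return string_feats + numeric_feats


# Made-up records: one FERC plant compared against two EIA candidates.
ferc = {"plant_name": "Comanche", "capacity_mw": 1400.0, "net_generation_mwh": 9000.0}
eia_candidates = [
    {"plant_name": "comanche", "capacity_mw": 1400.0, "net_generation_mwh": 8100.0},
    {"plant_name": "Cherokee", "capacity_mw": 700.0, "net_generation_mwh": 3000.0},
]

# One row per candidate pair, one column per compared field.
features = [feature_vector(ferc, eia) for eia in eia_candidates]
```

Each row of features is a plain list of floats, so in principle the rows could be stacked into the X matrix passed to the sklearn LogisticRegression model's fit method, alongside labels from the training matches.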

zaneselvans commented 1 year ago

I don't know if we should necessarily do this. It's just something that worried me a bit while looking into getting Pandas 2.0 working (see #2394 / #2320). All else being equal, the fewer different systems we have doing one job across the project, the easier it'll be to maintain.

If there's a simple drop-in replacement, that's great! If it's going to be more work, or would be a relatively brittle setup, then maybe we should try to work around the dependency issues for the moment.