catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Replace all string code values from EIA923 and EIA860 with full names #869

Closed aesharpe closed 1 year ago

aesharpe commented 3 years ago

Many of the datasets we work with contain columns populated with string codes whose meaning can only be deciphered by reading a separate layout file. EX: C = Contract or NC = New Contract. Currently, the ETL outputs static csv files that contain the string code to full name and definition mapping, but this information is not readily available to those strictly accessing data through the SQL database or pandas output tables. This issue aims to eliminate all string codes in the database and replace them with their full names.

The constants file contains a number of dictionaries that already map the string codes to a few sentences of definition. EX: {'C': 'Contract -- Fuel received under a purchase order....'}

I propose that:

zaneselvans commented 3 years ago

In the service of cleaning out constants.py and getting constants stored closer to where they're used, do the original string codes need to be stored anywhere other than the transform module that they are used in? For metadata purposes, it seems like we could include only the full name and the long description, and ideally in our big metadata collection, have the explicit mapping from full name to description enumerated, and use that to construct the column-level ENUM and descriptive metadata programmatically.

zaneselvans commented 1 year ago

The use of coding tables to document the short codes and enforce a set of allowable values is taking care of this now.