Annotate remaining database fields that lack metadata

catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.

MIT License


The following tables and fields currently lack metadata. The abbr ones should probably be turned into readable descriptive tags that are part of an ENUM (containing whatever is in the value field that they are referring to), and ultimately stripped out of the database structure entirely.

boiler_generator_assn_eia860
- report_date = Column(Date, nullable=False)
- generator_id = Column(String)
- boiler_id = Column(String)
- unit_id_eia = Column(String)
- unit_id_pudl = Column(Integer, nullable=False)
- bga_source = Column(String)
fuel_type_eia923
- abbr = Column(String, primary_key=True)
- fuel_type = Column(String, nullable=False)
fuel_type_aer_eia923
- abbr = Column(String, primary_key=True)
- fuel_type = Column(String, nullable=False)
prime_movers_eia923
- abbr = Column(String, primary_key=True)
- prime_mover = Column(String, nullable=False)
energy_source_eia923
- abbr = Column(String, primary_key=True)
- source = Column(String, nullable=False)
natural_gas_transport_eia923
- abbr = Column(String, primary_key=True)
- status = Column(String, nullable=False)
transport_modes_eia923
- abbr = Column(String, primary_key=True)
- mode = Column(String, nullable=False)
fuel_receipts_costs_eia923
- moisture_content_pct = Column(Float)
- chlorine_content_ppm = Column(Float)
Pretty much the entire entities.py file

The whole reason these little 2-column abbr tables exist is to map an abbreviation to a full name. But that's dumb. We should just store the full, human-readable name and do away with the abbreviation entirely. The tables then become unnecessary: the transform step would translate the reported codes into human-readable full names, and the resulting fixed list of acceptable values would make up an ENUM type. This is probably something that should be done in conjunction with the move from the DB to the data packages.

@cmgosnell how/when do you think we should deal with this? Should we modify the database structure and transform step now, so that the metadata extraction for the data packages is correct? Or wait until after the transition, and then simplify the JSON TableSchema by hand in conjunction with altering the transform step?
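To make the idea concrete, here is a minimal sketch of what translating codes in the transform step might look like. The `PRIME_MOVER_NAMES` mapping, the readable names in it, and the `translate_codes` helper are all hypothetical and only cover a handful of sample codes; nothing here is existing PUDL code.

```python
# Hypothetical sample mapping of reported codes to readable names.
# This is an illustration, not the full list of codes or the names PUDL uses.
PRIME_MOVER_NAMES = {
    "ST": "steam_turbine",
    "GT": "gas_turbine",
    "IC": "internal_combustion",
    "HY": "hydraulic_turbine",
}

# The fixed set of acceptable values that would back an ENUM type.
PRIME_MOVER_ENUM = tuple(sorted(set(PRIME_MOVER_NAMES.values())))


def translate_codes(codes, mapping):
    """Replace reported abbreviations with full names.

    Codes are stripped and upper-cased before lookup. Unknown codes raise
    a ValueError so they cannot slip past the ENUM constraint silently.
    """
    translated = []
    for code in codes:
        key = code.strip().upper()
        if key not in mapping:
            raise ValueError(f"Unrecognized code: {code!r}")
        translated.append(mapping[key])
    return translated
```

With something like this in place, the stored column only ever holds values from `PRIME_MOVER_ENUM`, and the separate abbr lookup table has nothing left to do.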

It does seem like there's going to be an issue of duplicated data in here somewhere -- in that there are several fields which exist across various tables which should be subject to the same ENUM constraints. How do we keep from having to update every last one of them by hand whenever the list of acceptable values changes?
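One possible way around the duplication, sketched under assumptions: define the allowed values once per field name, and have every table's schema look them up when it is generated. `FIELD_ENUMS`, `TABLE_FIELDS`, the table names, and the example values below are all made up for illustration. The `constraints` / `enum` layout follows the Table Schema field descriptor format.

```python
# Single source of truth: each constrained field name maps to its allowed values.
# Field names and values are illustrative assumptions.
FIELD_ENUMS = {
    "fuel_type": ("coal", "gas", "oil"),
    "prime_mover": ("steam_turbine", "gas_turbine"),
}

# Tables only list which fields they contain; they never repeat the values.
# These table names are hypothetical.
TABLE_FIELDS = {
    "example_table_a": ["fuel_type", "prime_mover"],
    "example_table_b": ["fuel_type"],
}


def build_field_schema(field_name):
    """Build one field descriptor, attaching the shared enum if one exists."""
    schema = {"name": field_name, "type": "string"}
    if field_name in FIELD_ENUMS:
        schema["constraints"] = {"enum": list(FIELD_ENUMS[field_name])}
    return schema


def build_table_schema(table_name):
    """Build the schema for a table from the shared field definitions."""
    return {"fields": [build_field_schema(f) for f in TABLE_FIELDS[table_name]]}
```

Changing the list of acceptable values then means editing one entry in `FIELD_ENUMS` and regenerating the schemas, rather than touching each table by hand.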

Annotate remaining database fields that lack metadata #333