catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Use pudl-metadata.json file to make Datasette browsing of PUDL data more powerful #1198

Open MichaelTiemannOSC opened 2 years ago

MichaelTiemannOSC commented 2 years ago

Is your feature request related to a problem? Please describe. When I use the Datasette browser, the facets it suggests are not always the most appropriate for the data set. It would be much better to pre-load suggested facets by defining them in a metadata.json file that Datasette can load. Example syntax is given in the Datasette documentation:

{
    "databases": {
        "sf-trees": {
            "tables": {
                "Street_Tree_List": {
                    "facets": ["qLegalStatus"]
                }
            }
        }
    }
}

Even better, many columns are defined by well-known units. When such units are specified, pint can format nicely:

{
    "custom_units": [
        "USD = []",
        "MMBTU = 1e6 * BTU",
        "fraction = [] = frac",
        "percent = 1e-2 frac = pct",
        "ppm = 1e-6 fraction"
    ],
}
...
    "databases": {
        "pudl": {
            "tables": {
                "boiler_fuel_eia923": {
                    "units": {
                        "sulfur_content_pct": "pct",
                        "ash_content_pct": "pct"
                    },
                    "size": 10
                }
            }
        }
    }
}

Describe the solution you'd like I have started work on a pudl-metadata.json file, but whatever I'm doing should be properly rooted in how the PUDL team wants to manage the metadata. So I'm asking whether to create a fresh file that just covers metadata for Datasette output, or whether this should be merged in with higher-level metadata. I have seen from other issues that metadata use for output is "out of scope" for all current metadata issues. Once such a metadata file is established, the community can work on making the queries, facets, outputs, etc., whatever the community needs.

Describe alternatives you've considered None. I want to make Datasette sing.

Additional context This is the docker-compose yaml command I'm using to coordinate Datasette as a browser for Jupyter Notebooks running separately (I've learned its a bad idea to try to integrate Datasette browsing too much in the execution flow of my notebooks):

version: "3"
services:
    notebook:
        image: notebook
        volumes:
            - ./notebook:/home/jovyan/work
        ports:
            - 8888:8888

    pudl:
        image: datasette
    depends_on:
            - datasette
    volumes:
            - ./notebook/database:/work
    ports:
            - 8002:8002
    command: serve /work/pudl.sqlite --cors --setting base_url /pudl-datasette/ -p 8002 -h 0.0.0.0 --metadata /work/pudl-metadata.json
zaneselvans commented 2 years ago

Hey @MichaelTiemannOSC, we're definitely going to flesh out the Datasette metadata a bunch more going forward. We're in the process of integrating this big PR #806 which changes how we store the metadata internally, which will make it easier to update and produce metadata output formatted appropriately for different applications, including at least:

However, we haven't started mapping out what exactly the metadata for Datasette or Intake will need to look like yet, since we want to get the current metadata working in the new internal container first. If you feel like looking at the new pudl.metadata subpackage and thinking about what information we should be storing in there to enable easy export of metadata for a good Datasette experience, that would be wonderful. I'm not familiar with pint but explicitly storing units associated with the columns is definitely something we've discussed in the past.

See also #426

MichaelTiemannOSC commented 2 years ago

Zane, I took a look at https://github.com/catalyst-cooperative/pudl/blob/main/src/pudl/load/metadata.py which I presume is, or largely is, the puld.metadata subpackage you referred to. If not, please advise.

I also note that Datasette now has a remote metadata feature, which means that anybody who hosts the Datasette instance can decide how they would like to manage Datasette output: https://datasette.io/plugins/datasette-remote-metadata Of course it would be great for the PUDL team (which includes the larger community) to raise the baseline as much as possible, but it means that people can curate their output experiences, much as https://github.com/catalyst-cooperative/pudl-query-library encourages people to curate queries.

zaneselvans commented 2 years ago

Hey @MichaelTiemannOSC, no not that module (oh god, not that module!) -- we're working on a branch that's associated with PR #806. Check out the metadata subpackage on that branch.

katie-lamb commented 2 years ago

Hi @MichaelTiemannOSC , I am currently working on integrating metadata with Datasette and this should be wrapped up soon! Check out the proposed changes here. The biggest change is that Datasette now includes descriptions for data source, table, and column levels and we have the structure in place to dynamically update our Datasette metadata.

While this first round of metadata integration probably won't include units and facets, both are things we'd like to integrate as next steps. As you point out above, I think we'll likely need to compile a list of hand selected facets for certain tables to make facet suggestions better. Are there any specific tables that you've encountered where the facet suggestions were not helpful and you would have liked other columns to serve as facets?

I appreciate your feedback and attention to detail on all this!

MichaelTiemannOSC commented 2 years ago

Hi @katie-lamb, excited to see you working on this project. I've been working on other topics for the past 6 months, but now that I know the iron is hot I will return some focus to this area.

As a sidebar, our project is using Trino as a central query coordinator, which has dozens of connectors to other well-known databases (such as PostgreSQL). Surprisingly, there's no dedicated SQLite connector (though there is a SqlAlchemy connector). If there's a way to better connect the dots between Trino and SQLite , we are all ears.

MichaelTiemannOSC commented 2 years ago

I looked back at my notes and found the need to explicitly request facets for fuel_type_code in boiler_fuel_eia923. I think this was needed because there are surprisingly many fuel_type_codes (coal by any other name is coal to me), and Datasette modestly decides against selecting that as a facet-able dimension.

Here's where I got to before shifting attention to other data and analytic topics (pudl-metadata.json):

{
    "title": "PUDL 0.4.0 Dataset",
    "description": "Public Utilities Data from EIA, FERC, and US CENSUS databases",
    "license": "MIT License",
    "license_url": "https://github.com/catalyst-cooperative/pudl/blob/main/LICENSE.txt",
    "source": "The Public Utility Data Liberation Project (PUDL)",
    "source_url": "https://github.com/catalyst-cooperative/pudl",
    "custom_units": [
        "USD = []",
        "Mcf = 1e6 * feet**3 = _ = mcf",
        "gramsU = 1 * gram",
        "kgU = 1000 * gramsU",
        "mwhth = MWh",
        "mwdth = 24 * mwhth",
        "MMBTU = 1e6 * BTU = _ = mmbtu",
        "fraction = [] = frac",
        "percent = 1e-2 frac = pct",
        "ppm = 1e-6 fraction"
    ],
    "databases": {
        "pudl": {
            "queries": {
            },
            "tables": {
                "assn_gen_eia_unit_epa": {
                    "description_html": "Custom <em>table</em> description",
                    "license": "MIT License",
                    "license_url": "https://github.com/catalyst-cooperative/pudl/blob/main/LICENSE.txt",
                    "size": 10
                },
                "assn_plant_id_eia_epa": {
                    "size": 10
                },
                "boiler_fuel_eia923": {
                    "facets": ["fuel_type_code"],
                    "units": {
                        "sulfur_content_pct": "pct",
                        "ash_content_pct": "pct"
                    },
                    "size": 10
                },
                "boiler_generator_assn_eia860": {
                    "size": 10
                },
                "boilers_entity_eia": {
                    "size": 10
                },
                "coalmine_eia923": {
                    "size": 10
                },
                "energy_source_eia923": {
                    "size": 10
                },
                "ferc_accounts": {
                    "size": 10
                },
                "ferc_depreciation_lines": {
                    "size": 10
                },
                "fuel_ferc1": {
                    "units": {
                        "fuel_cost_per_mmbtu": "USD/MMBTU"
                    },
                    "size": 10
                },
                "fuel_receipts_costs_eia923": {
                    "units": {
                        "fuel_cost_per_mmbtu": "USD/MMBTU"
                    },
                    "size": 10
                },
                "fuel_type_aer_eia923": {
                    "size": 10
                },
                "fuel_type_eia923": {
                    "size": 10
                },
                "generation_eia923": {
                    "units": {
                        "net_generation_mwh": "MWh"
                    },
                    "size": 10
                },
                "generation_fuel_eia923": {
                    "units": {
                        "fuel_consumed_mmbtu": "MMBTU",
                        "net_generation_mwh": "MWh"
                    },                   
                    "size": 10
                },
                "generators_eia860": {
                    "units": {
                        "capacity_mw": "MW",
                        "summer_capacity_mw": "MW",
                        "winter_capacity_mw": "MW",
                        "net_generation_mwh": "MWh",
                        "planned_net_summer_capacity_uprate_mw": "MW",
                        "planned_net_winter_capacity_uprate_mw": "MW",
                        "planned_net_summer_capacity_derate_mw": "MW",
                        "planned_net_winter_capacity_derate_mw": "MW",
                        "planned_new_capacity_mw": "MW",
                        "minimum_load_mw": "MW",
                        "summer_estimated_capability_mw": "MW",
                        "winter_estimated_capability_mw": "MW"
                    },                   
                    "size": 10
                },
                "generators_entity_eia": {
                    "size": 10
                },
                "ownership_eia860": {
                    "size": 10
                },
                "plant_in_service_ferc1": {
                    "size": 10
                },
                "plant_unit_epa": {
                    "size": 10
                },
                "plants_eia": {
                    "size": 10
                },
                "plants_eia860": {
                    "size": 10
                },
                "plants_entity_eia": {
                    "size": 10
                },
                "plants_ferc1": {
                    "size": 10
                },
                "plants_hydro_ferc1": {
                    "units": {
                        "capacity_mw": "MW",
                        "peak_demand_mw": "MW",
                        "net_capacity_favorable_conditions_mw": "MW",
                        "net_capacity_adverse_conditions_mw": "MW",
                        "net_generation_mwh": "MWh",
                        "capex_land": "USD",
                        "capex_structures": "USD",
                        "capex_facilities": "USD",
                        "capex_equipment": "USD",
                        "capex_roads": "USD",
                        "asset_retirement_cost": "USD",
                        "capex_total": "USD",
                        "capex_per_mw": "USD/MW",
                        "opex_operations": "USD",
                        "opex_water_for_power": "USD",
                        "opex_hydraulic": "USD",
                        "opex_electric": "USD",
                        "opex_generation_misc": "USD",
                        "opex_rents": "USD",
                        "opex_engineering": "USD",
                        "opex_structures": "USD",
                        "opex_dams": "USD",
                        "opex_plant": "USD",
                        "opex_misc_plant": "USD",
                        "opex_total": "USD",
                        "opex_per_mwh": "USD/MWh"
                    },               
                    "size": 10
                },
                "plants_pudl": {
                    "size": 10
                },
                "plants_pumped_storage_ferc1": {
                    "units": {
                        "capacity_mw": "MW",
                        "peak_demand_mw": "MW",
                        "plant_capability_mw": "MW",
                        "net_generation_mwh": "MWh",
                        "energy_used_for_pumping_mwh": "MWh",
                        "net_load_mwh": "MWh",
                        "capex_land": "USD",
                        "capex_structures": "USD",
                        "capex_facilities": "USD",
                        "capex_wheels_turbines_generators": "USD",
                        "capex_equipment_electric": "USD",
                        "capex_equipment_misc": "USD",
                        "capex_roads": "USD",
                        "asset_retirement_cost": "USD",
                        "capex_total": "USD",
                        "capex_per_mw": "USD/MW",
                        "opex_operations": "USD",
                        "opex_water_for_power": "USD",
                        "opex_pumped_storage": "USD",
                        "opex_electric": "USD",
                        "opex_generation_misc": "USD",
                        "opex_rents": "USD",
                        "opex_engineering": "USD",
                        "opex_structures": "USD",
                        "opex_dams": "USD",
                        "opex_plant": "USD",
                        "opex_misc_plant": "USD",
                        "opex_production_before_pumping": "USD",
                        "opex_pumping": "USD",
                        "opex_total": "USD",
                        "opex_per_mwh": "USD/MWh"
                    },               
                    "size": 10
                },
                "plants_small_ferc1": {
                    "units": {
                        "capacity_mw": "MW",
                        "peak_demand_mw": "MW",
                        "net_generation_mwh": "MWh",
                        "total_cost_of_plant": "USD",
                        "capex_per_mw": "USD/MW",
                        "opex_total": "USD",
                        "opex_fuel": "USD",
                        "opex_maintenance": "USD",
                        "fuel_cost_per_mmbtu": "USD/MMBTU"
                    },               
                    "size": 10
                },
                "plants_steam_ferc1": {
                    "units": {
                        "capacity_mw": "MW",
                        "peak_demand_mw": "MW",
                        "plant_capability_mw": "MW",
                        "water_limited_capacity_mw": "MW",
                        "not_water_limited_capacity_mw": "MW",
                        "net_generation_mwh": "MWh",
                        "capex_land": "USD",
                        "capex_structures": "USD",
                        "capex_equipment": "USD",
                        "capex_total": "USD",
                        "capex_per_mw": "USD/MW",
                        "opex_operations": "USD",
                        "opex_fuel": "USD",
                        "opex_coolants": "USD",
                        "opex_steam": "USD",
                        "opex_steam_other": "USD",
                        "opex_transfer": "USD",
                        "opex_electric": "USD",
                        "opex_misc_power": "USD",
                        "opex_rents": "USD",
                        "opex_allowances": "USD",
                        "opex_engineering": "USD",
                        "opex_structures": "USD",
                        "opex_boiler": "USD",
                        "opex_plants": "USD",
                        "opex_misc_steam": "USD",
                        "opex_production_tota;": "USD",
                        "opex_per_mwh": "USD/MWh",
                        "asset_retirement_cost": "USD"
                    },               
                    "size": 10
                },
                "prime_movers_eia923": {
                    "size": 10
                },
                "purchased_power_ferc1": {
                    "units": {
                        "billing_demand_mw": "MW",
                        "non_coincident_peak_demand_mw": "MW",
                        "coincident_peak_demand_mw": "MW",
                        "purchased_mwh": "MWh",
                        "received_mwh": "MWh",
                        "delivered_mwh": "MWh",
                        "demand_charges": "USD",
                        "energy_charges": "USD",
                        "other_charges": "USD",
                        "total_settlement": "USD"
                    },
                    "size": 10
                },
                "transport_modes_eia923": {
                    "size": 10
                },
                "utilities_eia": {
                    "size": 10
                },
                "utilities_eia860": {
                    "size": 10
                },
                "utilities_entity_eia": {
                    "size": 10
                },
                "utilities_ferc1": {
                    "size": 10
                },
                "utilities_pudl": {
                    "size": 10
                },
                "utility_plant_assn": {
                    "size": 10
                }
            }
        }
    }
}