catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

EPA CEMS Intake Catalog #1564

Open zaneselvans opened 2 years ago

zaneselvans commented 2 years ago

Description

Create a full-featured Intake Catalog for distributing the EPA CEMS hourly emissions data stored as Parquet files. This follows some exploration in #1155. See also the notes in #1495 and PR #1563.

Billing

This work should be billed under our Sloan Foundation "Data Distribution" sub-project.

Goals

Tasks / Issues tracked by this Epic

Phase 1:

Get a functional intake catalog deployed for demonstration & feedback.

Phase 2:

Flesh out metadata and improve performance.

Out of Scope

katie-lamb commented 2 years ago

I'll review the Intake catalog PR, but here are a few comments from a first pass at the notebook:

Questions

Other nits

zaneselvans commented 2 years ago

Whoops, yes, I forgot to add the intake requirements. I already had them installed in my local environment.

The dtypes you've got listed there seem to be the correct ones. But you have no year or state columns. I'm still confused as to why those aren't showing up, given that the data is definitely stored in the files.

I really don't understand how the source-specific metadata works. My suspicion is that the allowable year/state values can be put in there, along with the column/table descriptions, but I don't see any documentation on how to do it appropriately.
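To make the idea concrete, here is a rough sketch of what "allowable values plus descriptions" could look like. It assumes the catalog entry follows the usual Intake YAML layout, where a source can carry a `parameters` section (with `type`, `default`, and `allowed`) and a free-form `metadata` section. The entry is written as a plain Python dict so it runs without Intake installed. The source contents, the example years and states, and the `check_parameters` helper are all hypothetical and only illustrate how an allowed-values list would constrain user input; this is not Intake's own validation code.

```python
# Hypothetical catalog entry mirroring the layout of an Intake YAML source.
# All names and values are illustrative, not the actual PUDL catalog.
CEMS_SOURCE = {
    "description": "EPA CEMS hourly emissions (illustrative entry).",
    "driver": "parquet",
    "parameters": {
        "year": {
            "description": "Reporting year to load.",
            "type": "int",
            "default": 2020,
            "allowed": [2019, 2020],
        },
        "state": {
            "description": "Two-letter state abbreviation to load.",
            "type": "str",
            "default": "CO",
            "allowed": ["CO", "ID"],
        },
    },
    # Free-form metadata: a possible home for column/table descriptions.
    "metadata": {
        "columns": {
            "year": "Year the hourly record was reported for.",
            "state": "State in which the emitting unit is located.",
        },
    },
}


def check_parameters(source, **user_values):
    """Return the effective parameter values, rejecting disallowed ones."""
    params = source["parameters"]
    unknown = set(user_values) - set(params)
    if unknown:
        raise KeyError(f"Unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, spec in params.items():
        # Fall back to the declared default when the user gives no value.
        value = user_values.get(name, spec["default"])
        if value not in spec["allowed"]:
            raise ValueError(f"{name}={value!r} not in {spec['allowed']}")
        resolved[name] = value
    return resolved
```

With this layout, `check_parameters(CEMS_SOURCE, year=2019)` falls back to the default state, while a year outside the `allowed` list raises a `ValueError`.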

zaneselvans commented 2 years ago

Hey @martindurant, thanks so much for your comment on #1496! I got simplecache working and created a basic installable catalog, and I have been experimenting with different setups for our open US energy data catalog over in the pudl-catalog repo. I've collected a bunch of outstanding issues in the issue above, each of which points at an individual pudl-catalog issue, and I was wondering if we might be able to get some advice from you on how best to set things up. I'm not sure which of these problems are just me not understanding how to configure the catalog correctly and which are deeper constraints.

Do you happen to have a list of publicly visible intake catalogs that use Parquet data sources? I've tried searching GitHub but haven't been very successful. The CarbonPlan Data repo is the best I've seen, but they have a very simple configuration.

Once the EPA CEMS Hourly Emissions data source is finished, we also want to look at writing an intake-sqlite driver (#1156) to manage the distribution of versioned SQLite databases, which will download and cache the database file locally, and then use the intake-sql driver to access it. Does that seem like a reasonable approach? Thanks for any pointers you can offer!
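The download-and-cache step described above can be sketched with the standard library alone. This is only an illustration of the approach, not the proposed intake-sqlite driver: it does not use intake or intake-sql, the `cached_sqlite` and `query` helpers are hypothetical names, and it assumes the database version is encoded in the URL, so that a new version simply shows up as a new file in the cache.

```python
import shutil
import sqlite3
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def cached_sqlite(url: str, cache_dir: Path) -> Path:
    """Download the database at ``url`` into ``cache_dir`` once; return its path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / Path(urlparse(url).path).name
    if not local_path.exists():
        # Write to a temporary name first so that an interrupted download
        # is never mistaken for a complete cached database.
        tmp_path = local_path.with_suffix(local_path.suffix + ".part")
        with urlopen(url) as response, open(tmp_path, "wb") as out:
            shutil.copyfileobj(response, out)
        tmp_path.replace(local_path)
    return local_path


def query(url: str, cache_dir: Path, sql: str) -> list:
    """Run ``sql`` against the locally cached copy, opened read-only."""
    db_path = cached_sqlite(url, cache_dir).resolve()
    conn = sqlite3.connect(db_path.as_uri() + "?mode=ro", uri=True)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```

After the first call, later queries read only the local file, which is the behavior a driver would hand off to a SQL layer. A real driver would also need to handle cache invalidation and checksums, which this sketch leaves out.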