Open zaneselvans opened 2 years ago
I'll do a review on the Intake catalog PR but this is just a few comments from a first pass at the notebook:
gs:// URLs?
GOOGLE_CLOUD_PROJECT project_id to be catalyst-cooperative-pudl as well, but seems like you need authorization. The https:// links seem unworkably slow as well.
pudl-catalog.yml like it is for metadata.yml for Datasette.
pudl_cat.epacems_one_file.discover()) don't seem to be the same for me as they are for you (unless I'm misunderstanding your comment).
epacems_one_file and epacems_multi_file.
pd.read_parquet() on the whole EPA CEMS directory, then unit_id_epa is an Int32 instead of a string, on a single file it is a string, and on a remote file it's an object. When read in with the intake catalog it is back to an Int32:
'dtype': {'plant_id_eia': 'int32',
          'unitid': 'string',
          'operating_datetime_utc': 'datetime64[ns, UTC]',
          'operating_time_hours': 'float32',
          'gross_load_mw': 'float32',
          'steam_load_1000_lbs': 'float32',
          'so2_mass_lbs': 'float32',
          'so2_mass_measurement_code': 'category',
          'nox_rate_lbs_mmbtu': 'float32',
          'nox_rate_measurement_code': 'category',
          'nox_mass_lbs': 'float32',
          'nox_mass_measurement_code': 'category',
          'co2_mass_tons': 'float32',
          'co2_mass_measurement_code': 'category',
          'heat_content_mmbtu': 'float32',
          'facility_id': 'Int32',
          'unit_id_epa': 'Int32'},
intake_parquet after getting a somewhat cryptic error. intake and intake_parquet should be added to pudl-dev?
Whoops, yes, I forgot to add the intake requirements. I had them installed in my local environment.
The dtypes you've got listed there seem to be the correct ones. But you have no year or state columns. I'm still confused as to why those aren't showing up, given that the data is definitely stored in the files.
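One way to pin down discrepancies like these is to diff the dtype mappings reported by each access path. The following is only a minimal sketch: `compare_dtypes` is a hypothetical helper, not part of intake or PUDL, and the two dictionaries are abridged from the listing above with `unit_id_epa` changed to `string` to mimic the single-file read described earlier.

```python
def compare_dtypes(reference, observed):
    """Compare two {column: dtype-name} mappings.

    Returns (mismatched, missing, extra):
      mismatched: {column: (reference dtype, observed dtype)} where they differ
      missing:    columns in reference but not in observed
      extra:      columns in observed but not in reference
    """
    mismatched = {
        col: (reference[col], observed[col])
        for col in sorted(reference.keys() & observed.keys())
        if reference[col] != observed[col]
    }
    missing = sorted(reference.keys() - observed.keys())
    extra = sorted(observed.keys() - reference.keys())
    return mismatched, missing, extra


# Abridged from the dtype listing reported via the intake catalog.
catalog_dtypes = {
    "plant_id_eia": "int32",
    "unitid": "string",
    "facility_id": "Int32",
    "unit_id_epa": "Int32",
}
# Illustrative only: a single-file read reportedly yields a string unit_id_epa.
single_file_dtypes = dict(catalog_dtypes, unit_id_epa="string")

mismatched, missing, extra = compare_dtypes(catalog_dtypes, single_file_dtypes)
print(mismatched)  # {'unit_id_epa': ('Int32', 'string')}

# Columns expected in the data but absent from the reported dtypes.
absent = [col for col in ("year", "state") if col not in catalog_dtypes]
print(absent)  # ['year', 'state']
```

With pandas, a mapping of this shape can be built from a dataframe with `{col: str(dtype) for col, dtype in df.dtypes.items()}`.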
I really don't understand how the source-specific metadata works. My suspicion is that the allowable year/state values can be put in there, and the column/table descriptions, but I don't see any documentation on how to do it appropriately.
Hey @martindurant thanks so much for your comment on #1496! I got simplecache working and created a basic installable catalog, and have been experimenting with different setups for our open US energy data catalog over in the pudl-catalog repo. I've collected a bunch of outstanding issues in the issue above, which points at the individual pudl-catalog issues, and I was wondering if we might be able to get some advice from you on how best to set things up. I'm not sure which of these things is just me not understanding how to configure the catalog correctly and which are deeper constraints.
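For readers unfamiliar with simplecache: it is an fsspec caching protocol that is enabled by chaining it in front of a remote URL. The sketch below shows that pattern outside of an intake catalog. The helper names, the cache directory, and the example URL are placeholders chosen for illustration, not values from the pudl-catalog repo, and the third-party imports are kept inside the function so the URL helper has no dependencies.

```python
def with_simplecache(url: str) -> str:
    """Prefix a remote URL with fsspec's simplecache protocol (URL chaining)."""
    prefix = "simplecache::"
    return url if url.startswith(prefix) else prefix + url


def read_cached_parquet(url: str, cache_dir: str):
    """Download a remote Parquet file into cache_dir, then read the local copy.

    Requires fsspec, pandas, and a Parquet engine such as pyarrow; gs:// URLs
    additionally need gcsfs. Those are assumptions about the environment.
    """
    import fsspec
    import pandas as pd

    # open_local() fetches the remote file through the cache layer and
    # returns the path of the cached copy on disk.
    local_path = fsspec.open_local(
        with_simplecache(url),
        simplecache={"cache_storage": cache_dir},
    )
    return pd.read_parquet(local_path)


# Hypothetical usage (placeholder URL, not a real PUDL location):
# df = read_cached_parquet("https://example.com/epacems.parquet", "/tmp/pudl-cache")
print(with_simplecache("gs://some-bucket/epacems.parquet"))
```

Because the cache is keyed on the remote path, repeated reads of the same file should be served from local disk rather than downloaded again.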
Do you happen to have a list of publicly visible intake catalogs that use Parquet data sources? I've tried searching GitHub but haven't been very successful. The CarbonPlan Data repo is the best I've seen, but they have a very simple configuration.
Once the EPA CEMS Hourly Emissions data source is finished, we also want to look at writing an intake-sqlite driver (#1156) to manage the distribution of versioned SQLite databases, which will download and cache the database file locally, and then use the intake-sql driver to access it. Does that seem like a reasonable approach? Thanks for any pointers you can offer!
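To make the download-then-query idea concrete, here is a rough standard-library sketch of the pattern. It is not the proposed intake-sqlite driver and does not use intake-sql. The function names and the choice to key the cache on a hash of the URL are assumptions made purely for illustration.

```python
import hashlib
import shutil
import sqlite3
import urllib.request
from pathlib import Path


def fetch_sqlite(url: str, cache_dir: Path) -> Path:
    """Download a SQLite file once and reuse the cached copy afterwards."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on the URL so each versioned URL gets its own local file.
    local = cache_dir / (hashlib.sha256(url.encode()).hexdigest() + ".sqlite")
    if not local.exists():
        partial = local.with_suffix(".part")
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        # Rename only after the download completes, so an interrupted
        # download is never mistaken for a valid cached database.
        partial.replace(local)
    return local


def query_cached(url: str, sql: str, cache_dir: Path) -> list:
    """Run a query against the locally cached copy of a remote database."""
    connection = sqlite3.connect(fetch_sqlite(url, cache_dir))
    try:
        return connection.execute(sql).fetchall()
    finally:
        connection.close()
```

A real driver would presumably also need cache invalidation and some way to select a database version, which this sketch leaves out.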
Description
Create a full-featured Intake Catalog for distributing the EPA CEMS hourly emissions data stored as Parquet files. This follows some exploration in #1155. See also notes in #1495 and PR #1563.
Billing
This work should be billed under our Sloan Foundation "Data Distribution" sub-project.
Goals
Tasks / Issues tracked by this Epic
Phase 1:
Get a functional intake catalog deployed for demonstration & feedback.
read_parquet()
fsspec/simplecache (see comment on #1496)
pudl_catalog for installation using pip.
Phase 2:
Flesh out metadata and improve performance.
Out of Scope