datasets / awesome-data

Curated list of quality open datasets
https://datahub.io/collections

US public utility data (energy etc) #281

Open rufuspollock opened 6 years ago

rufuspollock commented 6 years ago

https://github.com/catalyst-cooperative/pudl

The Public Utility Data Liberation project aims to provide a useful interface to publicly available electric utility data in the US. It uses information from the Federal Energy Regulatory Commission (FERC), the Energy Information Administration (EIA), and the Environmental Protection Agency (EPA), among others.

zaneselvans commented 6 years ago

Would it make sense to have a top level Energy category (akin to Climate Change) which would contain this and other energy related datasets?

Another project that's doing the same kind of thing for the EU is Open Power System Data (OPSD). They're already using data packages, so it ought to be easy to integrate them into datahub. I'll definitely bring it up at the meeting in Berlin.

https://open-power-system-data.org/

zaneselvans commented 6 years ago

US data that ought to be brought in includes:

The MSHA datasets are already provided as bulk CSV files with keys that allow them to be connected together in a relational database. Maybe I can try to package that up for datahub.io, just to get familiar with the toolchain, since it's already clean...
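As a rough illustration of what "CSV files with keys that connect in a relational database" could look like in practice, here is a minimal sketch using only the Python standard library. The table names, column names, and rows (`mines`, `violations`, `mine_id`, and so on) are hypothetical stand-ins, not the actual MSHA schema.

```python
import csv
import io
import sqlite3

# Hypothetical stand-ins for two bulk CSV files; the real MSHA
# file names and column names may well differ.
MINES_CSV = "mine_id,mine_name,state\n100,Alpha Mine,WV\n200,Beta Quarry,CO\n"
VIOLATIONS_CSV = "violation_id,mine_id,year\n1,100,2017\n2,100,2018\n3,200,2018\n"


def load_csv(conn, table, text):
    """Insert every data row of a CSV string into an existing table."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    placeholders = ",".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", data)


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces keys when asked
conn.execute(
    "CREATE TABLE mines (mine_id INTEGER PRIMARY KEY, mine_name TEXT, state TEXT)"
)
conn.execute(
    "CREATE TABLE violations ("
    "violation_id INTEGER PRIMARY KEY, "
    "mine_id INTEGER REFERENCES mines(mine_id), "
    "year INTEGER)"
)
load_csv(conn, "mines", MINES_CSV)
load_csv(conn, "violations", VIOLATIONS_CSV)

# The shared key lets the two files be joined like any relational tables.
counts = conn.execute(
    "SELECT m.mine_name, COUNT(*) FROM violations v "
    "JOIN mines m USING (mine_id) "
    "GROUP BY m.mine_name ORDER BY m.mine_name"
).fetchall()
print(counts)
```

With foreign key enforcement switched on, a row in the child file that points at a missing parent key fails on insert, which is a cheap way to check that the bulk files really are consistent before packaging them.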

rufuspollock commented 5 years ago

@zaneselvans this is awesome :smile: Could you package this up and share it?

zaneselvans commented 5 years ago

We're working on packaging the EIA 860, EIA 923, and FERC Form 1 data now to publish openly. I'm less sure what the right way to share the CEMS data is, given how large the files are when uncompressed.

Right now the plan is to publish two types of data packages. One has the normalized database tables, including their foreign key relationships, for someone who wants to quickly re-create a relational DB version of the data locally. The other is more spreadsheet-like, with compiled, de-normalized versions of the data organized into useful tabular resources, including many derived values. Does that seem reasonable? Datahub seems like a good place to host it, but I'm concerned about the size of the packages and the creation of duplicate CSV and JSON versions upon upload. We can partition the data by state or year to keep updates from requiring too much bandwidth or time, but I'm not sure that storing hundreds of GB of uncompressed data in several different formats is a great option.
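To make the two package types concrete, here is a minimal sketch of what the descriptors might look like, built as plain Python dictionaries. The `primaryKey` and `foreignKeys` properties follow the Frictionless Table Schema convention. Everything else is an assumption for illustration: the table names, field names, sample rows, and file paths are hypothetical and are not taken from PUDL.

```python
import json
from collections import defaultdict

# Hypothetical sample rows; not real EIA or FERC data.
plants = [
    {"plant_id": 1, "plant_name": "Plant A", "state": "CO"},
    {"plant_id": 2, "plant_name": "Plant B", "state": "TX"},
]
generation = [
    {"plant_id": 1, "year": 2017, "net_generation_mwh": 1200.0},
    {"plant_id": 2, "year": 2017, "net_generation_mwh": 800.0},
    {"plant_id": 1, "year": 2018, "net_generation_mwh": 1500.0},
]

plant_fields = [
    {"name": "plant_id", "type": "integer"},
    {"name": "plant_name", "type": "string"},
    {"name": "state", "type": "string"},
]
generation_fields = [
    {"name": "plant_id", "type": "integer"},
    {"name": "year", "type": "integer"},
    {"name": "net_generation_mwh", "type": "number"},
]

# Package type 1: normalized tables, with the relationship declared
# in the schema so a relational database can be rebuilt from it.
normalized_package = {
    "name": "example-normalized",
    "resources": [
        {
            "name": "plants",
            "path": "data/plants.csv",
            "schema": {"fields": plant_fields, "primaryKey": "plant_id"},
        },
        {
            "name": "generation",
            "path": "data/generation.csv",
            "schema": {
                "fields": generation_fields,
                "primaryKey": ["plant_id", "year"],
                "foreignKeys": [
                    {
                        "fields": "plant_id",
                        "reference": {"resource": "plants", "fields": "plant_id"},
                    }
                ],
            },
        },
    ],
}

# Package type 2: one wide, spreadsheet-like table per year,
# produced by joining plant attributes onto each generation row.
plants_by_id = {p["plant_id"]: p for p in plants}
denormalized_rows = [{**plants_by_id[g["plant_id"]], **g} for g in generation]

rows_by_year = defaultdict(list)
for row in denormalized_rows:
    rows_by_year[row["year"]].append(row)

denormalized_package = {
    "name": "example-denormalized",
    "resources": [
        {
            "name": f"generation-by-plant-{year}",
            "path": f"data/generation_by_plant_{year}.csv",
            "schema": {"fields": plant_fields + generation_fields[1:]},
        }
        for year in sorted(rows_by_year)
    ],
}

print(json.dumps(normalized_package, indent=2))
print([r["name"] for r in denormalized_package["resources"]])
```

Splitting the wide table into one resource per year is one way to read the partitioning idea: a new year of data adds a file instead of forcing a re-upload of everything. Partitioning by state would work the same way with a different grouping key.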

rufuspollock commented 5 years ago

@zaneselvans do you have any README for this stuff - we could already boot an awesome page with the overview and then start linking to the data packages as they go up. What do you think?

zaneselvans commented 5 years ago

There's the readme in our repo, which has a little blurb about each of the data sources, and the status of their integration, but I'm not sure if that's enough. We did get a (very) small grant to do the packaging and publishing work so we will definitely be doing it.

rufuspollock commented 5 years ago

@zaneselvans which repo exactly :smile: - and great news on the grant!

zaneselvans commented 5 years ago

Ah sorry. This is the main PUDL repo: https://github.com/catalyst-cooperative/pudl