catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Review diverging dataset source paths #850

Closed ezwelty closed 3 years ago

ezwelty commented 3 years ago

When attempting to merge source attributes spread out across src/pudl/package_data/meta/datapackage.json, pudl.constants.data_source_info and pudl.constants.base_data_urls, I found some differences in the path attribute. I can't find any reason why they would differ.
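A comparison like this can be sketched in a few lines of Python. This is only an illustration: the dictionary shapes, the `find_path_conflicts` helper, and the `example_source` entry are hypothetical and do not reflect the actual structure of `datapackage.json` or `pudl.constants`. The `ferc1` and `epacems` paths are the ones reported in this issue.

```python
# Hypothetical stand-ins for two of the metadata locations. The real
# structures in datapackage.json and pudl.constants may be shaped differently.
datapackage_sources = {
    "ferc1": {"path": "ftp://eforms1.ferc.gov/f1allyears"},
    "epacems": {"path": "ftp://newftp.epa.gov/dmdnload/emissions/hourly/monthly"},
    # Made-up entry to show a source whose paths agree.
    "example_source": {"path": "https://example.com/data"},
}
data_source_info = {
    "ferc1": {
        "path": "https://www.ferc.gov/industries-data/electric/general-information/"
        "electric-industry-forms/form-1-electric-utility-annual"
    },
    "epacems": {"path": "https://ampd.epa.gov/ampd"},
    "example_source": {"path": "https://example.com/data"},
}


def find_path_conflicts(*named_sources):
    """Return {source: {origin: path}} for sources whose paths disagree.

    Each argument is an (origin_label, sources_dict) pair.
    """
    paths = {}
    for origin, sources in named_sources:
        for name, attrs in sources.items():
            path = attrs.get("path")
            if path is not None:
                paths.setdefault(name, {})[origin] = path
    # Keep only sources that have more than one distinct path.
    return {
        name: by_origin
        for name, by_origin in paths.items()
        if len(set(by_origin.values())) > 1
    }


conflicts = find_path_conflicts(
    ("datapackage.json", datapackage_sources),
    ("constants.data_source_info", data_source_info),
)
for name, by_origin in sorted(conflicts.items()):
    print(name)
    for origin, path in by_origin.items():
        print(f"  {origin}: {path}")
```

Sources with identical paths in every location drop out, so only the ones that need a decision remain.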

Unless there is a good reason to have multiple paths for a source, we need to choose one path from the options for each source below:

| source | path (datapackage.json) | path (constants.data_source_info) |
| --- | --- | --- |
| mhsa | https://arlweb.msha.gov/OpenGovernmentData/OGIMSHA.asp | https://www.msha.gov/mine-data-retrieval-system |
| ferceqr | ftp://eqrdownload.ferc.gov/DownloadRepositoryProd/BulkNew/CSV | https://www.ferc.gov/industries-data/electric/power-sales-and-markets/electric-quarterly-reports-eqr |
| ferc1 | ftp://eforms1.ferc.gov/f1allyears | https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual |
| epacems | ftp://newftp.epa.gov/dmdnload/emissions/hourly/monthly | https://ampd.epa.gov/ampd |

Note that this is after fixing some of the paths in src/pudl/package_data/meta/datapackage.json as follows:

zaneselvans commented 3 years ago

In all of these cases we're looking for a path to a human-readable "source" of the data, right? Not the raw downloads. Both FERC and MSHA have overhauled their sites in the last year, and the old URLs no longer work, even though the raw data download location is the same in many cases. I would say...

ezwelty commented 3 years ago

@zaneselvans Thanks for the quick response. I agree that human readable is the right choice. I'll implement your choices.

cmgosnell commented 3 years ago

I believe this is implemented in #806.