dt-woods opened 8 months ago
Here's what I found for the new HTTPS site:
I found what looks like it might be CEMS data here:
but I can't tell whether it's broken out by state or by quarter.
A challenge with the API solution is managing the API key, which can be done within the config.yml, similar to what was done in the scenario modeler.
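A minimal sketch of pulling the key from `config.yml` (the `epa_cems_api_key` field name and the environment-variable fallback are assumptions; a real implementation would use a YAML parser such as PyYAML rather than this line-based read):

```python
import os

def get_api_key(config_path="config.yml"):
    """Return the CAMPD API key from a config file, falling back to an
    environment variable. Line-based parse keeps the sketch
    dependency-free; yaml.safe_load would be the real choice."""
    if os.path.exists(config_path):
        with open(config_path) as f:
            for line in f:
                key, sep, value = line.partition(":")
                if sep and key.strip() == "epa_cems_api_key":
                    value = value.strip().strip("'\"")
                    if value:
                        return value
    # Fallback name is also an assumption.
    return os.environ.get("EPA_CEMS_API_KEY")
```

This keeps the key out of the repository while letting the rest of the code treat it as a plain string.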
I found the CAMPD custom data download:
Their website calls the following API (e.g., for Pennsylvania, Q1, 2016):
Looking at `cems_data.py`, it seems that `build_cems_df` groups (aggregates) the quarterly results... do we need hourly data?
Also, does (or could) StEWI support the CAMS emission inventory?
The StEWI question is one for @bl-young.
Can we use that API as-is without having to get a key, etc.?
One alternative solution that I believe I discussed with Ben before is using personal API keys to download the data and then make that data available in repositories. He touched on this in issue 205 https://github.com/USEPA/ElectricityLCI/issues/205#issuecomment-1747557800. It's not as transparent as I think we would ultimately like it, but it will save potential users the trouble of having to get their own API keys to get at the data.
Flowsa has dealt with API key issues in the past (see here, though I think there is talk of tweaking the approach in the future). If this is facility data and it falls within the schema available in StEWI (e.g., FlowByFacility), then it could be a candidate for hosting this workflow. I am not very familiar with the nuances of this specific source. StEWI does not currently have an approach for handling API keys. Also, it seems like this could be a problem:
But yes, I think Matt's idea in general is a good one. Have a processed version available somewhere already, but allow users the ability to generate their own for whatever reason if they get an API key. Though I would recommend not storing it in the repository itself but rather some other public spot. Or vice versa, first check if an API key exists and process locally, and if it does not, go grab the processed data from your external source. That way it is still transparent and shows the script/fxn used to generate the processed data but users don't have to run that chunk of code.
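That branched approach could look something like this sketch (the function and parameter names are hypothetical; the real fetchers would wrap the CAMPD API and the external data host, respectively):

```python
def load_cems(year, state, api_key=None, fetch_api=None, fetch_processed=None):
    """Branched loader: use the API when a key is available, otherwise
    fall back to the pre-processed copy hosted externally.

    fetch_api and fetch_processed are injected callables so the
    branching logic is testable on its own; their real counterparts
    would hit api.epa.gov and the external data source."""
    if api_key:
        # Key present: regenerate the processed data locally.
        return fetch_api(year, state, api_key)
    # No key: grab the already-processed data instead.
    return fetch_processed(year, state)
```

Either way the script that generates the processed data stays visible in the repository, which preserves the transparency Ben describes.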
Metadata is crucial here for reproducibility because users may be using different versions of the CEMS data depending on when it was pulled and whether it changed.
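One way to capture that provenance is to write a small metadata record alongside every pull; a sketch (the field names are assumptions) that hashes the payload so a silent upstream change is detectable:

```python
import datetime
import hashlib

def build_metadata(source_url, payload_bytes, params):
    """Record enough provenance to tell two CEMS pulls apart: where
    the data came from, when, with what query, and a content hash."""
    return {
        "source_url": source_url,
        "accessed": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "query_params": params,
        "sha256": hashlib.sha256(payload_bytes).hexdigest(),
        "n_bytes": len(payload_bytes),
    }
```

Comparing the `sha256` of a fresh pull against the stored record immediately flags whether the upstream data changed between runs.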
Oh I forgot to add, we don't need hourly data. I think I grabbed the quarterly data originally for this very reason. I believe we only got daily data that way. At this point for the eLCI, annual emissions are all that are needed - that would match every other data source, and that's how the quarterly data is aggregated anyways.
In some future version, it might be nice to be able to generate data at some fraction of the year, seasonal or even daily, but for now, I wouldn't worry about that at all. That is if annual data is somehow available through the API, I would take that - much smaller download.
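Rolling sub-annual records up to annual totals is a simple group-and-sum; a sketch using the field names from the API response shown later in this thread, treating None as missing rather than zero:

```python
from collections import defaultdict

def aggregate_to_annual(records, keys=("facilityId", "year"),
                        fields=("so2Mass", "co2Mass", "noxMass", "grossLoad")):
    """Sum quarterly (or daily) emission records up to annual totals
    per facility. A field that is None in every record stays None in
    the output rather than becoming a spurious zero."""
    totals = defaultdict(dict)
    for rec in records:
        key = tuple(rec[k] for k in keys)
        for f in fields:
            v = rec.get(f)
            if v is not None:
                totals[key][f] = totals[key].get(f, 0.0) + v
    # Fill wholly-missing fields with None for a consistent schema.
    return {k: {f: v.get(f) for f in fields} for k, v in totals.items()}
```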
It's pretty straightforward with the API, which took all of about 30 seconds to request from here; see snippet:
>>> import requests
>>> s_url = "https://api.epa.gov/easey/streaming-services/emissions/apportioned/annual/by-facility"
>>> params = {'api_key': 'abcXYZ', 'year': 2016, 'stateCode': 'PA'}
>>> r = requests.get(s_url, params=params)
>>> len(r.json()) # number of facilities in PA for 2016
74
>>> r.json()
[{'stateCode': 'PA',
'facilityName': 'Brunot Island Power Station',
'facilityId': 3096,
'year': 2016,
'grossLoad': 44642.41,
'steamLoad': None,
'so2Mass': 0.177,
'co2Mass': 34891.613,
'noxMass': 9.454,
'heatInput': 587078.638},
...
{'stateCode': 'PA',
'facilityName': 'ETMT Marcus Hook Terminal',
'facilityId': 880107,
'year': 2016,
'grossLoad': None,
'steamLoad': 2161856.78,
'so2Mass': None,
'co2Mass': None,
'noxMass': 41.047,
'heatInput': 3261517.455}]
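A quick way to roll those per-facility records up to a state total, skipping the None entries (e.g., Marcus Hook's missing co2Mass):

```python
def state_total(facilities, field="co2Mass"):
    """Sum a single mass field over per-facility records like those
    above, skipping facilities that report None for that field."""
    return sum(f[field] for f in facilities if f.get(field) is not None)
```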
> The StEWI question is one for @bl-young.
>
> Can we use that API as-is without having to get a key, etc.?
>
> One alternative solution that I believe I discussed with Ben before is using personal API keys to download the data and then make that data available in repositories. He touched on this in issue 205 (#205). It's not as transparent as I think we would ultimately like it, but it will save potential users the trouble of having to get their own API keys to get at the data.
Matt, the API requires a key.
I'm hearing a few options; please clarify which you prefer:
Or some combination?
P.S. I don't think requesting an API key was burdensome and I think it helps government agencies track (and justify) their data management budgets.
For completeness, here is the reference I used for my API call snippet:
I know this question is not for me... but I agree with these choices and would of course not recommend 1. I still think 2 and 3 are not mutually exclusive given that 2 provides a useful way towards reproducibility without worrying about changes in the CEMS data, or if the API goes down.
EPA's data management system has been quite easy to work with (https://dmap-data-commons-ord.s3.amazonaws.com/index.html?prefix=) and esupy is already configured to use it, if that is the route you decide to take.
I would say I do agree with Ben's approach: if the user does not have an API key, or if they simply prefer to download, then we can provide "canonical" datasets on the AWS site mentioned above. In the short term, and in the interest of getting newer data and a working version of eLCI out there, I would suggest that we focus on getting the canonical data sorted, using manual pulls if necessary, while keeping up the metadata standard that exists on EPA's data management system. The other reason I would advocate for this is that I know my machine runs into issues with the RCRAInfo (I think) data pulls, because they rely on a chrome plugin that is blocked by my government machine.
The branched approach and building up all the API calls to me at least sounds like more effort than I intended for this year and is maybe something to shoot for next year, as we likely have to integrate EIA API calls as well. Up to two keys needed!
> The other reason I would advocate for this is that I know my machine runs into issues with the RCRAInfo (I think) data pulls because it relies on a chrome plugin that is blocked by my government machine.
In this regard you are in luck! @dt-woods found a solution there a few weeks ago https://github.com/USEPA/standardizedinventories/issues/146
The backend API call for CAMS needs to be updated; see here.
A preliminary search over each state's annual facility emissions appears to max out at around 150 records (for Texas). The maximum return from a single page request over the API is 500 records. Be warned: a future year in which a state has more than 500 facility records will require handling multiple page requests!
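To guard against that, the paging loop can be factored out; this sketch stops at the first short page (`fetch_page` is an injected callable standing in for the real HTTP request, and the `page`/`perPage` parameter names it would pass are assumptions):

```python
def fetch_all_pages(fetch_page, per_page=500):
    """Collect every record across page requests.

    fetch_page(page, per_page) returns one page of records; iteration
    stops at the first page shorter than per_page. per_page=500
    reflects the observed single-request cap."""
    records, page = [], 1
    while True:
        batch = fetch_page(page, per_page)
        records.extend(batch)
        if len(batch) < per_page:
            return records
        page += 1
```

With the real client, `fetch_page` would wrap `requests.get` against the emissions endpoint, adding the page parameters to the existing `params` dict.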
I am trying to download the CEMS data from electricitylci.
Surprisingly, FTP still exists (despite a decade of slow eradication); however, the address referenced in `cems_data.py` does not appear to be responsive.
This FTP does not appear to have a username or password associated with it, so I'm guessing it's a public FTP and we aren't using SFTP (ergo, no paramiko).
I've tried a couple of different things on different OSs, but I keep getting this error. See the code snippet for a reproducible error (taken from `_download_FTP` in `cems_data.py`). A little poking around and I found that there is an alternative website:
However, I can't seem to place where the CEMS data are located here.
Suggestions?
https://github.com/USEPA/ElectricityLCI/blob/e56268132f7607ead58a33bb5bdd525563a784f5/electricitylci/cems_data.py#L319