USEPA / ElectricityLCI


Not Found: ftp://newftp.epa.gov #207

Open dt-woods opened 8 months ago

dt-woods commented 8 months ago

I am trying to download the CEMS data from electricitylci.

Surprisingly, FTP still exists (despite a decade of slow eradication); however, the address referenced in cems_data.py does not appear to be responsive.

This FTP does not appear to have a username or password associated with it, so I'm guessing it's a public FTP and we aren't using SFTP (ergo, no paramiko).

I've tried a couple of different things on different OSes, but I keep getting this error. See code snippet for a reproducible error (taken from _download_FTP in cems_data.py):

>>> import urllib
>>> import ftplib 
>>> from electricitylci.cems_data import source_url
>>> my_url = source_url('epacems', 2016, 1, 'PA')
>>> my_url  # !!! notice the double forward slash after 'quarterly'
'ftp://newftp.epa.gov/dmdnload/emissions/daily/quarterly//2016/DLY_2016paQ1.zip'
>>> domain = urllib.parse.urlparse(my_url).netloc
>>> domain
'newftp.epa.gov'
>>> ftp = ftplib.FTP(domain)
---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
Cell In[8], line 1
----> 1 ftp = ftplib.FTP(domain)

File /usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ftplib.py:121, in FTP.__init__(self, host, user, passwd, acct, timeout, source_address, encoding)
    119 self.timeout = timeout
    120 if host:
--> 121     self.connect(host)
    122     if user:
    123         self.login(user, passwd, acct)

File /usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ftplib.py:162, in FTP.connect(self, host, port, timeout, source_address)
    160 self.af = self.sock.family
    161 self.file = self.sock.makefile('r', encoding=self.encoding)
--> 162 self.welcome = self.getresp()
    163 return self.welcome

File /usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ftplib.py:244, in FTP.getresp(self)
    243 def getresp(self):
--> 244     resp = self.getmultiline()
    245     if self.debugging:
    246         print('*resp*', self.sanitize(resp))

File /usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ftplib.py:230, in FTP.getmultiline(self)
    229 def getmultiline(self):
--> 230     line = self.getline()
    231     if line[3:4] == '-':
    232         code = line[:3]

File /usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ftplib.py:218, in FTP.getline(self)
    216     print('*get*', self.sanitize(line))
    217 if not line:
--> 218     raise EOFError
    219 if line[-2:] == CRLF:
    220     line = line[:-2]

EOFError: 

After a little poking around, I found that there is an alternative website:

However, I can't seem to place where the CEMS data are located here.

Suggestions?

https://github.com/USEPA/ElectricityLCI/blob/e56268132f7607ead58a33bb5bdd525563a784f5/electricitylci/cems_data.py#L319

dt-woods commented 8 months ago

Here's the site I found for the new HTTPS site:

dt-woods commented 8 months ago

I found what looks like it might be CEMS data here:

but I can't tell whether it's organized by state or by quarter.

m-jamieson commented 8 months ago

Ugh. https://github.com/USEPA/cam-api-examples

dt-woods commented 8 months ago

A challenge with the API solution is managing the API key, which can be done within the config.yml, similar to what was done in the scenario modeler.
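For what it's worth, pulling the key from the YAML could be as simple as the sketch below. The `epa_cam_api_key` field name is hypothetical, not part of the current config schema:

```python
# Hypothetical sketch: reading a CAMPD API key from the model config.
# The 'epa_cam_api_key' field name is an assumption, not the actual
# electricitylci config schema.
import yaml  # PyYAML


def get_api_key(config_path="config.yml"):
    """Return the CAMPD API key from the YAML config, or None if absent."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    return config.get("epa_cam_api_key")
```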

I found the CAMPD custom data download:

Their website calls the following API (e.g., for Pennsylvania, Q1, 2016):

Looking at cems_data.py, it seems that build_cems_df groups (aggregates) the quarterly results... do we need hourly data?

Also, does/could stewi support CAMS emission inventory?

m-jamieson commented 8 months ago

The StEWI question is one for @bl-young.

Can we use that API as-is without having to get a key, etc.?

One alternative solution that I believe I discussed with Ben before is using personal API keys to download the data and then make that data available in repositories. He touched on this in issue 205 https://github.com/USEPA/ElectricityLCI/issues/205#issuecomment-1747557800. It's not as transparent as I think we would ultimately like it, but it will save potential users the trouble of having to get their own API keys to get at the data.

bl-young commented 8 months ago

Flowsa has dealt with API key issues in the past (see here, though I think there is talk of tweaking the approach in the future). I think that if this is facility data, and it falls within the schema available in stewi (e.g., FlowByFacility), then it could be a candidate for hosting this workflow. I am not very familiar with the nuances of this specific source. StEWI does not currently have an approach for handling API keys. Also seems like this could be a problem: [screenshot]

But yes, I think Matt's idea in general is a good one. Have a processed version available somewhere already, but allow users the ability to generate their own for whatever reason if they get an API key. Though I would recommend not storing it in the repository itself but rather some other public spot. Or vice versa, first check if an API key exists and process locally, and if it does not, go grab the processed data from your external source. That way it is still transparent and shows the script/fxn used to generate the processed data but users don't have to run that chunk of code.

Metadata is crucial here for reproducibility because users may be using different versions of the CEMS data depending on when it was pulled and whether it changed.
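That branched check might look something like the sketch below. The processed-data URL is a placeholder, and the endpoint and parameter names should be verified against the CAM API docs:

```python
# Sketch of the branched approach: use the API when a key is configured,
# otherwise fall back to a canonical processed copy hosted elsewhere.
import requests

API_URL = ("https://api.epa.gov/easey/streaming-services/"
           "emissions/apportioned/annual/by-facility")
# Placeholder location for the pre-processed copy; the real path would be
# decided when the data are posted.
PROCESSED_URL = "https://dmap-data-commons-ord.s3.amazonaws.com/..."


def plan_cems_request(year, api_key=None):
    """Decide where annual CEMS data come from: the API when a key is
    configured, otherwise the canonical processed copy."""
    if api_key:
        return API_URL, {"api_key": api_key, "year": year}
    return PROCESSED_URL, {}


def load_cems_annual(year, api_key=None):
    """Fetch annual CEMS data from whichever source is planned."""
    url, params = plan_cems_request(year, api_key)
    r = requests.get(url, params=params)
    r.raise_for_status()
    return r.json()
```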

m-jamieson commented 8 months ago

Oh, I forgot to add: we don't need hourly data. I think I grabbed the quarterly data originally for this very reason; I believe we only got daily data that way. At this point for the eLCI, annual emissions are all that are needed - that would match every other data source, and that's how the quarterly data are aggregated anyway.

In some future version, it might be nice to be able to generate data at some fraction of the year, seasonal or even daily, but for now I wouldn't worry about that at all. That is, if annual data are somehow available through the API, I would take that - a much smaller download.
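For illustration, that annual roll-up could be a plain groupby-sum. The frame below is toy data with column names borrowed from the API fields, not necessarily what build_cems_df uses:

```python
# Minimal sketch of rolling finer-grained CEMS records up to annual totals,
# in the spirit of what build_cems_df does with the quarterly files.
# Toy data; column names mirror the API response fields.
import pandas as pd

daily = pd.DataFrame({
    "facilityId": [3096, 3096, 880107, 880107],
    "co2Mass": [100.0, 150.0, 200.0, 50.0],
    "noxMass": [1.0, 2.0, 3.0, 0.5],
})

# Sum each pollutant per facility to get annual totals.
annual = daily.groupby("facilityId", as_index=False)[["co2Mass", "noxMass"]].sum()
print(annual)
```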

dt-woods commented 8 months ago

It's pretty straightforward with the API, which took all of about 30 seconds to request from here; see snippet:

>>> import requests
>>> s_url = "https://api.epa.gov/easey/streaming-services/emissions/apportioned/annual/by-facility"
>>> params = {'api_key': 'abcXYZ', 'year': 2016, 'stateCode': 'PA'}
>>> r = requests.get(s_url, params=params)
>>> len(r.json())  # number of facilities in PA for 2016
74
>>> r.json()
[{'stateCode': 'PA',
  'facilityName': 'Brunot Island Power Station',
  'facilityId': 3096,
  'year': 2016,
  'grossLoad': 44642.41,
  'steamLoad': None,
  'so2Mass': 0.177,
  'co2Mass': 34891.613,
  'noxMass': 9.454,
  'heatInput': 587078.638},
...
{'stateCode': 'PA',
  'facilityName': 'ETMT Marcus Hook Terminal',
  'facilityId': 880107,
  'year': 2016,
  'grossLoad': None,
  'steamLoad': 2161856.78,
  'so2Mass': None,
  'co2Mass': None,
  'noxMass': 41.047,
  'heatInput': 3261517.455}]

dt-woods commented 8 months ago

> The StEWI question is one for @bl-young.
>
> Can we use that API as-is without having to get a key, etc.?
>
> One alternative solution that I believe I discussed with Ben before is using personal API keys to download the data and then make that data available in repositories. He touched on this in issue 205 #205 (comment). It's not as transparent as I think we would ultimately like it, but it will save potential users the trouble of having to get their own API keys to get at the data.

Matt, the API requires a key.

I'm hearing a few options, please clarify which you prefer:

  1. We create an 'ELCI' API key and include it with the electricitylci package. The annual facility-level emissions should be well under the 1 M record limit, but this leaves the key vulnerable.
  2. We query the API and provide annual emissions (e.g., 2016–2022), but this requires a public server to store the data or bloats the repository data directory even more than it already is.
  3. We integrate the API key as a configuration parameter in the YAML, but this requires users to apply for their own key.

Or some combination?

P.S. I don't think requesting an API key was burdensome and I think it helps government agencies track (and justify) their data management budgets.

dt-woods commented 8 months ago

For completeness, here is the reference I used for my API call snippet:

bl-young commented 8 months ago

I know this question is not for me... but I agree with these choices and would of course not recommend 1. I still think 2 and 3 are not mutually exclusive given that 2 provides a useful way towards reproducibility without worrying about changes in the CEMS data, or if the API goes down.

EPA's data management system has been quite easy to work with https://dmap-data-commons-ord.s3.amazonaws.com/index.html?prefix= and esupy is already configured to use it, if that is the route you decide to take.

m-jamieson commented 8 months ago

I would say I do agree with Ben's approach - if the user does not have an API key, or if they choose to download, then we can provide "canonical" datasets on the AWS site mentioned above. In the short term, and in the interest of getting newer data and a working version of elci out there, I would suggest that we focus on getting the canonical data sorted (using manual pulls if necessary) while keeping up the metadata standard that exists on EPA's data management system. The other reason I would advocate for this is that I know my machine runs into issues with the RCRAInfo (I think) data pulls, because they rely on a chrome plugin that is blocked by my government machine.

The branched approach and building up all the API calls to me at least sounds like more effort than I intended for this year and is maybe something to shoot for next year, as we likely have to integrate EIA API calls as well. Up to two keys needed!

bl-young commented 8 months ago

> The other reason I would advocate for this is that I know my machine runs into issues with the RCRAInfo (I think) data pulls because it relies on a chrome plugin that is blocked by my government machine.

In this regard you are in luck! @dt-woods found a solution there a few weeks ago https://github.com/USEPA/standardizedinventories/issues/146

dt-woods commented 2 months ago

The backend API call for CAMS needs to be updated; see here.

A preliminary search over each state's annual facility emissions appears to max out at around 150 records (for Texas). The maximum return from a single page request over the API is 500 records. Be warned: a future year in which a state has more than 500 facility records will need to handle multiple page requests!
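A hedged sketch of handling that, with the paging loop separated from the HTTP call so it can be tested without the network. The `page`/`perPage` parameter names are assumptions to be checked against the endpoint actually used:

```python
# Sketch: collect records across pages until a short (or empty) page
# signals the end. `fetch_page(page, per_page)` is any callable that
# returns one page of records as a list; in practice it would wrap
# requests.get with api_key, year, and stateCode plus the paging params.
def fetch_all_pages(fetch_page, per_page=500):
    """Accumulate records from successive pages of an API response."""
    records = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        records.extend(batch)
        if len(batch) < per_page:
            # A partial page means there is nothing left to request.
            break
        page += 1
    return records
```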