USEPA / standardizedinventories

Standardized Release and Waste Inventories
MIT License

Inconsistent Data Management #145

Open dt-woods opened 9 months ago

dt-woods commented 9 months ago

I'm not sure what the intended pathway is for data management, but seeing a general buy-in on esupy's data manager, I'm guessing that's the direction you are heading. That said, there doesn't seem to be any commonality between approaches across the main data modules (i.e., eGRID, NEI, RCRAInfo, and TRI).

  1. With the latest fix on esupy (v.0.3.1), eGRID data now downloads from source (from EPA website) and the metadata files are generated locally.
  2. NEI data are not generated from source; rather, they are pulled from the AWS remote server, which seems to work, albeit differently from eGRID.
  3. TRI seems to do its own thing. It attempts to download data from source; however, it references the url key, which does not point to a data file. The data file location is instead stored under the unique zip_url key (not shared by other databases). It doesn't use esupy methods. It fails.
  4. RCRAInfo requires the unique extra packages selenium and webdriver_manager. The latest versions of these packages (4.12 and 4.0, respectively^1) crash; see below (notice also the typo in the RCRAInfo log message at timestamp 2023-09-18 13:49:45.430).
    • Note also that I am not a Google Chrome user. Please let me know if this is also a prerequisite.
    • Further note that the error log message at the top of RCRAInfo.py does not trigger an actual message; where does it go?
>>> getInventory('RCRAInfo', 2015)
2023-09-18 13:49:45.360:INFO:globals:read_inventory:RCRAInfo_2015 not found in ~/Library/Application Support/stewi/flowbyfacility
2023-09-18 13:49:45.360:INFO:globals:read_inventory:requested inventory does not exist in local directory, it will be generated...
2023-09-18 13:49:45.430:INFO:RCRAInfo:download_and_extract_zip:Initiating download via browswer...
2023-09-18 13:49:45.430:INFO:logger:log:====== WebDriver manager ======
/bin/sh: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome: No such file or directory
/bin/sh: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome: No such file or directory
2023-09-18 13:49:45.507:INFO:logger:log:Get LATEST chromedriver version for google-chrome
2023-09-18 13:49:45.641:INFO:logger:log:About to download new driver from https://chromedriver.storage.googleapis.com/114.0.5735.90/chromedriver_mac64.zip
2023-09-18 13:49:45.724:INFO:logger:log:Driver downloading response is 200
/bin/sh: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome: No such file or directory
2023-09-18 13:49:45.958:INFO:logger:log:Get LATEST chromedriver version for google-chrome
/bin/sh: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome: No such file or directory
2023-09-18 13:49:47.174:INFO:logger:log:Get LATEST chromedriver version for google-chrome
/bin/sh: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome: No such file or directory
2023-09-18 13:49:47.299:INFO:logger:log:Driver has been saved in cache [~/.wdm/drivers/chromedriver/mac64/114.0.5735.90]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 getInventory('RCRAInfo', 2015)

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/__init__.py:82, in getInventory(inventory_acronym, year, stewiformat, filters, filter_for_LCI, US_States_Only, download_if_missing, keep_sec_cntx)
     66 """Return or generate an inventory in a standard output format.
     67 
     68 :param inventory_acronym: like 'TRI'
   (...)
     79 :return: dataframe with standard fields depending on output format
     80 """
     81 f = ensure_format(stewiformat)
---> 82 inventory = read_inventory(inventory_acronym, year, f,
     83                            download_if_missing)
     85 if (not keep_sec_cntx) and ('Compartment' in inventory):
     86     inventory['Compartment'] = (inventory['Compartment']
     87                                 .str.partition('/')[0])

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:331, in read_inventory(inventory_acronym, year, f, download_if_missing)
    328 else:
    329     log.info('requested inventory does not exist in local directory, '
    330              'it will be generated...')
--> 331     generate_inventory(inventory_acronym, year)
    332 inventory = load_preprocessed_output(meta, paths)
    333 if inventory is None:

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:372, in generate_inventory(inventory_acronym, year)
    370 elif inventory_acronym == 'RCRAInfo':
    371     import stewi.RCRAInfo as RCRAInfo
--> 372     RCRAInfo.main(Option = 'A', Year = [year],
    373                   Tables = ['BR_REPORTING', 'HD_LU_WASTE_CODE'])
    374     RCRAInfo.main(Option = 'B', Year = [year],
    375                   Tables = ['BR_REPORTING'])
    376     RCRAInfo.main(Option = 'C', Year = [year])

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:477, in main(**kwargs)
    473     """If issues in running this option to download the data, go to the
    474     specified url and find the BR_REPORTING_year.zip file and save to
    475     OUTPUT_PATH. Also requires HD_LU_WASTE_CODE.zip"""
    476     query = _config['queries']['Table_of_tables']
--> 477     download_and_extract_zip(tables, query)
    479 elif kwargs['Option'] == 'B':
    480     organize_br_reporting_files_by_year(kwargs['Tables'], year)

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:165, in download_and_extract_zip(tables, query)
    159 prefs = {'download.default_directory': str(OUTPUT_PATH),
    160         'download.prompt_for_download': False,
    161         'download.directory_upgrade': True,
    162         'safebrowsing_for_trusted_sources_enabled': False,
    163         'safebrowsing.enabled': False}
    164 options.add_experimental_option('prefs', prefs)
--> 165 browser = webdriver.Chrome(ChromeDriverManager().install(),
    166                            options=options)
    167 browser.maximize_window()
    168 browser.set_page_load_timeout(30)

TypeError: WebDriver.__init__() got multiple values for argument 'options'
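
The TypeError above is consistent with the first positional argument being bound to `options`, which then collides with the `options=` keyword in the same call. Below is a minimal sketch of that collision and of a keyword-only call that avoids it. `FakeChrome` and `FakeService` are hypothetical stand-ins, not Selenium classes, and the parameter order is an assumption inferred from the error message.

```python
class FakeService:
    """Hypothetical stand-in for an object that holds the driver path."""

    def __init__(self, executable_path):
        self.executable_path = executable_path


class FakeChrome:
    """Hypothetical stand-in whose first positional parameter is `options`.

    The parameter order is an assumption inferred from the error message in
    the traceback; this is not Selenium's actual class.
    """

    def __init__(self, options=None, service=None):
        self.options = options
        self.service = service


def open_browser_positional(driver_path, options):
    # Mirrors the failing call: the path lands in the `options` slot,
    # then `options=` supplies the same parameter a second time.
    return FakeChrome(driver_path, options=options)


def open_browser_keywords(driver_path, options):
    # Passing every argument by keyword avoids the collision.
    return FakeChrome(service=FakeService(driver_path), options=options)


try:
    open_browser_positional('/path/to/chromedriver', {'prefs': {}})
    error_message = None
except TypeError as err:
    error_message = str(err)

browser = open_browser_keywords('/path/to/chromedriver', {'prefs': {}})
print(error_message)
```

With real Selenium 4, the analogous change would presumably be to wrap the driver path in `selenium.webdriver.chrome.service.Service` and pass it as `service=`. That has not been verified against stewi here.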

bl-young commented 9 months ago

Yes, that lack of consistency in how the data are accessed is a bit of a relic and needs to be updated. This is especially true given #144, which seems to be affecting multiple sources. The goal will be to shift towards using data calls via esupy for consistency. I was just starting this on a new branch (requests_update) but have not yet finished.
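
As an illustration of the kind of consistency discussed here, the TRI key mismatch (a `url` key that is not a data file, with the file under `zip_url`) could be handled by one shared lookup. This is a minimal sketch only: `resolve_download_url`, the config layout, and the example URLs are all hypothetical and are not taken from stewi or esupy.

```python
def resolve_download_url(source_config, keys=('zip_url', 'url')):
    """Return the first non-empty URL found under the given keys.

    Hypothetical helper: `zip_url` is tried first so that a source whose
    `url` key points at a landing page still resolves to its data file.
    """
    for key in keys:
        value = source_config.get(key)
        if value:
            return value
    raise KeyError(f'no download URL found under any of {keys}')


# Hypothetical config entries with placeholder URLs.
tri_like = {
    'url': 'https://example.com/tri/landing-page',
    'zip_url': 'https://example.com/tri/data.zip',
}
egrid_like = {'url': 'https://example.com/egrid/data.xlsx'}

tri_target = resolve_download_url(tri_like)
egrid_target = resolve_download_url(egrid_like)
print(tri_target)
print(egrid_target)
```

A single lookup like this would let each inventory module keep its existing key while sharing one download path.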

bl-young commented 9 months ago

The RCRA selenium issue is one I am aware of but had not yet documented. We regularly have issues accessing RCRA because of how that data is stored and provided. I have opened a separate issue, #146.

WesIngwersen commented 9 months ago

The lack of consistency comes from the origin of the tool as a set of independent, inventory-specific scripts written by authors who each approached data acquisition in their own way. Yes, I agree we can reevaluate that as resources are available.
