climate-mirror / datasets

For tracking data mirroring progress

EPA Superfund Documents Collection #329

Open JeremiahCurtis opened 7 years ago

JeremiahCurtis commented 7 years ago

Trying to grab the Superfund documents, but this could be a herculean task, as I cannot find any FTP directory or HTTPS database page. Rather, I have to crawl through numerous links, starting from https://semspub.epa.gov/src/search and entering data ONLY in the "region" and "collection type" fields.

Unless I'm missing something here and there really is a simpler way to get these, this is an issue that should probably be broken up, say by region (each region, in turn, contains "administrative records" and "special collections", which must be brought up separately)

A further problem: when I get to a page listing all the documents attached to a particular Superfund location (say https://semspub.epa.gov/src/collection/01/AR1), there is a list of hyperlinks to documents (clicking on "show ALL entries" saves time). When I open a given link directly in the browser, it shows up as a PDF document, but when I try the same link in the DownThemAll extension, I get nothing.

BTW, if someone has already tackled this, feel free to close this issue. I can't imagine this collection being safe in the near future.

TechMaz commented 7 years ago

@JeremiahCurtis Looks like some of these directories allow direct file access here: https://semspub.epa.gov/work/ I found that by visiting some of the PDFs you are talking about and going up a couple of directories. I'm not sure if any of those have been grabbed yet.

TechMaz commented 7 years ago

It could also be useful to save pages from here: https://yosemite.epa.gov/r10/cleanup.nsf

TechMaz commented 7 years ago

Interesting groupings of documents can be found on pages like this: https://yosemite.epa.gov/r10/cleanup.nsf/73defd6beb7e2b5188257e22005fc6e8/aeb7a0292d44e9ee882571b00062510a!OpenDocument

TechMaz commented 7 years ago

Also more listings of Superfund info here: https://cumulis.epa.gov/supercpad/CurSites/srchrslt.cfm?start=1

TechMaz commented 7 years ago

And here: https://yosemite.epa.gov/r10/cleanup.nsf/sites/

JeremiahCurtis commented 7 years ago

@TechMaz thanks a ton for the https://semspub.epa.gov/work/ link. That does appear to have all the Superfund docs.

Is anyone else able to grab particular regions? I'm working on region 1 alone via DownThemAll (finally got it to work last night), and I'm nowhere near finishing after about 12 hours. In fact, I've barely begun. I'd estimate several TB worth of docs across all 10 regions and HQ documents.

TechMaz commented 7 years ago

@JeremiahCurtis I don't have the storage space to help with this, but I'm sure someone else does.

donbright commented 7 years ago

size estimates

lftp quicksilver.epa.gov:/work> du -h -d1
353G    ./01                                                     
127G    ./02                                                     
48G     ./03                                                     
20G     ./04                                                     
160G    ./05                                                     
81G     ./06                                                     
52G     ./07                                                     
24G     ./08                                                     
33G     ./09                                                     
52G     ./10                                                     
1.5G    ./11                                                     
35G     ./HQ                                          
0       ./lost+found                                          
982G    .
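
The per-directory sizes above could be used to divide the download among volunteers. Below is a minimal sketch, not anything that was actually run for this mirror: the volunteer names and free-space figures are made up, and the assignment is a simple greedy heuristic (largest directory first, given to whoever has the most free space left). The per-directory figures sum to 986.5G rather than the reported 982G, presumably because each line is rounded.

```python
# Sizes copied from the du listing above (lost+found omitted), in GB.
DU_OUTPUT = """\
353G ./01
127G ./02
48G ./03
20G ./04
160G ./05
81G ./06
52G ./07
24G ./08
33G ./09
52G ./10
1.5G ./11
35G ./HQ
"""


def parse_sizes(text):
    """Return {directory: size_in_GB} from lines like '353G ./01'."""
    sizes = {}
    for line in text.splitlines():
        size, path = line.split()
        name = path[2:] if path.startswith("./") else path
        sizes[name] = float(size.rstrip("G"))
    return sizes


def split_work(sizes, capacities):
    """Greedy split: largest directory first, to the volunteer with most free space.

    capacities maps a volunteer name to free disk space in GB.
    Returns (plan, unassigned), where unassigned lists directories nobody can hold.
    """
    free = dict(capacities)
    plan = {name: [] for name in capacities}
    unassigned = []
    for directory, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        name = max(free, key=free.get)
        if free[name] >= size:
            plan[name].append(directory)
            free[name] -= size
        else:
            # If the emptiest volunteer cannot hold it, nobody can.
            unassigned.append(directory)
    return plan, unassigned


if __name__ == "__main__":
    # Hypothetical volunteers and their free space in GB.
    volunteers = {"volunteer_a": 600, "volunteer_b": 400, "volunteer_c": 50}
    plan, unassigned = split_work(parse_sizes(DU_OUTPUT), volunteers)
    for name, dirs in plan.items():
        print(name, sorted(dirs))
    print("unassigned:", sorted(unassigned))
```

Anything that ends up in `unassigned` needs either more volunteers or a split below the region level.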

TechMaz commented 7 years ago

@donbright Thanks for that! I have enough space to do ./11 if you want @JeremiahCurtis

donbright commented 7 years ago

lol i accidentally destroyed my DigitalOcean mirror server trying to do this. it's now frozen trying to mount a corrupted volume.

TechMaz commented 7 years ago

Ahh

hkuchampudi commented 7 years ago

I've mined the metadata associated with the Superfund documents in my repository. In the same repository I also have a CSV with direct links for all the documents; it may make divvying up the task a little easier. While I want to help with the download process, I don't have the storage for it :(
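
One way a CSV of direct links could be divvied up is to deal its rows out into one file per volunteer. This is only a sketch: the CSV itself is not shown in this thread, so the code assumes nothing about its columns beyond a header row on the first line, and the `shard_N.csv` output names are made up.

```python
import csv


def shard_rows(rows, n_shards):
    """Deal rows round-robin into n_shards lists so each gets a similar count."""
    shards = [[] for _ in range(n_shards)]
    for i, row in enumerate(rows):
        shards[i % n_shards].append(row)
    return shards


def split_csv(src_path, n_shards, prefix="shard"):
    """Write prefix_0.csv ... prefix_{n-1}.csv, repeating the header in each.

    ASSUMPTION: the first line of src_path is a header row.
    """
    with open(src_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        shards = shard_rows(reader, n_shards)
    paths = []
    for i, rows in enumerate(shards):
        path = f"{prefix}_{i}.csv"
        with open(path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)
        paths.append(path)
    return paths
```

Round-robin balances the number of documents per shard, not the number of bytes; if the metadata includes file sizes, sorting by size before dealing would even things out further.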

donbright commented 7 years ago

currently pulling 06, 07, 08, 09, 10, 11, HQ. will update...

update: this server is incredibly fast (2-5M/s) if you are copying to another cloud server. I should have the entire thing in an offline copy by tomorrow. Just in time, before Lil' Scotty Pruitt can throw the whole thing in the dumpster.

offline copy complete except file 01/589487.pdf, which appears to have some kind of problem making it impossible to download. It causes lftp mirror to freeze, and an individual 'get' doesn't work either.

URL saved was http://semspub.epa.gov/work/ (directories 01 through 11 and HQ)

TechMaz commented 7 years ago

Great! Can anyone mirror to a public location?

JeremiahCurtis commented 7 years ago

Does anyone have any ftp or https data directories for the following? My internet service was down the entire weekend, but hopefully I'm not too late to catch up.

BTW, if any of these have been resolved in other issues, please disregard. As far as I know, they have not been raised elsewhere. Some may be only tangentially related to climate change research, but I would imagine that they're all subject to tampering by the current administration.

Also, the URLs I provided just give an overview of the respective datasets; the search queries on these URLs are useless for any large-scale or comprehensive downloads.

1) https://www.epa.gov/enviro/pcs-icis-search

This search allows you to retrieve selected data from the Permit Compliance System (PCS) and Integrated Compliance Information System (ICIS) databases in Envirofacts regarding facilities registered with the federal enforcement and compliance (FE&C) and holding National Pollutant Discharge Elimination System (NPDES) permits.

2) https://www.epa.gov/enviro/tri-search

The Toxics Release Inventory (TRI) Search retrieves data from the TRI database in Envirofacts.

TRI Search allows access to basic facility information, all forms submitted to EPA since 1987, aggregate chemical release data for all years reported, and relative risk information. The results display any facility that has reported from 1987 to present, even though the facility may or may not have submitted TRI data in the most recent reporting year. The last year of data displayed represents the last year TRI data was reported.

For each facility there is a link to summarized TRI information for years reported, Federal Registry System (FRS) facility information, a corresponding Risk Screening Environmental Indicator (RSEI) report that provides a quantitative, relative estimate of risk posed by the facility based on the chemical released and potential exposure pathways, and a Pollution Prevention (P2) report presenting measures taken to prevent pollution and reduce the amount of toxic chemicals entering the environment. You may narrow your search by filtering through facility name/ID, geographic location, standard industrial classification, and chemical names/CAS numbers.

3) https://www3.epa.gov/enviro/facts/tsca/tsca_search.html

The TSCA Search allows you to retrieve selected data from the Toxic Substances Control Act database in Envirofacts.

4) https://www.epa.gov/enviro/sdwis-search

Information about safe drinking water is stored in SDWIS, the EPA's Safe Drinking Water Information System. SDWIS tracks information on drinking water contamination levels as required by the 1974 Safe Drinking Water Act and its 1986 and 1996 amendments. The Safe Drinking Water Act (SDWA) and accompanying regulations establish Maximum Contaminant Levels (MCLs), treatment techniques, and monitoring and reporting requirements to ensure that water provided to customers is safe for human consumption. The Safe Drinking Water Information System (SDWIS) contains information about public water systems and their violations of EPA's drinking water regulations. Searching SDWIS will allow you to locate your drinking water supplier and view its violations and enforcement history for the last ten years.

5) https://www3.epa.gov/enviro/facts/rcrainfo/search.html

Hazardous waste information is contained in the Resource Conservation and Recovery Act Information (RCRAInfo), a national program management and inventory system about hazardous waste handlers. In general, all generators, transporters, treaters, storers, and disposers of hazardous waste are required to provide information about their activities to state environmental agencies. These agencies, in turn, pass on the information to regional and national EPA offices. This regulation is governed by the Resource Conservation and Recovery Act (RCRA), as amended by the Hazardous and Solid Waste Amendments of 1984. You may use the RCRAInfo Search to determine identification and location data for specific hazardous waste handlers, and to find a wide range of information on treatment, storage, and disposal facilities regarding permit/closure status, compliance with Federal and State regulations, and cleanup activities.

6) https://www3.epa.gov/enviro/facts/radinfo/search.html

The Radiation Information Database (RADINFO) Search allows you to retrieve selected data from RADINFO database in Envirofacts regarding facilities EPA regulates for radiation or radioactivity.

7) https://archive.epa.gov/enviro/html/icr/web/html/icr_query.html

The Information Collection Rule (ICR) query form allows you to retrieve disinfection byproduct and microbial reports at the national, state, and utility levels.

REMINDER:

The ICR data were collected as part of a national research project to support development of national drinking water standards. They should NOT be used to determine local water systems compliance with drinking water standards, nor should they be used to make personal judgements about individual health risks.

The ICR data were collected from July 1997 to December 1998. All the data have been verified, and the database is complete. EPA will use the data to identify national and regional patterns, not to reach system-by-system or treatment plant-by-treatment plant conclusions. The data that you will see are from individual samples. What EPA will evaluate is the degree to which the samples represent overall water quality.

Several states (and two U.S. territories) did not participate in the ICR: Montana, North Dakota, Vermont, Wyoming, the Virgin Islands, and Guam.

8) https://www.epa.gov/enviro/icis-air-overview

ICIS-AIR contains compliance and permit data for stationary sources of air pollution (such as electric power plants, steel mills, factories, and universities) regulated by EPA, state and local air pollution agencies. The information in ICIS-AIR is used by the states to prepare State Implementation Plans (SIPs) and to track the compliance status of point sources with various regulatory programs under the Clean Air Act.

9) https://www.epa.gov/enviro/frs-query-page

10) https://www.epa.gov/enviro/br-search

The Hazardous Waste Report (Biennial Report) collects data on the generation, management, and minimization of hazardous waste. This provides detailed data on the generation of hazardous waste from large quantity generators and data on waste management practices from treatment, storage, and disposal facilities. The Biennial Report data provide a basis for trend analyses. Data about hazardous waste activities is reported for odd number years (beginning with 1989) to EPA. EPA then provides reports on hazardous waste generation and management activity that accompany the data files.

11) Brownfield maps and grants

https://ofmpub.epa.gov/apex/cimc/f?p=cimc:MAP::::71:P71_WELSEARCH:NULL|Cleanup||||true|false|false|false|false|false|||sites|Y

TechMaz commented 7 years ago

Looks like for TRI data you can get XML for a single state via: https://iaspub.epa.gov/enviro/efservice/tri_facility/state_abbr/VA/

Also, FRS data can be downloaded as a single CSV here: https://www3.epa.gov/enviro/html/fii/downloads/state_files/national_single.zip
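
Since the TRI URL above takes a state abbreviation, the same pattern could presumably be repeated for every state. The sketch below only builds the URLs and defines a plain download helper. It assumes the other states follow the same path layout as the VA example, and it lists the 50 states plus DC only, so territories would have to be added by hand.

```python
import shutil
import urllib.request

# URL pattern taken from the VA example above.
BASE = "https://iaspub.epa.gov/enviro/efservice/tri_facility/state_abbr"

# 50 states plus DC; territories (PR, VI, GU, ...) are NOT included.
STATES = (
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS "
    "MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV "
    "WI WY DC"
).split()


def tri_urls(states=STATES):
    """Return {state: url}, assuming every state uses the same path layout."""
    return {state: f"{BASE}/{state}/" for state in states}


def download(url, path, timeout=60):
    """Stream one URL to a local file; raises on HTTP errors or timeouts."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        with open(path, "wb") as out:
            shutil.copyfileobj(resp, out)


if __name__ == "__main__":
    # Print a few URLs only; no network access happens here.
    for state, url in list(tri_urls().items())[:3]:
        print(state, url)
```

Calling `download(url, f"tri_{state}.xml")` in a loop over `tri_urls().items()` would fetch one file per state. Large states may run into the same row cap described in the Biennial Report comment further down.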

TechMaz commented 7 years ago

I also found a data export for ICIS data here: https://iaspub.epa.gov/enviro/efservice/mv_new_geo_best_picks/

I mirrored it here for easy access: https://archive.org/download/enviro_efservice/enviro_efservice.xml

TechMaz commented 7 years ago

I was also able to download all the Hazardous Waste Report (Biennial Report) data (I think). I found it here: https://iaspub.epa.gov/enviro/efservice/BRS_WASTE_INFORMATION/CSV and mirrored the data here: https://archive.org/download/BRSWASTEINFORMATION

UPDATE: That was only the first 10,000 entries; currently trying to get them all.

UPDATE 2: OK, downloaded the first 5,000,001 entries. Not sure how many there are.
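
When the total row count is unknown and a single request is capped, one option is to request fixed-size windows until a window comes back short. The sketch below is not the method used for the mirror above. In particular, the `rows/<first>:<last>` path segment, its zero-based inclusive numbering, and its position before `/CSV` are assumptions that should be checked against the Envirofacts service documentation before relying on them.

```python
import csv
import io
import urllib.request

BASE = "https://iaspub.epa.gov/enviro/efservice/BRS_WASTE_INFORMATION"


def page_url(first, last):
    # ASSUMPTION: the service accepts a rows/<first>:<last> segment here.
    return f"{BASE}/rows/{first}:{last}/CSV"


def page_ranges(page_size, start=0):
    """Yield inclusive (first, last) windows: (0, 9999), (10000, 19999), ..."""
    first = start
    while True:
        yield first, first + page_size - 1
        first += page_size


def fetch_all(fetch_page, page_size=10000):
    """Call fetch_page(first, last) until a page comes back short."""
    rows = []
    for first, last in page_ranges(page_size):
        page = fetch_page(first, last)
        rows.extend(page)
        if len(page) < page_size:
            break
    return rows


def fetch_csv_page(first, last, timeout=300):
    """Download one window and return its data rows.

    ASSUMPTION: every page starts with a header row, which is dropped.
    """
    with urllib.request.urlopen(page_url(first, last), timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return list(csv.reader(io.StringIO(text)))[1:]
```

`fetch_all(fetch_csv_page)` would then walk the whole table. It holds every row in memory, so for a table with millions of rows it would be safer to write each page to disk inside the loop instead.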