Open JeremiahCurtis opened 7 years ago
@JeremiahCurtis Looks like some of these directories allow direct file access here: https://semspub.epa.gov/work/ I found that by visiting some of the PDFs you are talking about and going up a couple of directories. I'm not sure if any of those have been grabbed yet.
It could also be useful to save pages from here: https://yosemite.epa.gov/r10/cleanup.nsf
Interesting groupings of documents can be found on pages like this: https://yosemite.epa.gov/r10/cleanup.nsf/73defd6beb7e2b5188257e22005fc6e8/aeb7a0292d44e9ee882571b00062510a!OpenDocument
Also more listings of Superfund info here: https://cumulis.epa.gov/supercpad/CurSites/srchrslt.cfm?start=1
@TechMaz thanks a ton for the https://semspub.epa.gov/work/ link.......that does appear to have all the Superfund docs
Is anyone else able to grab particular regions? I'm working on region 1 alone via downthemall (finally got it to work last night), and I'm nowhere near finishing after about 12 hours. In fact, I've barely begun. I'd estimate several TB worth of docs across all 10 regions and HQ documents.
@JeremiahCurtis I don't have the storage space to help with this, but I'm sure someone else does.
size estimates
lftp quicksilver.epa.gov:/work> du -h -d1
353G ./01
127G ./02
48G ./03
20G ./04
160G ./05
81G ./06
52G ./07
24G ./08
33G ./09
52G ./10
1.5G ./11
35G ./HQ
0 ./lost+found
982G .
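For anyone coordinating who takes which directory, the listing above can be parsed and dealt out so each person gets a roughly equal share. This is only a minimal sketch: the function names and the three placeholder volunteers are made up for illustration, the sizes are treated as the rounded `du -h` figures, and the largest-first greedy assignment is just one simple heuristic.

```python
import re

DU_OUTPUT = """\
353G ./01
127G ./02
48G ./03
20G ./04
160G ./05
81G ./06
52G ./07
24G ./08
33G ./09
52G ./10
1.5G ./11
35G ./HQ
0 ./lost+found
982G .
"""

# Multipliers converting du -h suffixes to gigabytes (a bare number is bytes).
TO_GB = {"": 1 / 1024**3, "K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def parse_du(text):
    """Return {directory: size_in_GB}, skipping the '.' grand-total line."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[1] == ".":
            continue
        match = re.fullmatch(r"([\d.]+)([KMGT]?)", parts[0])
        if match:
            number, unit = match.groups()
            name = parts[1][2:] if parts[1].startswith("./") else parts[1]
            sizes[name] = float(number) * TO_GB[unit]
    return sizes


def assign(sizes, volunteers):
    """Greedy split: give each directory, largest first, to the lightest load."""
    loads = {name: 0.0 for name in volunteers}
    plan = {name: [] for name in volunteers}
    for directory, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        lightest = min(loads, key=loads.get)
        plan[lightest].append(directory)
        loads[lightest] += size
    return plan, loads


if __name__ == "__main__":
    sizes = parse_du(DU_OUTPUT)
    # Placeholder names; substitute whoever actually volunteers.
    plan, loads = assign(sizes, ["volunteer_a", "volunteer_b", "volunteer_c"])
    for name in plan:
        print(f"{name}: {plan[name]} (~{loads[name]:.0f}G)")
```

Note that the per-directory figures add up to about 986.5G rather than the 982G total, which is expected since `du -h` rounds each line.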
@donbright Thanks for that! I have enough space to do ./11 if you want @JeremiahCurtis
lol i accidentally destroyed my digital ocean mirror server trying to do this. its now frozen trying to mount a corrupted volume.
Ahh
I've mined the metadata associated with the Superfund documents in my repository. In the same repository I also have a csv with direct links for all the documents; it may make divvying the task a little bit easier. While I want to help with the download process, I don't have the storage for it :(
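If it helps with the divvying, a csv of direct links could be grouped by region directory along these lines. This is a rough sketch only: the column name `url` is an assumption (I have not checked the actual csv header), and the sample rows are illustrative, with only `01/589487.pdf` being a file name that actually appears in this thread.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlparse


def group_by_region(csv_file, url_column="url"):
    """Group document URLs by the directory that follows /work/ in the path."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        url = row[url_column].strip()
        segments = urlparse(url).path.strip("/").split("/")
        # Expect paths like work/<region>/<file>.pdf; anything else is "other".
        if len(segments) >= 3 and segments[0] == "work":
            groups[segments[1]].append(url)
        else:
            groups["other"].append(url)
    return dict(groups)


# Illustrative sample; the second and third file names are made up.
SAMPLE = """url
https://semspub.epa.gov/work/01/589487.pdf
https://semspub.epa.gov/work/HQ/100000.pdf
https://semspub.epa.gov/work/01/100001.pdf
"""

if __name__ == "__main__":
    for region, urls in sorted(group_by_region(io.StringIO(SAMPLE)).items()):
        print(region, len(urls))
```

Each group could then be written to its own text file and handed to whoever takes that region.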
currently pulling 06,07,08,09,10,11,HQ, will update....
update - this server is incredibly fast (2-5M/s) if you are copying to another cloud server. I should have the entire thing in an offline copy by tomorrow. Just in time, before Lil' Scotty Pruitt can throw the whole thing in the dumpster.
offline copy complete except file 01/589487.pdf which appears to have some kind of problem making it impossible to download. it causes lftp mirror to freeze and an individual 'get' doesnt work either
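For files that stall like this one, a downloader that gives up after a timeout and records the failure may be easier to live with than a mirror job that freezes. This is a minimal sketch using only the Python standard library, not what lftp does internally; the function names and the 60-second timeout are my own choices. The timeout applies to each blocking socket operation rather than to the whole transfer, and a failed download can leave a partial file behind.

```python
import os
import shutil
import urllib.request


def fetch(url, dest_dir, timeout=60):
    """Stream one URL into dest_dir and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, url.rstrip("/").rsplit("/", 1)[-1])
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def download_all(urls, dest_dir, fetch=fetch):
    """Try every URL; collect failures instead of stopping at the first one."""
    saved, failed = [], []
    for url in urls:
        try:
            saved.append(fetch(url, dest_dir))
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            failed.append((url, repr(exc)))
    return saved, failed
```

The `failed` list can be written out at the end so the problem files are documented and can be retried later.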
url saved was http://semspub.epa.gov/work/ directories 01 thru 11 and HQ
Great! Can anyone mirror to a public location?
Does anyone have any ftp or https data directories for the following? My internet service was down the entire weekend, but hopefully I'm not too late to catch up.
BTW, if any of these have been resolved in other issues, please disregard....As far as I know, they have not been raised elsewhere. Some may be only tangentially related to climate change research, but I would imagine that they're all subject to tampering by the current administration
Also, the urls I provided just give an overview of the respective datasets; the search queries on these urls are useless for any large-scale or comprehensive downloads
1) https://www.epa.gov/enviro/pcs-icis-search
This search allows you to retrieve selected data from the Permit Compliance System (PCS) and Integrated Compliance Information System (ICIS) databases in Envirofacts regarding facilities registered with the federal enforcement and compliance (FE&C) and holding National Pollutant Discharge Elimination System (NPDES) permits.
2) https://www.epa.gov/enviro/tri-search
The Toxics Release Inventory (TRI) Search retrieves data from the TRI database in Envirofacts.
TRI Search allows access to basic facility information, all forms submitted to EPA since 1987, aggregate chemical release data for all years reported, and relative risk information. The results display any facility that has reported from 1987 to present, even though the facility may or may not have submitted TRI data in the most recent reporting year. The last year of data displayed represents the last year TRI data was reported.
For each facility there is a link to summarized TRI information for years reported, Federal Registry System (FRS) facility information, a corresponding Risk Screening Environmental Indicator (RSEI) report that provides a quantitative, relative estimate of risk posed by the facility based on the chemical released and potential exposure pathways, and a Pollution Prevention (P2) report presenting measures taken to prevent pollution and reduce the amount of toxic chemicals entering the environment. You may narrow your search by filtering through facility name/ID, geographic location, standard industrial classification, and chemical names/CAS numbers.
3) https://www3.epa.gov/enviro/facts/tsca/tsca_search.html
The TSCA Search allows you to retrieve selected data from the Toxic Substances Control Act database in Envirofacts.
4) https://www.epa.gov/enviro/sdwis-search
Information about safe drinking water is stored in SDWIS, the EPA's Safe Drinking Water Information System. SDWIS tracks information on drinking water contamination levels as required by the 1974 Safe Drinking Water Act and its 1986 and 1996 amendments. The Safe Drinking Water Act (SDWA) and accompanying regulations establish Maximum Contaminant Levels (MCLs), treatment techniques, and monitoring and reporting requirements to ensure that water provided to customers is safe for human consumption. The Safe Drinking Water Information System (SDWIS) contains information about public water systems and their violations of EPA's drinking water regulations. Searching SDWIS will allow you to locate your drinking water supplier and view its violations and enforcement history for the last ten years.
5) https://www3.epa.gov/enviro/facts/rcrainfo/search.html
Hazardous waste information is contained in the Resource Conservation and Recovery Act Information (RCRAInfo), a national program management and inventory system about hazardous waste handlers. In general, all generators, transporters, treaters, storers, and disposers of hazardous waste are required to provide information about their activities to state environmental agencies. These agencies, in turn, pass on the information to regional and national EPA offices. This regulation is governed by the Resource Conservation and Recovery Act (RCRA), as amended by the Hazardous and Solid Waste Amendments of 1984. You may use the RCRAInfo Search to determine identification and location data for specific hazardous waste handlers, and to find a wide range of information on treatment, storage, and disposal facilities regarding permit/closure status, compliance with Federal and State regulations, and cleanup activities.
6) https://www3.epa.gov/enviro/facts/radinfo/search.html
The Radiation Information Database (RADINFO) Search allows you to retrieve selected data from RADINFO database in Envirofacts regarding facilities EPA regulates for radiation or radioactivity.
7) https://archive.epa.gov/enviro/html/icr/web/html/icr_query.html
The Information Collection Rule (ICR) query form allows you to retrieve disinfection byproduct and microbial reports at the national, state, and utility levels.
REMINDER:
The ICR data were collected as part of a national research project to support development of national drinking water standards. They should NOT be used to determine local water systems' compliance with drinking water standards, nor should they be used to make personal judgements about individual health risks.
The ICR data were collected from July 1997 to December 1998. All the data have been verified, and the database is complete. EPA will use the data to identify national and regional patterns, not to reach system-by-system or treatment plant-by-treatment plant conclusions. The data that you will see are from individual samples. What EPA will evaluate is the degree to which the samples represent overall water quality.
Several states (and two U.S. territories) did not participate in the ICR: Montana, North Dakota, Vermont, Wyoming, the Virgin Islands, and Guam.
8) https://www.epa.gov/enviro/icis-air-overview
ICIS-AIR contains compliance and permit data for stationary sources of air pollution (such as electric power plants, steel mills, factories, and universities) regulated by EPA, state and local air pollution agencies. The information in ICIS-AIR is used by the states to prepare State Implementation Plans (SIPs) and to track the compliance status of point sources with various regulatory programs under the Clean Air Act.
9) https://www.epa.gov/enviro/frs-query-page
10) https://www.epa.gov/enviro/br-search
The Hazardous Waste Report (Biennial Report) collects data on the generation, management, and minimization of hazardous waste. This provides detailed data on the generation of hazardous waste from large quantity generators and data on waste management practices from treatment, storage, and disposal facilities. The Biennial Report data provide a basis for trend analyses. Data about hazardous waste activities is reported for odd number years (beginning with 1989) to EPA. EPA then provides reports on hazardous waste generation and management activity that accompany the data files.
11) Brownfield maps and grants
Looks like for tri data you can get XML data for a state via: https://iaspub.epa.gov/enviro/efservice/tri_facility/state_abbr/VA/
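Going by the VA link above, the per-state URLs can be generated rather than typed out. This is a small sketch, assuming the other states follow the same `state_abbr/<XX>/` pattern (only VA is confirmed in this thread); the list covers the 50 states plus DC and leaves out the territories.

```python
BASE = "https://iaspub.epa.gov/enviro/efservice/tri_facility/state_abbr/{state}/"

# 50 states plus DC; territories are not included here.
STATES = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA",
    "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY",
    "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX",
    "UT", "VT", "VA", "WA", "WV", "WI", "WY",
]


def tri_state_urls(states=STATES):
    """Return {state: URL} following the pattern of the VA link."""
    return {state: BASE.format(state=state) for state in states}


if __name__ == "__main__":
    for state, url in tri_state_urls().items():
        print(state, url)
```

The printed list can be fed to wget or any downloader that accepts a file of URLs.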
also frs data can be downloaded as a single csv here: https://www3.epa.gov/enviro/html/fii/downloads/state_files/national_single.zip
I also found a data export for ICIS data here: https://iaspub.epa.gov/enviro/efservice/mv_new_geo_best_picks/ I mirrored it here for easy access: https://archive.org/download/enviro_efservice/enviro_efservice.xml
I was also able to download all the Hazardous Waste Report (Biennial Report) data (I think). I found it here: https://iaspub.epa.gov/enviro/efservice/BRS_WASTE_INFORMATION/CSV and mirrored the data here: https://archive.org/download/BRSWASTEINFORMATION
UPDATE: that was only the first 10000 entries; currently trying to get them all.
UPDATE 2: OK, downloaded the first 5,000,001 entries. Not sure how many there are.
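One way around the 10000-entry cutoff is to request the table in fixed-size row windows until a window comes back empty. The sketch below only builds the URLs. It assumes the service accepts a `rows/first:last` path segment before the output format, which is my recollection of the Envirofacts URL scheme and should be checked against the EPA docs, including whether numbering starts at 0 or 1. `row_windows` and `paged_url` are names I made up.

```python
from itertools import islice

BASE = "https://iaspub.epa.gov/enviro/efservice/BRS_WASTE_INFORMATION"


def row_windows(start=0, page_size=10000):
    """Yield inclusive (first, last) ranges: (0, 9999), (10000, 19999), ..."""
    first = start
    while True:
        yield first, first + page_size - 1
        first += page_size


def paged_url(base, first, last, fmt="CSV"):
    """Build a URL for one window, assuming a rows/first:last path segment."""
    return f"{base.rstrip('/')}/rows/{first}:{last}/{fmt}"


if __name__ == "__main__":
    # Print the first three windows; a real run would stop at the first
    # window that returns no data rows.
    for first, last in islice(row_windows(), 3):
        print(paged_url(BASE, first, last))
```

Since the total row count is unknown, stopping at the first empty window avoids having to guess it in advance.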
Trying to grab Superfund documents, but this could be a herculean task, as I cannot find any ftp directory or https database page. Rather, I have to crawl through numerous links from https://semspub.epa.gov/src/search, entering data ONLY in the "region" and "collection type" fields
Unless I'm missing something here and there really is a simpler way to get these, this is an issue that should probably be broken up, say by region (each region, in turn, contains "administrative records" and "special collections", which must be brought up separately)
A further problem as follows: When I get to a page listing all the documents attached to a particular Superfund location, say https://semspub.epa.gov/src/collection/01/AR1 , there is a list of hyperlinks to documents (clicking on "show ALL entries" to save time). When I try direct download for a given link in the browser, the link shows up as a pdf document, but in the downthemall extension, I get nothing
btw, if someone has already tackled this, feel free to close this issue........I can't imagine this collection being safe in the near future