climate-mirror / datasets

For tracking data mirroring progress

Any EPA pages we can save? FAST? #123

Open ghost opened 7 years ago

ghost commented 7 years ago

See http://mobile.reuters.com/article/idUSKBN15906G and https://climatecrocks.com/2017/01/24/trump-to-epa-war-is-peace/

JeremiahCurtis commented 7 years ago

See issue #279 for ftp://newftp.epa.gov/ . I'm not sure where epadatacommons stands, as some of the posts are unclear.

TechMaz commented 7 years ago

OK, great. Also, a backup of https://edg.epa.gov/data/ is slowly being added here: https://archive.org/download/egd_epa_gov_data_PUBLIC_bkp

JeremiahCurtis commented 7 years ago

Actually, if anyone with a fast connection could pick up ftp://newftp.epa.gov/RSEI/ , it would help immensely. I've been on and off ftp://newftp.epa.gov/RSEI/Version233_RY2012/Aggregated_Grid_Cell_Data/ on visual wget, and the bigger files are downloading ridiculously slowly. If anyone knows of an http or https directory for the ftp://newftp.epa.gov/ data, that would also help. I seem to be getting marginally better download speeds on https directories (https://www1.ncdc.noaa.gov/pub/data/ , for example) than on ftp ones. I don't know whether that is due to overload on the ftp directories or something else.
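For anyone fighting slow or interrupted transfers of the bigger files, here is a minimal sketch of a resumable download over http/https, assuming the server honours HTTP `Range` requests (not every server does). The function names are made up for illustration and are not part of any existing mirror tooling. It does not handle ftp:// URLs.

```python
import os
import urllib.request


def resume_offset(path):
    """Return how many bytes of `path` already exist locally (0 if none)."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def range_header(offset):
    """Build an HTTP Range header asking for everything from `offset` onward."""
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def download_with_resume(url, path, chunk_size=1 << 20):
    """Fetch `url` into `path`, continuing a partial file when the server allows it.

    Note: if the local file is already complete, a server will usually answer
    the Range request with 416, which urllib raises as an HTTPError.
    """
    offset = resume_offset(path)
    request = urllib.request.Request(url, headers=range_header(offset))
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the Range header; anything else
        # (typically 200) means it is sending the whole file from the start.
        resumed = offset > 0 and response.status == 206
        with open(path, "ab" if resumed else "wb") as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
    return os.path.getsize(path)
```

Calling `download_with_resume(url, "some_file.csv")` again after a dropped connection should pick up where the partial file ends instead of starting over. `wget --continue` does the same job from the command line.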

JeremiahCurtis commented 7 years ago

Referring to issue #279, the primary folder on ftp://newftp.epa.gov/ that was not yet downloaded was the RSEI folder. I don't know if this helps, but I stumbled across the following: https://www.epa.gov/rsei/ways-get-rsei-results#microdata . This page refers to https://sites.sesync.org/rsei/ and https://pages.awscloud.com/public-data-sets-epa-rsei.html .

I don't know if the lack of a .gov domain means the RSEI Version 2.3.4 - All Years Dataset (1988-2014, by year) is safe or not; perhaps someone else can explain further. The RSEI Version 2.3.5 - Most Current Three Years Dataset (2013-2015, by year) is apparently available only via ftp://newftp.epa.gov/ .

This might be quicker to download than the ftp directory, and it looks like the same dataset.

I don't have the space to download ftp://newftp.epa.gov/RSEI/ , unfortunately

gofrogs2013 commented 7 years ago

@mxplusb Just saw your post here from a couple weeks ago about contacting the EPA. In the above posts Jeremiah mentions issue #279 where we have had issues with slow connection speeds, particularly on 3 CSV files that are over 100 GB each. One user says it would take 9 days to download one of those files, which is the same as others have been getting, and we'll have to see if this person is successful in downloading one of them. In the meantime, I've filed a FOIA request for all data at newftp.epa.gov. My hope is that I can get it all on an external HD. You can see my request at the link below and feel free to let me know if it looks good or if I should make any amendments to the request. https://foiaonline.regulations.gov/foia/action/public/view/request?objectId=090004d281137e25
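To put the "9 days per file" figure in perspective, here is a quick back-of-the-envelope calculation. It assumes a file of exactly 100 GB (decimal, 10^9 bytes per GB). Since the files are described as over 100 GB, the real rate is somewhat higher than this lower bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def implied_throughput(size_bytes, days):
    """Average bytes per second needed to move `size_bytes` in `days`."""
    return size_bytes / (days * SECONDS_PER_DAY)


def days_to_download(size_bytes, bytes_per_second):
    """Days needed to move `size_bytes` at a steady `bytes_per_second`."""
    return size_bytes / bytes_per_second / SECONDS_PER_DAY


# Assumption: 100 GB file, 9 days, as reported in the thread.
rate = implied_throughput(100e9, 9)
print(f"{rate / 1e3:.1f} kB/s, {rate * 8 / 1e6:.2f} Mbit/s")
# prints "128.6 kB/s, 1.03 Mbit/s"
```

So the server is delivering roughly 1 Mbit/s per connection, which suggests the bottleneck is on the EPA side rather than on anyone's home connection.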

JeremiahCurtis commented 7 years ago

Not necessarily directly pertaining to climate change, but likely in danger nonetheless:

1) Toxics Release Inventory: https://www.epa.gov/toxics-release-inventory-tri-program/tri-basic-data-files-calendar-years-1987-2015 (simple CSV files that are giving my browser fits when downloading). More detailed data is at https://www.epa.gov/enviro/tri-search . I'm not sure if there is an ftp directory for this information.
2) Substance Registry Services: at https://iaspub.epa.gov/sor_internet/registry/substreg/searchandretrieve/substancesearch/search.do , I don't see any viable options for downloading the registry in a single shot. Someone else might have some more knowledge on this one?

Will add more when the ORNL download finishes; it's tying up a ton of my bandwidth. But if you go to https://www.epa.gov/enviro/tri-search you will see a box on the right titled "System Data Searches", with the following links:

Multisystem
BR
Brownfields/Cleanups
Cleanups
ECHO/IDEA
FRS
    EZ Search
    Organization Search
Greenhouse Gas
    Customized Search
ICIS
ICIS-AIR
ICR
IGMS
Locational Information
    Locational Search
PCS
    Customized Search
RADInfo
RadNet
    Customized Search
RCRAInfo
SDWIS
SEMS
SRS
TRI
    TRI Explorer
    TRI Search
    Form R Search
    Form R & A Download
    EZ Search
    Customized Search
    Pollution Prevention
TSCA
UV Index

These would contain the datasets most at risk of elimination. I don't know if they're contained in the ftp or newftp folders or not.

TechMaz commented 7 years ago

Note: I'm almost done uploading all of the content from scraping edg.epa.gov/data/PUBLIC . In the meantime, I have been uploading some small relevant subsets of the data here:

EQI Technical reports from edg.epa.gov/data/PUBLIC/ORD/NHEERL/EQI here: https://archive.org/details/EQITECHNICALREPORTFINAL

Leaking Underground Storage Tank Documents from edg.epa.gov/data/PUBLIC/R9/leakingUST here: https://archive.org/details/EPALeakingUST

Navajo Nation Atlas Radiation Data from edg.epa.gov/data/PUBLIC/R9 here: https://archive.org/details/NavajoNationAtlas_Radiation

TechMaz commented 7 years ago

OK, the finished archive of the web crawl of edg.epa.gov/data can be found here: https://archive.org/details/egd_epa_gov_data_PUBLIC_bkp . Some files failed to download because of server errors or because they were missing.

ghost commented 7 years ago

Excellent. Thank you. The Azimuth Backup Project also has two independently derived copies.

On Thu, Feb 16, 2017, at 22:24, Steven Mazliach wrote:

Ok finished archive of web crawl of edg.epa.gov/data can be found here: https://archive.org/details/egd_epa_gov_data_PUBLIC_bkp Some files failed to download due to server error or being missing

TechMaz commented 7 years ago

@empirical-bayesian Good to know. Do they have the ftp data or the web-crawled version? There are different files in each, some in common.
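One way to see how the ftp copy and the web-crawled copy overlap is to compare their file listings as sets. This is only a sketch: it assumes each mirror can produce a list of relative paths, and the paths in the example are invented placeholders, not real files from either archive.

```python
def compare_manifests(ftp_files, crawl_files):
    """Split two path listings into shared and mirror-specific files."""
    ftp_set, crawl_set = set(ftp_files), set(crawl_files)
    return {
        "common": sorted(ftp_set & crawl_set),
        "ftp_only": sorted(ftp_set - crawl_set),
        "crawl_only": sorted(crawl_set - ftp_set),
    }


# Hypothetical listings, for illustration only.
ftp_listing = ["a/report.pdf", "a/data.csv", "b/grid.zip"]
crawl_listing = ["a/report.pdf", "c/index.html", "b/grid.zip"]

result = compare_manifests(ftp_listing, crawl_listing)
print(result["ftp_only"])    # files only the ftp copy has
print(result["crawl_only"])  # files only the web crawl has
```

This compares by path only. Confirming that the files in common are actually identical would need a checksum comparison as well.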

TechMaz commented 7 years ago

Also there are some other EPA pages we should look at, regarding Superfund projects. Mentioned here: #329

TechMaz commented 7 years ago

Has anyone tackled the EPA publication lists? http://nepis.epa.gov/EPA/html/pubindex.html

hkuchampudi commented 7 years ago

@TechMaz The EPA publication lists are a bit trickier because they save each page of each document as a TIF image, but I created a script and I am slowly mining the relevant metadata for each of the documents. Once I have the metadata, I can host it and create a Python script for someone who has the storage to download the documents.