climate-mirror / datasets

For tracking data mirroring progress

Any EPA pages we can save? FAST? #123

Open ghost opened 7 years ago

ghost commented 7 years ago

See http://mobile.reuters.com/article/idUSKBN15906G and https://climatecrocks.com/2017/01/24/trump-to-epa-war-is-peace/

nickrsan commented 7 years ago

Some of the raw data is in the Internet Archive, but the data portals aren't - if you have the means/time to do some scripting against those applications (or a verification to make sure the data is already available via Internet Archive downloads), I think that may be a priority.

nickrsan commented 7 years ago

Here's history of one portion of the critical sections: http://web.archive.org/web/*/https://www.epa.gov/ghgreporting/ghg-reporting-program-data-sets

geppy commented 7 years ago

@nickrsan I have the means and the time. What data portals are there (aside from the GHG Reporting portal)? Do we just need to make sure these index pages are backed up so that it's practical to navigate the raw data that's already been backed up?

nickrsan commented 7 years ago

My guess is that it's more complicated than that - I think if we know we have the raw data, we're in good shape, but anything we can salvage from the portal is also important, I think. I haven't confirmed if the data in the portal is the same as the data that's listed by year in the links on the data archive tab - that might be the place to start?

nickrsan commented 7 years ago

To clarify (sorry, having discussion in multiple places about this), I think a number of people are concerned specifically about FLIGHT (https://ghgdata.epa.gov/ghgp/main.do) if the main page goes down, but there may be other datasets in there at risk too

geppy commented 7 years ago

Alright, I'll start with FLIGHT.

nickrsan commented 7 years ago

Thank you - if there's anything I can do to help, let me know and I'll either try to assist myself, or find someone who can!

ghost commented 7 years ago

New pages we DON'T have, that is.

Azimuth Backup Project has: geodata.epa.gov and www.epa.gov/superfund/superfund-data-and-reports

I have just initiated THREE tickets, at Azimuth Backup, Issues 87, 88, and 89, for Web of:

epa.gov/climatechange (#87)

epa.gov/warm (#88) [the WARM model]

ftp.epa.gov (#89) [I don't know how much of this is CO2 and climate related, and tried to find out, so I'm going for it all]

I don't know if we'll make this. We are severely time limited here.

nickrsan commented 7 years ago

Thanks for the update Jan - that's good to know - I'll split out ftp.epa.gov to issues here and see if we can get people to take individual ones

ghost commented 7 years ago

We need to be careful not to clog each other up! If everyone goes after, say, ftp.epa.gov, we'll saturate the server's bandwidth.

nickrsan commented 7 years ago

yeah - that's what I'm thinking too - on the web side, might be worth checking Internet Archive for those locations before prioritizing. I bet they have them (another copy doesn't hurt, but just a thought)
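A rough sketch of that Internet Archive check, for anyone who wants to script it. It assumes the public Wayback Machine availability endpoint (`https://archive.org/wayback/available`) and its `archived_snapshots.closest` response shape; the helper names are made up for illustration, and `check` is only defined here, not called, since it needs network access.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public Wayback Machine availability API.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(target):
    """Build the availability query URL for one target page."""
    return WAYBACK_API + "?" + urllib.parse.urlencode({"url": target})


def closest_snapshot(response):
    """Return (timestamp, snapshot_url) from a decoded API response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def check(target, timeout=30):
    """Query the API over the network for one target URL (hypothetical helper)."""
    with urllib.request.urlopen(availability_url(target), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))


# Example call (requires network access):
#   for site in ["epa.gov/climatechange", "epa.gov/warm"]:
#       print(site, check(site))
```

A hit only shows that the page itself was captured at least once. It says nothing about whether the data behind a portal or the files linked from the page were captured too, so it is a way to prioritize, not a verification.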

geppy commented 7 years ago

Okay, I think I'm understanding the workflow: I'm going with brozzler+warcprox. Let me know what you want me to do.

@nickrsan I haven't gotten a Slack invite yet, if it's easier to talk there.

geppy commented 7 years ago

(I'm me@geppy.im.)

ghost commented 7 years ago

So ftp.epa.gov is being mirrored using an Internet2 link. Details later.

matthewberryman commented 7 years ago

I'm using httrack to grab all of the www.epa.gov web content (.epa.gov/)—obvs. not all of the dynamic content will work, nor will I have data sets aside from those linked, but seems like those are being covered. That content is delivered over Akamai but I'm crawling it reasonably slowly anyway. I'll post a link when done (and after I've had dinner, etc.)

geppy commented 7 years ago

@matthewberryman If we use these tools we can get signed archives. It looks like others are fetching data with signatures intact, but they're experiencing heavy load. Could you coordinate with the people in that thread, and with @justinribeiro?

matthewberryman commented 7 years ago

At a quick glance, a couple of those tools require graphical environments (or docker builds), and I'm low on time to spin up something different (sorry), so I'll leave httrack (simple and effective with a long track record, IMHO) running for now on a server.

detrout commented 7 years ago

Grabbed the supporting pdfs and most recent warm excel worksheets from www.epa.gov/warm/ that I could find by manually browsing the site

https://drive.google.com/open?id=0B76qh7pWLKB3UjRDbUktbXB5R0U

ghost commented 7 years ago

@matthewberryman Thank you. The Azimuth Backup Project is also trying to take copies using both httrack and wget, since both have weaknesses, although orthogonal ones. These are really blunt instruments when used without site structure guidance and in a hurry. But what else can we do? I can't expect that anyone would think we can just stand up the sites with the same look and feel. This effort has always primarily been about saving data.

ghost commented 7 years ago

Can people working on this report in? We are seeing that the EPA server is under very heavy load, and I worry that in our collective effort to help we won't get one complete copy. There's no problem having a couple of copies, but wanting them all at once ...

lrehmann commented 7 years ago

I've found wget -mk to give a good mirror copy. Here's how to intelligently use it: http://www.createdbypete.com/articles/make-a-local-website-mirror-with-wget/

ghost commented 7 years ago

The Azimuth Backup Project's Issue #87 has been marked as a duplicate of Issue #78, where work is continuing on https://www.epa.gov/climatechange

detrout commented 7 years ago

I have two wgets running www.epa.gov/energy/ and www.epa.gov/warm/

wget --mirror --warc-file= --warc-cdx --page-requisites --html-extension --convert-links --execute robots=off --directory-prefix=. --domains www.epa.gov --user-agent=Mozilla --wait=5 --random-wait

I was hoping the --wait delay would be enough to slow things down. I can abort if you think it's a good idea.

Probably the more important thing was I copied the pdf and excel files for the warm model. Not sure if the website mirror tools download the supporting files.
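In case it helps with agreeing on a shared set of flags, here is a small sketch that assembles the same wget invocation as an argument list, one call per site. The flags are the ones listed above. The WARC name `epa-warm` and the full `https://www.epa.gov/warm/` URL are assumptions filled in for illustration, because the command as posted shows neither, and `wget_mirror_command` is a hypothetical helper name.

```python
import shlex


def wget_mirror_command(url, warc_name, domain, wait_seconds=5):
    """Return the argv list for a WARC-producing wget mirror of one site."""
    return [
        "wget",
        "--mirror",
        f"--warc-file={warc_name}",   # assumed name; the posted command leaves it blank
        "--warc-cdx",
        "--page-requisites",
        "--html-extension",
        "--convert-links",
        "--execute", "robots=off",
        "--directory-prefix=.",
        "--domains", domain,
        "--user-agent=Mozilla",
        f"--wait={wait_seconds}",     # base delay between requests, in seconds
        "--random-wait",
        url,
    ]


# Illustrative values only: the URL and WARC name are assumptions.
cmd = wget_mirror_command("https://www.epa.gov/warm/", "epa-warm", "www.epa.gov")
print(shlex.join(cmd))
```

As far as I know, `--random-wait` varies the delay between 0.5 and 1.5 times the `--wait` value, so `--wait=5` means roughly 2.5 to 7.5 seconds between requests. Raising `wait_seconds` is the simplest knob if the server looks overloaded.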

geppy commented 7 years ago

@empirical-bayesian I'm backing off, as it looks like you've got WARM under control. What isn't being archived?

titaniumbones commented 7 years ago

Hey, so https://github.com/edgi-govdata-archiving/ has been working on the EPA site for a while. We've seeded a lot of the Internet Archive, so most easily-crawlable resources are already saved by the Internet Archive. We've gotten some of the harder-to-crawl stuff, esp. the EIS collection, and we're also sort of working on some other stuff, like [the Superfund site data](https://github.com/edgi-govdata-archiving/epa-quantitative). Some of that code might help when you're trying to build harvesters for similar resources. We have a list of resources that need downloading -- it's not publicly available right now -- I can try to share it somewhere public if that would be helpful. @dcwalk is maybe the person who is most on top of this right now. If you can help with this stuff we'd be super psyched & we're happy to provide what resources we can.

detrout commented 7 years ago

It's almost like we need a network of caching proxy servers and do the mirroring through that.

dcwalk commented 7 years ago

Don't have much more to add than @titaniumbones right now, hopefully an update shortly and ping me in the meantime if there are questions.

ghost commented 7 years ago

I say take 'em, Diane. It's good to have two copies. I don't have mastery (by any means) of the options on wget, and my options are different than yours, so maybe yours will be a better mirror.

ghost commented 7 years ago

You should all be proud to know that we mattered: http://www.eenews.net/greenwire/2017/01/25/stories/1060048975

Keep it up!

detrout commented 7 years ago

@empirical-bayesian I aborted before I saw your message. However the two warc files I have are pretty big, so I suspect I got the subpages and quite a bit more.

778M Jan 25 11:02 energy/energy.warc.gz
421M Jan 25 11:02 warm/epa-warm.warc.gz

I'm going to spend a bit of time learning about some warc tools to see if I can figure out what's actually in the archives.
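For a first look without installing anything, a WARC file can be walked with the Python standard library. This is only a sketch under a few assumptions: that the file is a gzip-compressed series of records, each with a `WARC/...` version line, `Name: value` headers, a blank line, and a payload of `Content-Length` bytes; and that target URIs may or may not be wrapped in angle brackets. The function names and the two hand-built demo records are invented for illustration, and a dedicated WARC library would be the more robust choice for real analysis.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload_length) for each record in a decompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line at record start: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        remaining = length
        while remaining > 0:  # skip the payload in chunks instead of loading it
            chunk = stream.read(min(remaining, 1 << 20))
            if not chunk:
                break
            remaining -= len(chunk)
        yield headers, length


def list_captures(source):
    """Return (target URI, payload bytes) for every 'response' record.

    `source` is a path to a .warc.gz file or an open binary file object.
    """
    with gzip.open(source, "rb") as stream:
        return [
            ((h.get("warc-target-uri") or "").strip("<>"), n)
            for h, n in iter_warc_records(stream)
            if h.get("warc-type") == "response"
        ]


def _demo_record(kind, uri, body):
    """Build one gzip-compressed record by hand (demo data, not a real capture)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {kind}\r\n"
        f"WARC-Target-URI: <{uri}>\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return gzip.compress(head + body + b"\r\n\r\n")


sample = (
    _demo_record("request", "http://example.invalid/", b"GET / HTTP/1.1\r\n\r\n")
    + _demo_record("response", "http://example.invalid/", b"HTTP/1.1 200 OK\r\n\r\nhello")
)
captures = list_captures(io.BytesIO(sample))
print(captures)
```

Calling `list_captures("warm/epa-warm.warc.gz")` on a real file should give one entry per fetched URL, which is enough to see whether the PDF and Excel files made it into the crawl. The payload length includes the HTTP response headers, so it runs slightly larger than the file itself.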

siennathesane commented 7 years ago

Million-dollar question: have we contacted the EPA to see if they can assist us with getting the data?

adinbied commented 7 years ago

A bit late to the party, but are there any pages I can help out with? It seems like a lot of the large pages are being handled by other people.

geppy commented 7 years ago

@adinbied Yes! We're talking in Slack: @mxplusb is preparing to go through and clean up the issues so they're easier to deal with. Have you filled out the Slack invite form? If so you might want to ping @nickrsan via the other contact methods.

matthewberryman commented 7 years ago

Hi everyone, just checking in: I'll definitely look at the other tools for future reference (thanks @geppy). In my case I have a bunch of well-connected Linux VMs running (and limited time due to family commitments), so it was just easier in this case (for me) to run httrack. It has some intelligence around rates, and other defaults built in that are fiddly to set up with command line flags in wget. Perhaps it's worth agreeing on and documenting a set of flags to use? Thanks @detrout for listing yours, they seem sensible to use, though as always both tools have limitations, as @empirical-bayesian points out. One limitation I've noticed is that when httrack is processing pages as part of its crawl, it's not parallel (maxing out at 1 core).

@empirical-bayesian I think there's value in capturing web pages on top of data, from a sci comms perspective as well as for the context they give the data; I have seen some research data sets so badly documented as to be unusable.

@detrout the main EPA web site www.epa.gov and the Spanish translation are on Akamai's CDN, so that helps a bit, but the ftp site and alternate www3.epa.gov amongst others are not.

I'm up to 13.6GB and counting. As of writing, though, the climate change page is still up.

ghost commented 7 years ago

ftp.epa.gov was copied by The Azimuth Backup Project:

FINISHED --2017-01-25 20:20:15--
Total wall clock time: 1d 22h 23m 27s
Downloaded: 79109 files, 690G in 1d 9h 58m 10s (5.78 MB/s)
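As a sanity check on that summary, the reported rate can be reproduced from the other two figures. The sketch below assumes wget's "G" and "MB" are 1024-based units, which is the reading that makes the numbers agree; the variable names are just for illustration.

```python
def seconds(days=0, hours=0, minutes=0, secs=0):
    """Convert a wget-style duration (1d 9h 58m 10s) to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + secs


transfer_time = seconds(days=1, hours=9, minutes=58, secs=10)   # time spent downloading
wall_clock = seconds(days=1, hours=22, minutes=23, secs=27)     # total elapsed time

downloaded_mib = 690 * 1024            # assumption: 690G means 690 * 1024 MiB
rate_mib_s = downloaded_mib / transfer_time
busy_fraction = transfer_time / wall_clock

print(f"{rate_mib_s:.2f} MiB/s while transferring")
print(f"{busy_fraction:.0%} of the wall clock time spent transferring")
```

The gap between transfer time and wall clock time is presumably connection setup, directory listings, and waiting on the server rather than moving bytes.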

We have decided not to share the mirror link yet, and will do so when we share all our mirror links. This is to preserve "operational integrity." That means that when all the datasets are replicated at other locations in addition to the mirror site, we'll reveal the links on the mirrors. We've thought about this carefully, and think it a prudent step, particularly since the effort at large has made the limelight. It's good to publicize some things, not so much others.

We are also mirroring https://www.epa.gov/climatechange and https://www.epa.gov/warm. We are doing this using both httrack and wget.

For httrack, https://www.epa.gov/climatechange is currently at 5.8 GB and https://www.epa.gov/warm is currently at 13 GB. For wget, https://www.epa.gov/climatechange is currently at 11 GB and https://www.epa.gov/warm is currently at 7 GB.

ghost commented 7 years ago

That's really dicey for them, as you might imagine. Might be the difference between having a job or not.

On the other hand, while I wish everyone there full employment, the administration should know, that, just as in the case of a mass firing from a private firm, people who have been forced out are likely to be highly motivated to use their skills in ways which both further their own purposes and, ahem, say make life difficult for their former employer.

I hope people in the administration are well aware of that.

colinbeier commented 7 years ago

true… but not nearly as dicey if the data transfer was completed by the book and prior to Jan. 20. :)

BethTrask commented 7 years ago

Hi everyone: Does anyone know if this data is being captured: https://www.epa.gov/air-emissions-inventories/national-emissions-inventory-nei

I can't find a current Issue for it.

matthewberryman commented 7 years ago

@BethTrask the data files linked from the data pages linked from https://www.epa.gov/air-emissions-inventories/national-emissions-inventory-nei are on the ftp site, and all of the ftp site has been saved https://github.com/climate-mirror/datasets/issues/123#issuecomment-275296533

ghost commented 7 years ago

In addition to Matthew's copy, The Azimuth Backup Project is also replicating this dataset. Note however that we found the following:

Emissions Inventory System (EIS) Gateway

The EIS Gateway, the first component of the Emissions Inventory System (EIS), was developed to provide registered EPA, State, local and Tribal users with access to emissions inventory data. Registered EPA, State, local and Tribal users can access facility inventory and emissions data for sources in their jurisdiction. The EIS Gateway allows users to manage their profile information to: add, view and edit facility inventory information for their agency; extract data by running reports; access reporting codes and; request support from the EPA through a central message center. For EPA, State, Local, and Tribal users who want access to the EIS Gateway, follow the steps outlined in the EIS Users Manual found at How Do I Request Access to the EIS GAteway. ONLY EPA staff, State, local and Tribal agency staff will be provided access to the EIS Gateway.

That is, there is a portion of the site which is inaccessible to the public. Unsurprisingly, this is because this Gateway allows changes to data, and should not be made available to the public.

I point it out simply to indicate that the site replication will necessarily be incomplete -- and I would highly advise no one to try and crack it -- but also to indicate from a legal perspective, even if I am not an attorney, that it demonstrates an agency could, if it sought to, lock down data from the public. That this kind of device is not imposed on the datasets ClimateMirror and allies are copying means, in addition to Executive Orders, Open Data law, and Copyright law of the U.S., a claim of this being non-public information would be that much more difficult to defend.

matthewberryman commented 7 years ago

Here are .tar.bz2 files of directories of the web content I've saved: www.epa.gov (~50GB) and climate.nasa.gov (~2GB). SHA512's at https://gist.github.com/matthewberryman/4e035ffe94b54e9871cd7d82341a0b2d
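For anyone checking a download against the published sums, a sketch along these lines should do. It assumes the checksum listing uses the usual `sha512sum` layout (hex digest, whitespace, filename), which has not been confirmed for the gist above; the function names and the `demo.tar.bz2` entry are made up for illustration.

```python
import hashlib
import io


def sha512_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so a ~50GB archive never sits in memory."""
    digest = hashlib.sha512()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha512_of_file(path):
    """Hash a file on disk by streaming it through sha512_of_stream."""
    with open(path, "rb") as handle:
        return sha512_of_stream(handle)


def parse_checksum_lines(text):
    """Parse 'HEXDIGEST  filename' lines (assumed sha512sum style) into {filename: digest}."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # A leading '*' marks binary mode in sha512sum output; drop it.
            expected[parts[1].lstrip("*").strip()] = parts[0].lower()
    return expected


# In-memory demonstration with made-up content and a made-up filename.
demo_digest = sha512_of_stream(io.BytesIO(b"example bytes"))
listing = parse_checksum_lines(demo_digest + "  demo.tar.bz2\n")
print(listing["demo.tar.bz2"] == demo_digest)
```

With real files, compare `sha512_of_file(path)` against the entry for that filename in the parsed listing and treat any mismatch as a corrupted or altered download.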

BethTrask commented 7 years ago

Thank you, @matthewberryman and @empirical-bayesian

TechMaz commented 7 years ago

@detrout I mirrored your epa.gov/warm dataset onto archive.org: https://archive.org/download/warm_v14_construction_demolition_materials

detrout commented 7 years ago

Thanks @TechMaz, apparently in the future I should use a better filename than SHA256SUM

ghost commented 7 years ago

Just a caution from here on out: We have our first documented report of Web pages at the EPA being altered, so data from there should hereafter be considered suspect, and the only data which should be considered acceptable without very careful examination is data grabbed on 1st February 2017 or earlier.
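One way to apply that cutoff mechanically is to filter captures by timestamp. The sketch below assumes 14-digit Wayback-style timestamps (`YYYYMMDDhhmmss`) interpreted as UTC, and treats the whole of 1 February 2017 as inside the window; the caution above does not specify a time zone, so that is a judgment call. The two sample timestamps are illustrative, not real capture records.

```python
from datetime import datetime, timezone

# Assumed cutoff: end of 1 February 2017, UTC.
CUTOFF = datetime(2017, 2, 1, 23, 59, 59, tzinfo=timezone.utc)


def parse_capture_timestamp(stamp):
    """Parse a 14-digit YYYYMMDDhhmmss timestamp, assumed to be UTC."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def captured_before_cutoff(stamp, cutoff=CUTOFF):
    """True if the capture was made at or before the cutoff."""
    return parse_capture_timestamp(stamp) <= cutoff


# Illustrative timestamps only.
stamps = ["20170125110200", "20170203190113"]
trusted = [s for s in stamps if captured_before_cutoff(s)]
print(trusted)
```

Captures after the cutoff are not necessarily wrong, but under this rule they would need to be compared against an earlier copy before being relied on.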

JeremiahCurtis commented 7 years ago

can anyone grab https://edg.epa.gov/data/ ?

ghost commented 7 years ago

Just for jollies, I tried, with a wget, and found, in short order:

[jan@azi03 local_data]$ tail edg-epa-gov.log
2017-02-03 19:00:50 URL: https://edg.epa.gov/DataUtils/images/folderClosed.gif [933/933] -> "edg.epa.gov/DataUtils/images/folderClosed.gif" [1]
Last-modified header missing -- time-stamps turned off.
2017-02-03 19:00:54 URL:https://edg.epa.gov/data/Public/ [17001] -> "edg.epa.gov/data/Public/index.html" [1]
2017-02-03 19:01:01 URL: https://edg.epa.gov/DataUtils/images/seal-bottom.png [3979/3979] -> "edg.epa.gov/DataUtils/images/seal-bottom.png" [1]
https://edg.epa.gov/data/XXXX:
2017-02-03 19:01:05 ERROR 404: Not Found.
2017-02-03 19:01:13 URL: https://edg.epa.gov/DataUtils/images/Banner_Data.jpg [102139/102139] -> "edg.epa.gov/DataUtils/images/Banner_Data.jpg" [1]
FINISHED --2017-02-03 19:01:13--
Total wall clock time: 1m 15s
Downloaded: 10 files, 825K in 13s (64.1 KB/s)

ghost commented 7 years ago

I get a "503" error code.

On Sat, Feb 4, 2017, at 16:26, Steven Mazliach wrote:

Is it just me or did https://edg.epa.gov/data/ just get taken down?

TechMaz commented 7 years ago

Never mind. I deleted my last comment because it seems to just be blocked for my IP. Crawling now from a different IP.

TechMaz commented 7 years ago

Note: There is also a corresponding FTP site (ftp://newftp.epa.gov/epadatacommons) that has similar info but many different files and a different structure. Has this been archived yet? If not, does someone want to do that?