climate-mirror / datasets

For tracking data mirroring progress

newftp.epa.gov #279

Open Serubin opened 7 years ago

Serubin commented 7 years ago

Current ftp contents:

899M ./AIR_QUALITY_DATA
0 ./CAM_HRA
2.3G ./CERCLA108B
406G ./COMPTOX
406G ./Computational_Toxicology_Data (Looks like a duplicate of the above)
2.2G ./EJSCREEN
33G ./EPADataCommons
44G ./GKM_DOCUMENTS
1.0T ./RSEI
7.5G ./RTPGIS
62M ./STANDARD_MINE
1.0K ./TESTAREA
1.9T .

Currently pulled down on my machine:

899M ./AIR_QUALITY_DATA
31M ./GKM_DOCUMENTS
2.2G ./EJSCREEN
14G ./EPADataCommons
2.3G ./CERCLA108B
4.0K ./CAM_HRA
32G ./COMPTOX
52G .

I intend to make my mirror public, but that may have to wait until the weekend.

Plazmaz commented 7 years ago

Looks like http://newftp.epa.edu/ is down

mheistermann commented 7 years ago

@Plazmaz it's ftp://newftp.epa.gov/

Serubin commented 7 years ago

Updated current download status. If anyone wants to start downloading other parts of this, feel free - it's rate limited at 500kb/s so this is a pretty slow process.

JeremiahCurtis commented 7 years ago

tried wget but it stopped because of login issues

JeremiahCurtis commented 7 years ago

--11:42:18-- ftp://newftp.epa.gov/EPADataCommons/
(try:20) => `C:/Users/user/Music/newftp.epa.gov/EPADataCommons/.listing'
Connecting to newftp.epa.gov|134.67.100.58|:21... connected.
Logging in as anonymous ... The server refuses login.
Giving up.

unlink: No such file or directory

FINISHED --11:42:18--
Downloaded: 0 bytes in 0 files

Serubin commented 7 years ago

@JeremiahCurtis Give it another try. That happens every so often.

These still need to be downloaded. The RSEI directory looks daunting - might split that up a bit.

1.0T ./RSEI
7.5G ./RTPGIS
62M ./STANDARD_MINE
1.0K ./TESTAREA

JeremiahCurtis commented 7 years ago

ftp://newftp.epa.gov/RSEI/Version233_RY2012/Aggregated_Grid_Cell_Data/

working on the above csv files; since wget is having problems, i am doing direct downloads

JeremiahCurtis commented 7 years ago

this may take a while

JeremiahCurtis commented 7 years ago

direct download not working either...not sure what's up

Serubin commented 7 years ago

It appears the server is gone. ftp://ftp.epa.gov is still up

Serubin commented 7 years ago

Final data count:

15G ./newftp.epa.gov
899M ./AIR_QUALITY_DATA
2.1G ./GKM_DOCUMENTS
2.2G ./EJSCREEN
2.3G ./CERCLA108B
4.0K ./CAM_HRA
36G ./COMPTOX
57G .

JeremiahCurtis commented 7 years ago

now the direct download is working again...

JeremiahCurtis commented 7 years ago

there are 3 massive csv files at ftp://newftp.epa.gov/RSEI/Version233_RY2012/Disaggregated_Microdata/

each is about 110 GB

Serubin commented 7 years ago

@JeremiahCurtis Pull down whatever you can - I'm unable to access

JeremiahCurtis commented 7 years ago

working on it

JeremiahCurtis commented 7 years ago

direct download is kind of ineffective for a 110 GB file, though. If my browser crashes, I have to start all over....any ideas?

JeremiahCurtis commented 7 years ago

I'm also running downthemall on thousands of files from a lot of the directories at http://cdiac.ornl.gov/ftp/. This doesn't help direct download speeds, but if someone can confirm that the above ftp has been completely mirrored, I will end the dta session and that should speed up direct download.....thanks

Serubin commented 7 years ago

Using wget might be a good idea.

The download rates are limited to about 500kb/s

JeremiahCurtis commented 7 years ago

ftp://newftp.epa.gov/RSEI/Version233_RY2012/Disaggregated_Microdata/ is anyone else able to access?

lgreenlee commented 7 years ago

I'm looking at this - it looks like the server is reaching its connection limits. Aria2 might be a good option for fast downloads.

On Thu, Jan 26, 2017 at 12:28 PM JeremiahCurtis notifications@github.com wrote:

ftp://newftp.epa.gov/RSEI/Version233_RY2012/Disaggregated_Microdata/ is anyone else able to access?

Serubin commented 7 years ago

I think I've hit my connection limits - I've got to bow out. I've got some amount of data that I can pass off to anyone - or I am happy to grab data from someone who downloaded to try and host the data somewhere.

adinbied commented 7 years ago

While it's not DOWN for me, it's requiring a username and password to connect.

lrehmann commented 7 years ago

The server is responding with

421 Maximum login limit has been reached

Various clients give different messages when the server cannot be reached with the default anonymous credentials. Chrome asks for a username and password when in fact the anonymous credentials are still valid; the server is just overwhelmed.
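For scripted downloads, the 421 reply can be treated as "server busy, try again later" rather than as a credential failure. A minimal Python sketch using the standard `ftplib` module, assuming a 421 always means the login limit was hit; the function name, retry count, and delays are illustrative choices, not anything the server documents:

```python
import ftplib
import time


def anonymous_login(host, connect=ftplib.FTP, retries=5, base_delay=30.0, sleep=time.sleep):
    """Log in anonymously, backing off while the server answers 421.

    `connect`, `retries`, `base_delay` and `sleep` are hypothetical knobs
    for this sketch; tune them to how patient you want to be.
    """
    for attempt in range(retries):
        try:
            ftp = connect(host)  # a 421 greeting raises error_temp here
            ftp.login()          # no arguments means user "anonymous"
            return ftp
        except ftplib.error_temp as exc:
            # error_temp covers every 4xx reply; only 421 means "too busy".
            if not str(exc).startswith("421"):
                raise
            sleep(base_delay * 2 ** attempt)  # 30 s, 60 s, 120 s, ...
    raise ConnectionError(f"{host} still refused login after {retries} tries")
```

Something like `ftp = anonymous_login("newftp.epa.gov")` would then wait out the busy periods instead of giving up the way the wget run earlier in the thread did.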

adinbied commented 7 years ago

OK, didn't know that. Thanks!

ghost commented 7 years ago

We have that subdirectory mirrored along with cdiac.ornl.gov. That subdirectory by itself has about 87 GB. This is tracked as The Azimuth Backup Project Issue #3. It was one of the first we did.

To everyone, I would not, however, rely upon single copies. It would be good to know someone else has it, too, or could replicate ours elsewhere.

On Thu, Jan 26, 2017, at 12:13, JeremiahCurtis wrote:

I'm also running downthemall on thousands of files from a lot of the directories at http://cdiac.ornl.gov/ftp/ This doesn't help direct download speeds, but if someone can confirm that the above ftp has been completely mirrored, I will end the dta session and that should speed up direct download.....thanks

JeremiahCurtis commented 7 years ago

what is the cdiac ftp mirror address? i followed the link on the main cdiac issue page here, and could not actually find any data.....maybe i'm missing something....thanks

Serubin commented 7 years ago

Given that this data source may be taken down at any time (and that the source is crazy slow), I think priority one should be downloading it - even if it's spread across multiple people. We can consolidate and duplicate later.

JeremiahCurtis commented 7 years ago

are we talking about cdiac or the epa ftp?

Serubin commented 7 years ago

I'm talking about epa ftp - that's what this issue is for.

JeremiahCurtis commented 7 years ago

if someone else can get the 3 large files at ftp://newftp.epa.gov/RSEI/Version233_RY2012/Disaggregated_Microdata/, go for it......i have the first file under download but it would take weeks at my current download rate....i am having mixed success at ftp://newftp.epa.gov/RSEI/Version233_RY2012/Aggregated_Grid_Cell_Data/

randomvariable commented 7 years ago

Started a sync of ftp://newftp.epa.gov/RSEI/Version233_RY2012/Disaggregated_Microdata/ at about 500KB/s

gofrogs2013 commented 7 years ago

I'm curious whether it would be worthwhile to make a FOIA request for this information, as I'm having the same issue with slow downloads. We could get it on a hard drive or similar, albeit with a fee. The entire dataset could be sent on a 2 TB external HD.

bkirkbri commented 7 years ago

@gofrogs2013 Good idea

bkirkbri commented 7 years ago

Can someone volunteer to coordinate this issue? It's great that so many people are dividing it up to get it done! If one of you could track who has what that would be really helpful. Thanks!

Serubin commented 7 years ago

I've suffered an untimely hard drive failure, I gotta back out. Sorry.

gofrogs2013 commented 7 years ago

I went ahead and made a FOIA request for all data in the newftp folder. You can check the progress here: https://foiaonline.regulations.gov/foia/action/public/view/request?objectId=090004d281137e25

gofrogs2013 commented 7 years ago

@bkirkbri Per the previous comment, I've made the FOIA request and added a link. I won't be able to coordinate it beyond that if we still want to try downloading the rest of it (which is probably the case) as I'll be working on NASA ERS files #289 for a while, but I'll post here if they approve the request.

JeremiahCurtis commented 7 years ago

@randomvariable How is the microdata folder moving? I am attempting a grab of the following RSEI subfolders: temp and shapefiles

donbright commented 7 years ago

fyi for anyone trying to look at @empirical-bayesian issue links, they actually refer to https://bitbucket.org/azimuth-backup/azimuth-inventory/issues/89 not the automatically generated github issues (like this #89)

StephWo commented 7 years ago

I'm trying to get those Microdata files. I started with the last one in alphabetical order (Micro2012_2012...) and will go backwards from that. ETA for the first file is in 9 days...

gofrogs2013 commented 7 years ago

@BauerPiepenbrink Is your download still going, and if so do you have the same ETA? Hopefully it will be possible to download these large files, but if not I will try getting them from the agency via FOIA as I mentioned above.

donbright commented 7 years ago

I just checked and ./AIR_QUALITY_DATA only has 58M of data in a single .zip file, which is far less than what @serubin reported above.

does anyone have a public mirror up for cross-checking data?

StephWo commented 7 years ago

@gofrogs2013 steady as a rock. ETA 6d 23h with an average of 130 K/s. It's not fast but reliable so far.

A friend of mine and I used to try to calculate which has better bandwidth from Europe to China: a gigabit internet uplink or a sea container full of hard drives. Getting a physical backup seems the way to go if possible. Anyway, I keep on nibbling. 27% already done :)
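The ETA figures are easy to sanity-check. A rough Python sketch, assuming "K/s" means 1000 bytes per second and taking the file size (110,831,639,138 bytes) from the hashdeep line posted further down this thread; the helper name is made up for illustration:

```python
def eta_days(total_bytes, bytes_per_second, fraction_done=0.0):
    """Days left to fetch the rest of a file at a steady rate."""
    remaining = total_bytes * (1.0 - fraction_done)
    return remaining / bytes_per_second / 86_400  # 86,400 seconds per day


FILE_BYTES = 110_831_639_138  # Micro2012_2012.csv
RATE = 130 * 1000             # assumption: 130 K/s == 130,000 bytes/s

print(f"whole file:  {eta_days(FILE_BYTES, RATE):.1f} days")
print(f"at 27% done: {eta_days(FILE_BYTES, RATE, 0.27):.1f} days")
```

Under those assumptions the whole file takes a little under 10 days and the remaining 73% about 7 days, which is in line with the ETAs quoted in this thread.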

StephWo commented 7 years ago

I shouldn't have jinxed it. Got interrupted by the server half an hour ago. Continuing now. Make sure to use a download client that can resume after a disconnect.
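If you script the download yourself, resuming comes down to asking the server to start at the byte offset you already have. A minimal sketch with Python's standard `ftplib`, assuming the server honours the FTP `REST` command (not verified for newftp.epa.gov); `resume_download` and its parameters are illustrative names:

```python
import os


def resume_download(ftp, remote_name, local_path, blocksize=64 * 1024):
    """Append the missing tail of `remote_name` to `local_path`.

    `ftp` is an already logged-in ftplib.FTP (or compatible) object.
    Returns the byte offset the transfer resumed from.
    """
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    with open(local_path, "ab") as out:  # append, never truncate
        ftp.retrbinary(
            f"RETR {remote_name}",
            out.write,
            blocksize=blocksize,
            rest=offset or None,  # rest=N makes ftplib send "REST N" first
        )
    return offset
```

Calling it again after a dropped connection picks up where the partial file ends. It does not check that the partial file is intact, so comparing checksums afterwards is still worthwhile.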

JeremiahCurtis commented 7 years ago

@Serubin hope your hard drive failure doesn't mean your download is irretrievable :)

StephWo commented 7 years ago

So, first file from the Disaggregated_Microdata folder is finally downloaded. It's Micro2012_2012.csv

http://176.9.83.61/InProgress_279/Disaggregated_Microdata/ This link will change later on.

Hashdeep Checksum for that single file:

110831639138,1d94bea31fe0bd03d732e01b7e7d6ab8,9087314828d9736e275d395f749b354676f7f4164a003319c3501257053b8366,Micro2012_2012.csv
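Anyone mirroring this file can compare their copy against that line. A small Python sketch, assuming the line uses hashdeep's default column order (size, MD5, SHA-256, filename); the function names are illustrative, and the file is read in chunks so a 110 GB CSV never has to fit in memory:

```python
import hashlib


def parse_hashdeep_line(line):
    """Split 'size,md5,sha256,filename' into its four fields."""
    size, md5_hex, sha256_hex, name = line.strip().split(",", 3)
    return int(size), md5_hex.lower(), sha256_hex.lower(), name


def matches_hashdeep(path, line, chunk_size=1 << 20):
    """True if the file at `path` has the size and both digests from `line`."""
    size, md5_hex, sha256_hex, _ = parse_hashdeep_line(line)
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    total = 0
    with open(path, "rb") as f:
        # Read 1 MiB at a time until read() returns an empty bytes object.
        for block in iter(lambda: f.read(chunk_size), b""):
            md5.update(block)
            sha256.update(block)
            total += len(block)
    return (total == size
            and md5.hexdigest() == md5_hex
            and sha256.hexdigest() == sha256_hex)
```

For example, `matches_hashdeep("Micro2012_2012.csv", line)` with `line` set to the checksum line above should return `True` for an intact copy.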

gofrogs2013 commented 7 years ago

Disregard the referenced issue above. @BauerPiepenbrink Congrats on the download! Were you able to actually open the file considering its size?

StephWo commented 7 years ago

@gofrogs2013 well, I won't try to open it in whole :) If I run tail -n 40 Micro2012_2012.csv I get the last 40 lines of that file, which look like this:

14,1275,2277,5231336,318,1204704,6,3.29918E-08,5.93853E-06,2.18259E-07,0.00000E+00,2.18259E-07,1.55910E+02
14,1275,2278,5231336,318,1204704,6,3.01574E-08,5.42833E-06,4.37626E-09,0.00000E+00,4.37626E-09,3.69841E+00
14,1275,2279,5231336,318,1204704,6,2.75008E-08,4.95015E-06,4.11220E-08,0.00000E+00,4.11220E-08,3.59820E+01

So, as the file extension promised, comma separated values. If someone really wants to dig into that, there seems to be a tool called Microdata_Extractor that basically filters the csv files.

I will try to download that too if I stumble upon it.
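Until that tool turns up, a file this size can still be filtered by streaming it row by row. A minimal Python sketch; what each column means is not documented in this thread, so the filter works on a plain column index, and `filter_rows` is a hypothetical helper rather than anything shipped with RSEI:

```python
import csv


def filter_rows(lines, column, value):
    """Yield CSV rows whose field at index `column` equals `value`.

    `lines` can be an open file object, so only one row is held in
    memory at a time.
    """
    for row in csv.reader(lines):
        if len(row) > column and row[column] == value:
            yield row


# The three sample rows from the tail output; column 2 is only an example.
sample = [
    "14,1275,2277,5231336,318,1204704,6,3.29918E-08,5.93853E-06,2.18259E-07,0.00000E+00,2.18259E-07,1.55910E+02",
    "14,1275,2278,5231336,318,1204704,6,3.01574E-08,5.42833E-06,4.37626E-09,0.00000E+00,4.37626E-09,3.69841E+00",
    "14,1275,2279,5231336,318,1204704,6,2.75008E-08,4.95015E-06,4.11220E-08,0.00000E+00,4.11220E-08,3.59820E+01",
]
matches = list(filter_rows(sample, column=2, value="2278"))
print(matches)
```

On the real file the same function can be fed `open("Micro2012_2012.csv", newline="")` instead of the sample list, at the cost of one full pass over the 110 GB.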

JeremiahCurtis commented 7 years ago

just wondering what we're still missing on the RSEI folder;

I have finished:

Version233_RY2012/Public_Release_Data/CSV version/
Version233_RY2012/Aggregated_Grid_Cell_Data/
Census_XWalks/
Shapefiles/

Serubin commented 7 years ago

@JeremiahCurtis Still working on retrieval.

Picked up another 4TB drive so I should be able to get back to data pulling soon.