rabernat opened this issue 4 years ago (status: Open)
Thanks for this! I've been running into these quite often over the past month or so. It was very intermittent, and it definitely seemed to be an external problem, because I was running code that had worked like a charm before. The only workaround I found was requesting smaller pieces of data at a time to reduce the transfer size.
Thanks from my end too! I've been getting those and hadn't figured out a workaround. Pretty random errors, and hard to reproduce!
Yes, these errors indicate that the underlying system serving the data, i.e. the ECCO data portal, is unreliable in some way. Unfortunately, that system is run by NASA and is basically out of our control, so it's not clear how we could help improve its reliability.
If I were in charge, I would just drop the data in Google Cloud Storage and call it a day. 😉
Just being curious: if someone managed to get the data onto their own HTTP server, Google Drive, or somewhere else, is there a way to tell llcreader.ECCOPortalLLC4320Model to point to that location instead? Any advice?
Absolutely, that would work just fine. You would just create a custom subclass of LLC4320Model and point it at your server. Here is the one for the ECCO portal; it's pretty simple.
You could do this in public (and share it here), or in private and just provide access to a select group. It would also work with an FTP server or anything else supported by filesystem-spec. (You could even put it in Dropbox! 😆)
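For reference, a subclass along these lines should work. This is only a sketch: the store class and constructor arguments are assumptions based on llcreader's internals (check the known_models source linked above for the exact names in your xmitgcm version), and the mirror URLs are hypothetical placeholders.

import fsspec
from xmitgcm import llcreader
from xmitgcm.llcreader import stores

class MyMirrorLLC4320Model(llcreader.LLC4320Model):
    # LLC4320 model pointed at a private mirror instead of the ECCO data portal
    def __init__(self):
        fs = fsspec.filesystem('https')  # could also be 'ftp', 'gs', etc., via filesystem-spec
        store = stores.BaseStore(
            fs,
            base_path='https://my-mirror.example.org/llc_4320/compressed',  # hypothetical mirror
            grid_path='https://my-mirror.example.org/llc_4320/grid',        # hypothetical mirror
        )
        super().__init__(store)

model = MyMirrorLLC4320Model()
ds = model.get_dataset(varnames=['Theta'])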
Nice! I'll give it a try with a subset of the data, and see how it behaves. Thanks!
Thanks for this. Why do you need to set dask.config.set(scheduler='single-threaded') when doing this?
When I don't use that option, I (at least sometimes) get the error "ClientOSError: Cannot write to closing transport". I wonder if that is also related to some server issue on the NASA side?
Also, it seems like @rabernat's suggested solution sometimes still fails on the server side, and 10 retries are not enough; I am getting failures even with retries set to 1000+. The download speeds are also around 0.35 MB/s for me, which seems really slow.
This is the code I am using to download data from a small region (with multiple Z levels): https://gist.github.com/dhruvbalwada/0b1dc9c7002c278056f6e3320a45b9da. I had to switch to downloading one snapshot in time (~150 MB) at a time, since downloading all time steps at once (~1 TB) was not going to work. However, it is taking almost 10 minutes to download 150 MB, which would put me at about 2 months to download one year's worth of data. Was it silly to think that I could download this dataset using this platform, given how huge it is? Or are things going so slowly because of server-side issues, or because I am requesting multiple levels at once? Any help on downloading this data faster would be appreciated.
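(For anyone following along, the one-snapshot-at-a-time approach described above looks roughly like the sketch below. The variable list, depth levels, index window, and output path are hypothetical placeholders, not the values used in the gist.)

import dask
from xmitgcm import llcreader

model = llcreader.ECCOPortalLLC4320Model()

# lazily build the dataset, restricted to the upper 20 depth levels (hypothetical choice)
ds = model.get_dataset(varnames=['Theta'], k_levels=list(range(20)), type='latlon')

# hypothetical small index window standing in for the region of interest
region = ds.isel(i=slice(8000, 8500), j=slice(6000, 6500))

with dask.config.set(scheduler='single-threaded'):
    for n in range(region.sizes['time']):
        snapshot = region.isel(time=[n])  # one ~150 MB snapshot at a time
        delayed = snapshot.to_zarr(f'llc4320_region/snap_{n:05d}.zarr', mode='w', compute=False)
        dask.compute(delayed, retries=10)  # retry transient portal failures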
I wonder if it is possible to report to NASA that the server is being unreliable? @menemenlis and @ryanspaulding
Yes @dhruvbalwada, that is something I was meaning to report here earlier, just for reference: this solution seems to work well for smaller chunks, but otherwise it often fails. This intermittency has been going on for about the past 3 weeks, I want to say.
@dhruvbalwada I am experiencing similar issues. I even decided to first transform some LLC4320 data (1-3 faces) into a single array (with no faces), in a similar fashion to what llcreader does (loading faces into memory one at a time), but I can only do one or two snapshots at a time (it takes ~1-2 hrs if I do two snapshots, which after storing as Zarr files is ~O(10 GB)). Even then, most of the time I get buffer errors. I'm also not sure retries=10 is a good long-term solution for downloading data, given the transfer speed, the size of the LLC4320 dataset, and the reliability issues associated with the ECCO Portal.
I had emailed NASA support about this issue, and they replied:
Hi Dhruv
Just wanted to provide an update. We are currently working on upgrading our server and networking infrastructure which should be deployed in the next few weeks.
We will contact you when the upgrade is complete, and possibly work with you to get an external assessment of the improvements. Hope that’s okay with you.
Thanks!
--
Shubha Ranjan
Big Data and Analytics Lead
Advanced Computing Branch
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
650.604.1918
So, hopefully within a few weeks our troubles might be resolved.
I hope that also results in higher transfer rates. Let's wait a couple more weeks.
I heard back from NASA, and they said that they have updated their servers and anticipate that these problems should now be solved. It would be great if a few people here could start testing to see if things are working, and post here if there are still problems.
Update: I tried again and ran into the same old problem: https://gist.github.com/dhruvbalwada/86f22c8b3be58fc24a9147524a5584e6. I have informed them, and I believe they are looking into it.
My two cents: I've been trying to download data intensively for the last 7 days, and there are still intermittent failures. They typically don't last too long once I decrease the number of connections (I'm downloading in parallel with a custom retry policy), but at times it gets completely stuck, even for 2+ hours. When that happens, I can verify that the ECCO data portal shows the data as "temporarily unavailable".
Actually, my experience is worse than before the server update. It doesn't seem likely to improve soon.
Thanks to everyone for your persistence on this. I'm sorry for the frustration you are experiencing.
We are working on transferring some of the data to Open Storage Network, where hopefully it will be more reliable.
Hi everyone. I have been in further touch with NASA to solve this, and it seems the problem has now been resolved. They have replaced a single-threaded NFS server with a multi-threaded one, in the hope that it would help.
It would be nice if a few of us could give it a spin before we close this down. I tried downloading a small set of the data and was able to do it, too.
It seems like some other people are already having success, e.g. https://github.com/MITgcm/xmitgcm/issues/232.
I can confirm that. I have been downloading data for the last week without any issues.
@rabernat - we should close this for now.
I was running into some issues before, like Exception: ServerDisconnectedError('Server disconnected') when trying to download data in zarr format, but not the original buffer_size errors as before. My issues may be related to the size of the data I am trying to download (I am trying to download as much data as I can; e.g., see the snapshot of my terminal for the variable Theta).
For now, the following code seems to be working for downloading the full GRID data:
import zarr
import xarray as xr
from xmitgcm import llcreader
import dask
from dask.diagnostics import ProgressBar

model = llcreader.ECCOPortalLLC4320Model()

## get grid coords data associated with scalar and vector variables
DS = model.get_dataset(varnames=['Theta', 'UVEL', 'VVEL'])
DS = DS.reset_coords().drop_vars(['Theta', 'UVEL', 'VVEL'])
DS = DS.chunk({'i': 4320, 'j': 4320, 'face': 3, 'k': 1, 'k_p1': 1, 'k_l': 1, 'k_u': 1})

GRID_zarr = '.../GRID/'  # path to a node directory
compressor = zarr.Blosc(cname='zstd', clevel=3, shuffle=2)
encoding = {vname: {'compressor': compressor} for vname in DS.data_vars}
DS_delayed = DS.to_zarr(GRID_zarr, mode='w', encoding=encoding, compute=False, consolidated=True)

with dask.config.set(scheduler='single-threaded'):
    with ProgressBar():
        dask.compute(DS_delayed, retries=10)
I still don't understand the ServerDisconnectedError I was getting before, but it may be unrelated to buffer_size. Like I mentioned, I am trying to store locally as much data as I want, and while the code above seems to be working for now, it is going rather slowly (30% downloaded, about 300 MB in about 1 hr 20 min). I haven't tested different chunking yet...
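(Once a write like the one above finishes, a quick sanity check is to open the store back up lazily and inspect the metadata; the path here is the same placeholder used in the code above.)

import xarray as xr

# open the consolidated store lazily and look at dims/variables without loading data
grid = xr.open_zarr('.../GRID/', consolidated=True)
print(grid)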
(30% downloaded, about 300 MB in about 1 hr 20 min)
Something is wrong. That speed is absurdly slow.
This could be a case where dask's automatic parallelism is hurting more than helping. Thrashing the service with many simultaneous requests might actually slow things down.
I recommend that you bypass python, dask, etc. completely and just start running some wgets from the command line of the files from the portal. Do some manual timing and see how much variability you get.
This is what I get when I do that:
$ wget https://data.nas.nasa.gov/ecco/download_data.php?file=/eccodata/llc_4320/grid/DXG.data
--2020-11-12 14:21:40-- https://data.nas.nasa.gov/ecco/download_data.php?file=/eccodata/llc_4320/grid/DXG.data
Resolving data.nas.nasa.gov... 2001:4d0:6318:903:198:9:3:128, 2001:4d0:6318:903:198:9:3:129, 198.9.3.129, ...
Connecting to data.nas.nasa.gov|2001:4d0:6318:903:198:9:3:128|:443... failed: Network is unreachable.
Connecting to data.nas.nasa.gov|2001:4d0:6318:903:198:9:3:129|:443... failed: Network is unreachable.
Connecting to data.nas.nasa.gov|198.9.3.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 970444800 (925M) [application/octet-stream]
Saving to: ‘download_data.php?file=%2Feccodata%2Fllc_4320%2Fgrid%2FDXG.data’
ta 11%[===> ] 108.86M 7.89MB/s
I recommend that you bypass python, dask, etc. completely and just start running some wgets from the command line of the files from the portal. Do some manual timing and see how much variability you get.
Thanks, will definitely try that! It's been pretty frustrating so far... Any idea about slicing data with wget (in time)? We're interested in spreading the data across multiple nodes/volumes
Actually, I just realized that you are already using scheduler='single-threaded', so I'm not sure my suggestion is relevant any more.
The fact is, we are not going to be able to move any significant volume of data from the LLC4320 at 8 MB/s. I think we should follow the path we outlined at the last meeting.
Another option would be to produce the Zarr data on Pleiades and then copy it using shiftc or bbftp.
Any idea about slicing data with wget (in time)? We're interested in spreading the data across multiple nodes/volumes
I don't understand the question.
The fact is, we are not going to be able to move any significant volume of data from the LLC4320 at 8 MB/s. I think we should follow the path we outlined at the last meeting.
Ok, sounds like a much better option.
I don't understand the question.
Not relevant anymore, but it was related to your suggestion to use wget to download variables (e.g., DXG.data in your example), which seems to have triggered the download of the complete DXG variable. If, instead of downloading DXG, I had intended to download Theta (without slicing in time), it might not have fit in the directory...
Regardless, I will then wait for the data to be moved to OSN.
Thanks Ryan!
Regardless, I will then wait for the data to be moved to OSN.
My understanding is that @christophernhill is the one who is going to lead (or delegate) this task. Perhaps Chris could give us an update on the progress?
Hi All,
Quick update.
So we are getting about 150-200 MB/s mirroring data to the OSN location (using bbftp). Not great, but not terrible, and data is moving across. I can check on transfer progress tomorrow, and on any endpoint access that might already be possible.
We are also trying to see if there are ways to speed up further.
Chris
Hi All, I just wanted to say that we have been investigating variable WAN speeds and have noticed that bandwidth bounces all around depending on the path taken to get to data.nas.nasa.gov. Internally from Pleiades we are getting over 100 MB/s. We have changed a number of things in our backend infrastructure to increase stability and speed, and we have also replaced our hardware, which is all on 10 Gb connections. We have had success in the past helping other organizations improve the network path to our servers. If you want our network team to work with you directly to help improve your speeds, please submit a ticket to support@nas.nasa.gov and we will work with you to improve your experience.
I am running the following code:
I intermittently encounter errors such as this one at random points in the computation:
The fact that this is intermittent suggests some sort of transport problem.
An effective workaround was to use dask retries:
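(The original snippet isn't preserved above, but the retries workaround being referred to follows the same pattern shown earlier in the thread; roughly, with an illustrative variable choice, time slice, and output path:)

import dask
from xmitgcm import llcreader

model = llcreader.ECCOPortalLLC4320Model()
ds = model.get_dataset(varnames=['Theta'])  # illustrative variable choice

# build the zarr write as a delayed task graph...
delayed = ds.isel(time=slice(0, 10)).to_zarr('Theta_subset.zarr', mode='w', compute=False)

# ...then let dask re-run any task that fails with a transient network error
dask.compute(delayed, retries=10)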