MITgcm / xmitgcm

Read MITgcm mds binary files into xarray
http://xmitgcm.readthedocs.io
MIT License

Intermittent buffer_size errors from ECCO data portal #210

Open rabernat opened 4 years ago

rabernat commented 4 years ago

I am running the following code:

from xmitgcm import llcreader
import dask
from dask.diagnostics import ProgressBar
model = llcreader.known_models.ECCOPortalLLC2160Model()
myiter = model.iter_start + 100 * model.iter_step
ds = model.get_dataset(k_levels=range(0, 60), k_chunksize=5,
                       varnames=['Theta', 'Salt'], read_grid=False,
                       iter_start=myiter, iter_stop=myiter+1,
                       type='latlon')

ny = ds.dims['j']
nx = ds.dims['i']
dsl = ds.isel(time=0, j=slice(ny//5, 2*ny//5), i=slice(0, nx//4))
display(dsl)

with dask.config.set(scheduler='single-threaded'):
    with ProgressBar():
        dsl.load()

I intermittently encounter errors such as this one at random points in the computation:

/srv/conda/envs/notebook/lib/python3.7/site-packages/xmitgcm/llcreader/llcmodel.py in _get_facet_chunk(store, varname, iternum, nfacet, klevels, nx, nz, dtype, mask_override)
    424         file.seek(read_offset)
    425         buffer = file.read(read_length)
--> 426         data = np.frombuffer(buffer, dtype=dtype)
    427         assert len(data) == (end - start)
    428 

ValueError: buffer size must be a multiple of element size

The fact that this is intermittent suggests some sort of transport problem.
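For context on the error itself: np.frombuffer refuses any buffer whose length is not a whole number of elements (4 bytes each for float32), so a response that gets cut off mid-transfer fails exactly like this. A minimal illustration (not xmitgcm code, just the failure mode):

import numpy as np

good = bytes(40)                  # a complete read: 10 float32 values
np.frombuffer(good, dtype='>f4')  # works, returns 10 elements

bad = bytes(38)                   # a read cut short mid-transfer
np.frombuffer(bad, dtype='>f4')   # ValueError: buffer size must be a multiple of element size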

An effective workaround was to use dask retries:

with dask.config.set(scheduler='single-threaded'):
    with ProgressBar():
        (dsl,) = dask.compute(dsl, retries=10)  # dask.compute returns a tuple
ocesaulo commented 4 years ago

Thanks for this! I've been running into these quite often over the past month or so. It was very intermittent, and it definitely seemed to be an external problem, because I was running code that had worked like a charm before. The only workaround I had found was requesting smaller pieces of data at a time to reduce each transfer.

Mikejmnez commented 4 years ago

Thanks from my end too! I've been getting these as well and hadn't figured out a workaround. The errors are pretty random and hard to reproduce!

rabernat commented 4 years ago

Yes, these errors indicate that the underlying system serving the data, i.e. the ECCO data portal, is unreliable in some way. Unfortunately, that system is run by NASA and is basically out of our control, so it's not clear how we could help improve its reliability.

If I were in charge, I would just drop the data in Google Cloud Storage and call it a day. 😉

antonimmo commented 4 years ago

If I were in charge, I would just drop the data in Google Cloud Storage and call it a day. 😉

Just curious: if someone managed to get the data onto their own HTTP server, Google Drive, or somewhere else, is there a way to tell llcreader.ECCOPortalLLC4320Model to point to that location instead? Any advice?

rabernat commented 4 years ago

Absolutely, that would work just fine. You would just create a custom subclass of LLC4320Model and point it at your server. Here is the one for the ECCO portal. It's pretty simple.

https://github.com/MITgcm/xmitgcm/blob/f5ee774f6ee3788d48f32a9bd1fbe02ebea9449b/xmitgcm/llcreader/known_models.py#L87-L96

You could do this in public (and share it here), or in private, and just provide access to a select group. It would also work with an FTP server or anything else supported by filesystem-spec. (You could even put it in Dropbox! 😆 )
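Schematically, a custom model would look something like this (a sketch only: mirror the exact store arguments from the known_models.py class linked above, and replace the example URLs with wherever you host the files):

import fsspec
from xmitgcm.llcreader import stores
from xmitgcm.llcreader.known_models import LLC4320Model


class MyServerLLC4320Model(LLC4320Model):
    def __init__(self):
        # any fsspec-supported protocol works here (http, ftp, gcs, ...)
        fs = fsspec.filesystem('http')
        # placeholder URLs -- point these at your copies of the data and grid files;
        # copy any remaining keyword arguments (shrunk, join_char, mask paths, ...)
        # from the linked ECCOPortal class if your files mirror the portal layout
        store = stores.NestedStore(fs,
                                   base_path='https://my-server.example.com/llc_4320',
                                   grid_path='https://my-server.example.com/llc_4320/grid')
        super().__init__(store)


model = MyServerLLC4320Model()
ds = model.get_dataset(varnames=['Theta'], type='latlon')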

antonimmo commented 4 years ago

Nice! I'll give it a try with a subset of the data, and see how it behaves. Thanks!

dhruvbalwada commented 4 years ago

Thanks for this. Why do you need to set dask.config.set(scheduler='single-threaded') when doing this?

When I don't use that option, I sometimes get the error "ClientOSError: Cannot write to closing transport". I'm wondering if that is also related to NASA-side server issues?

dhruvbalwada commented 4 years ago

Also, it seems like @rabernat's suggested solution is also failing sometimes on the server, and 10 retries are not enough; I am getting failures even with retries set at 1000+. The download speeds are around 0.35 MB/s for me, which seems really slow.

This is the code I am using to download data from a small region (with multiple Z levels): https://gist.github.com/dhruvbalwada/0b1dc9c7002c278056f6e3320a45b9da. I had to switch to downloading one snapshot in time (~150 MB) at a time, since downloading all time steps at once (~1 TB) was not going to work. However, it is taking almost 10 minutes to download 150 MB, which would put me at about 2 months to download one year's worth of data. Was it silly to think I could download this data through this platform (since it is so huge)? Or are things going so slowly because of server-side issues? Or because I am requesting multiple levels at once? Any help on downloading this data faster would be appreciated.
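For reference, the loop in the gist has roughly this shape (a sketch only; the model choice, region indices, and output paths below are placeholders, not the values I actually use):

from xmitgcm import llcreader
import dask

model = llcreader.ECCOPortalLLC4320Model()
ds = model.get_dataset(varnames=['Theta'], k_levels=range(0, 60),
                       read_grid=False, type='latlon')

# cut out a small region (placeholder indices), then write one snapshot at a time
region = ds.isel(j=slice(2000, 2500), i=slice(0, 500))

for n in range(region.sizes['time']):
    snapshot = region.isel(time=slice(n, n + 1))
    delayed = snapshot.to_zarr(f'theta_snapshot_{n:05d}.zarr', mode='w', compute=False)
    dask.compute(delayed, retries=10)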

I wonder if it is possible to report to NASA that the server is being unreliable? @menemenlis and @ryanspaulding

ocesaulo commented 4 years ago

Yes @dhruvbalwada, that is something I was meaning to report here earlier, just for reference: this solution seems to work well for smaller chunks, but otherwise it often fails. This intermittency has been ongoing for the past 3 weeks, I want to say.

Mikejmnez commented 4 years ago

This is the code I am using to download data from a small region (with multiple Z levels): https://gist.github.com/dhruvbalwada/0b1dc9c7002c278056f6e3320a45b9da. I had to switch to downloading one snapshot in time (~150 MB) at a time, since downloading all time steps at once (~1 TB) was not going to work. However, it is taking almost 10 minutes to download 150 MB, which would put me at about 2 months to download one year's worth of data. Was it silly to think I could download this data through this platform (since it is so huge)? Or are things going so slowly because of server-side issues? Or because I am requesting multiple levels at once? Any help on downloading this data faster would be appreciated.

@dhruvbalwada I am experiencing similar issues. I even decided to first transform some LLC4320 data (1-3 faces) into a single array (with no faces), similar to what llcreader does (loading a few faces into memory at a time), but I can only do it one or two snapshots at a time (it takes ~1-2 hours for two snapshots, which come to roughly O(10 GB) once stored as Zarr). Even then I get buffer errors most of the time. I am also not sure retries=10 is a good long-term solution for downloading data, given the transfer speed, the size of the LLC4320 dataset, and the reliability issues with the ECCO Portal.

dhruvbalwada commented 4 years ago

I emailed NASA support about this issue, and they replied:

Hi Dhruv

Just wanted to provide an update. We are currently working on upgrading our server and networking infrastructure which should be deployed in the next few weeks. 
We will contact you when the upgrade is complete, and possibly work with you to get an external assessment of the improvements. Hope that’s okay with you. 

Thanks!

--
Shubha Ranjan
Big Data and Analytics Lead
Advanced Computing Branch
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
650.604.1918 

So, hopefully within a few weeks our troubles might be resolved.

antonimmo commented 4 years ago

So, hopefully within a few weeks our troubles might be resolved.

I hope that also results in higher transfer rates. Let's wait a couple more weeks.

dhruvbalwada commented 3 years ago

I heard back from NASA, and they said that they have updated their servers and anticipate that these problems should now be solved. It would be great if a few people here could start testing to see if things are working, and post here if there are still problems.

Update: I tried again and ran into the same old problem: https://gist.github.com/dhruvbalwada/86f22c8b3be58fc24a9147524a5584e6. I have informed them, and I believe they are looking into it.

antonimmo commented 3 years ago

I heard back from NASA, and they said that they have updated their servers and anticipate that these problems should now be solved. It would be great if a few people here could start testing to see if things are working, and post here if there are still problems.

My two cents: I've been trying to download data intensively for the last 7 days, and there are still intermittent failures. They typically don't last long once I decrease the number of connections (I'm downloading in parallel with a custom retry policy, sketched below), but at times the portal gets completely stuck, even for 2+ hours. When that happens, I can confirm that the ECCO data portal shows the data as "temporarily unavailable".

Actually, my experience is worse than before the server update. It doesn't look like this will improve soon.
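For reference, the retry policy is essentially just a small wrapper along these lines (a sketch with placeholder retry counts, delays, and error types, not the exact code):

import time

def fetch_with_retries(fetch, max_retries=10, base_delay=5):
    # `fetch` is any function that downloads one chunk; retry with a growing
    # delay whenever the portal drops the connection or returns a short read
    for attempt in range(max_retries):
        try:
            return fetch()
        except (OSError, ValueError) as err:
            wait = base_delay * (attempt + 1)
            print(f'attempt {attempt + 1} failed ({err}); retrying in {wait}s')
            time.sleep(wait)
    raise RuntimeError('portal still unavailable after all retries')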

rabernat commented 3 years ago

Thanks to everyone for your persistence on this. I'm sorry for the frustration you are experiencing.

We are working on transferring some of the data to Open Storage Network, where hopefully it will be more reliable.

dhruvbalwada commented 3 years ago

Hi everyone. I have been in further touch with NASA about this, and it seems like the problem has now been resolved. They have replaced a single-threaded NFS server with a multi-threaded one in the hope that it would help.

It would be nice if a few of us could give it a spin before we close this down. I tried downloading a small subset of the data and it worked for me.

It seems like some other people are already having success, e.g. https://github.com/MITgcm/xmitgcm/issues/232.

antonimmo commented 3 years ago

I can confirm that. I have been downloading data for the last week without any issues.

dhruvbalwada commented 3 years ago

@rabernat - we should close this for now.

Mikejmnez commented 3 years ago

I was running into some issues before, namely

Exception: ServerDisconnectedError('Server disconnected')

when trying to download data in zarr format, but not the original buffer_size errors. My issues may be related to the amount of data I am trying to download (I am trying to download as much data as I can; e.g., see the terminal screenshot for the variable Theta).

For now, the following code seems to be working for downloading the full GRID data:

(terminal screenshot)

import zarr
import xarray as xr
from xmitgcm import llcreader
import dask
from dask.diagnostics import ProgressBar

model = llcreader.ECCOPortalLLC4320Model()
## get grid coords data associated with scalar and vector variables
DS = model.get_dataset(varnames=['Theta', 'UVEL', 'VVEL'])
DS = DS.reset_coords().drop_vars(['Theta','UVEL', 'VVEL'])

DS = DS.chunk({'i':4320, 'j':4320, 'face':3, 'k':1, 'k_p1':1, 'k_l':1, 'k_u':1})

GRID_zarr = '.../GRID/' #path to a node directory
compressor = zarr.Blosc(cname='zstd', clevel=3, shuffle=2)
encoding = {vname: {'compressor': compressor} for vname in DS.data_vars}
DS_delayed = DS.to_zarr(GRID_zarr, mode='w', encoding=encoding, compute=False, consolidated=True)

with dask.config.set(scheduler='single-threaded'):
    with ProgressBar():
        dask.compute(DS_delayed, retries=10)

I still don't understand the ServerDisconnectedError I was getting before, but it may be unrelated to the buffer_size issue. As I mentioned, I am trying to store as much data locally as I can, and while the code above seems to be working for now, it is going rather slowly (30% downloaded, about 300 MB in about 1 hr 20 min). I haven't tested different chunking yet...

rabernat commented 3 years ago

(30% downloaded, about 300 MB in about 1 hr 20 min)

Something is wrong. That speed is absurdly slow.

This could be a case where dask's automatic parallelism is hurting more than helping. Thrashing the service with many simultaneous requests might actually slow things down.

I recommend that you bypass python, dask, etc. completely and just run some wget commands from the command line against files from the portal. Do some manual timing and see how much variability you get.

This is what I get when I do that:

$ wget https://data.nas.nasa.gov/ecco/download_data.php?file=/eccodata/llc_4320/grid/DXG.data
--2020-11-12 14:21:40--  https://data.nas.nasa.gov/ecco/download_data.php?file=/eccodata/llc_4320/grid/DXG.data
Resolving data.nas.nasa.gov... 2001:4d0:6318:903:198:9:3:128, 2001:4d0:6318:903:198:9:3:129, 198.9.3.129, ...
Connecting to data.nas.nasa.gov|2001:4d0:6318:903:198:9:3:128|:443... failed: Network is unreachable.
Connecting to data.nas.nasa.gov|2001:4d0:6318:903:198:9:3:129|:443... failed: Network is unreachable.
Connecting to data.nas.nasa.gov|198.9.3.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 970444800 (925M) [application/octet-stream]
Saving to: ‘download_data.php?file=%2Feccodata%2Fllc_4320%2Fgrid%2FDXG.data’

ta                          11%[===>                                  ] 108.86M  7.89MB/s 
Mikejmnez commented 3 years ago

I recommend that you bypass python, dask, etc. completely and just start running some wgets from the command line of the files from the portal. Do some manual timing and see how much variability you get.

Thanks, I will definitely try that! It's been pretty frustrating so far... Any idea about slicing the data in time with wget? We're interested in spreading the data across multiple nodes/volumes.

rabernat commented 3 years ago

Actually, I just realized that you already are doing scheduler='single-threaded', so I'm not sure my suggestion is relevant any more.

The fact is, we are not going to be able to move any significant volume of data from the LLC4320 at 8 MB/s. I think we should follow the path we outlined at the last meeting.

Another option would be to produce the Zarr data on Pleiades and then copy it using shiftc or bbftp.

Any idea about slicing data with wget (in time)? We're interested in spreading the data across multiple nodes/volumes

I don't understand the question.

Mikejmnez commented 3 years ago

The fact is, we are not going to be able to move any significant volume of data from the LLC4320 at 8 MB/s. I think we should follow the path we outlined at the last meeting.

Ok, that sounds like a much better option.

I don't understand the question.

Not relevant anymore, but it was related to your suggestion to use wget to download variables (e.g. DXG.data in your example), which triggers the download of the complete DXG field. If instead of DXG I had tried to download Theta (without slicing in time), it might not have fit on the volume...

Regardless, I will wait for the data to be moved to OSN.

Thanks Ryan!

rabernat commented 3 years ago

Regardless, I will wait for the data to be moved to OSN.

My understanding is that @christophernhill is the one who is going to lead (or delegate) this task. Perhaps Chris could give us an update on the progress?

christophernhill commented 3 years ago

Hi All,

Quick update.

We are getting about 150-200 MB/s mirroring data to the OSN location (using bbftp). So not great, but not terrible, and data is moving across. I can check on transfer progress tomorrow, and on any endpoint access that might already be possible.

We are also trying to see if there are ways to speed up further.

Chris

ryanspaulding commented 3 years ago

Hi All, I just wanted to say that we have been investigating variable WAN speeds and have noticed that bandwidth bounces around depending on the path taken to reach data.nas.nasa.gov. Internally from Pleiades we are getting over 100 MB/s. We have changed a number of backend things in our infrastructure to increase stability and speed, and we have also swapped out our hardware, which is all on 10GB connections. We have had success in the past helping other organizations with the network path to our servers. If you want our Network team to work with you directly to help improve your speeds, please submit a ticket to support@nas.nasa.gov and we will work with you to improve your experience.