MITgcm / xmitgcm

Read MITgcm mds binary files into xarray
http://xmitgcm.readthedocs.io
MIT License

Unstable connection using llcreader #317

Open · abodner opened this issue 1 year ago

abodner commented 1 year ago

Hello,

This is my first time posting here!

I do not have a Pleiades account, but for my research I would like to use the llc4320 data. At first, I tried to use llcreader to access the model data, run the calculations with dask, and save only the final result locally. However, my connection is not stable enough to complete the calculations. Since I only need to subsample the data (10x10 degree boxes of 3D temperature, salinity, and velocities over the top 700 m), I have been trying to transfer the subsampled variables to my local machine. But even for this task my connection breaks, and it has been a painfully long process to get the data I need.

Any suggestions? Is there a way to ensure the connection does not break? Is it possible to transfer the data by another method (e.g. Globus or rsync)?

Thanks in advance!

timothyas commented 1 year ago

Hi @abodner! Sorry to hear about your troubles. Would you be able to share the lines of code you're using so we can make the discussion more concrete? Thanks!

abodner commented 1 year ago

Thanks @timothyas! It has indeed been quite a frustrating process!

Here is a sample of my code for the variable W (ideally I would have about 20 of these regions for each of the five variables; so far I have only managed to get 6):

import xmitgcm.llcreader as llcreader

# target region: a lat/lon box in the Southeast Pacific, top 700 m
lat_min, lat_max = -45, -30
lon_min, lon_max = -140, -125
depth_lim = -700

# model is assumed to be the LLC4320 ECCO data portal model
model = llcreader.ECCOPortalLLC4320Model()

ds_W_full = model.get_dataset(varnames=['W'], type='latlon')

# mask everything outside the box and deeper than 700 m (Zl is negative)
sel_area_W = ((ds_W_full.XC > lon_min) & (ds_W_full.XC < lon_max)
              & (ds_W_full.YC > lat_min) & (ds_W_full.YC < lat_max)
              & (ds_W_full.Zl > depth_lim))

# keep only the box and subsample to daily snapshots
ds_W = (ds_W_full.where(sel_area_W, drop=True)
        .resample(time='24H').nearest(tolerance='1H'))

# PATH is a local output directory defined elsewhere
ds_W.to_netcdf(PATH + 'raw_data/ds_W.nc', engine='h5netcdf')
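
One way to make a transfer like this restartable is to write one daily snapshot at a time and skip files that are already on disk, so that when the connection drops the loop can pick up where it left off. A rough sketch reusing the ds_W and PATH defined above (the per-snapshot file names are placeholders):

import os

# save each daily snapshot to its own file; skip snapshots that already
# exist so an interrupted transfer can resume instead of starting over
for t in range(ds_W.sizes['time']):
    fname = PATH + 'raw_data/ds_W_%05d.nc' % t
    if os.path.exists(fname):
        continue
    ds_W.isel(time=slice(t, t + 1)).to_netcdf(fname, engine='h5netcdf')

The pieces can be recombined afterwards with xarray.open_mfdataset.
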
rabernat commented 1 year ago

Hi @abodner - thanks for posting here! Welcome!

Unfortunately the ECCO data portal is just not reliable or fast enough to facilitate this volume of data transfer. The best bet is to use rclone to move the data off of Pleiades. In order to do that, you need an allocation on that computer, which I'm assuming you don't have.

Fortunately, we are working on creating a mirror of more of the LLC data on Open Storage Network. We'd be happy to prioritize transferring the data you need. I'm cc'ing @dhruvbalwada and @rsaim who are working on this project.

Would you be available to join a call tomorrow at 10am to discuss in more detail? We'll be at https://columbiauniversity.zoom.us/j/92320021983?pwd=RmJ2TngxYTNrM0Fwd0ZYVDBNOUsrZz09

abodner commented 1 year ago

Hi @rabernat! Thanks for your reply and willingness to help out. I would love to join the call tomorrow and discuss further. See you then and thanks again!

Shirui-peng commented 1 year ago

Hi all -- I'm trying to load the llc2160 data in a similar way. In particular, I want to subsample the temperature and salinity data with one-year, full-depth coverage at selected grid points around the Kuroshio region. Here are some example lines of code; I would appreciate any suggestions on how to do this efficiently. Thanks in advance!

import xmitgcm.llcreader as llcreader

model = llcreader.ECCOPortalLLC2160Model()

# one year of output, one snapshot every 1920 iterations
n = 413
ds = model.get_dataset(varnames=['Theta', 'SALT'],
                       iter_start=92160 + n*1920,
                       iter_stop=92160 + (n + 365)*1920,
                       iter_step=1920)

# full-depth temperature at a single grid point
pT = ds.Theta.isel(face=7, i=1600, j=320).values

timothyas commented 1 year ago

Hi @Shirui-peng, sorry for the long silence. Is this still an issue? Do you need a larger spatial region, and do you need all vertical levels? If you need a larger horizontal area, I would access the data with the entire horizontal slice you need each time, rather than looping over i,j values. If you need all vertical levels, or a subset of them, I would increase the k_chunksize parameter, making it as large as possible while still fitting into memory. The default is 1, which is inefficient when many depth levels but only a small horizontal region are needed.

Finally, if you have other computations that will reduce the dataset, it is best to include those before pulling the values into memory with .values, as in your last line.
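
A minimal sketch of that pattern, assuming the LLC2160 setup from the earlier comment (the k_chunksize value, iteration range, and index slices below are placeholders, not recommendations):

import xmitgcm.llcreader as llcreader

model = llcreader.ECCOPortalLLC2160Model()

# larger vertical chunks so each request pulls many depth levels at once;
# make k_chunksize as large as memory allows (the default is 1)
ds = model.get_dataset(varnames=['Theta', 'SALT'],
                       iter_start=92160, iter_stop=92160 + 365*1920,
                       iter_step=1920, k_chunksize=45)

# take the whole horizontal window of interest in one slice (placeholder indices)
region = ds.Theta.isel(face=7, j=slice(200, 500), i=slice(1400, 1800))

# reduce the data (here a simple depth mean) before pulling it into memory
region_mean = region.mean('k').values
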

Shirui-peng commented 1 year ago

Hi @timothyas, thank you for the response and help! Ideally, we will need all grid points nearest to the 20k+ Argo profile locations in the Kuroshio region. We need all vertical levels, but we want to reduce the vertical dimension with some mode-weighted averaging. Inspired by your insights, it seems to me that one way is to access the entire horizontal slice with a large enough k_chunksize at each time step, and to include the vertical averaging computation before calling .values. Do you think this is a reasonable approach?
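
A rough sketch of that workflow, assuming the (face, j, i) indices nearest to the Argo locations have already been computed separately (the index arrays and the vertical weights below are made-up placeholders):

import numpy as np
import xarray as xr
import xmitgcm.llcreader as llcreader

model = llcreader.ECCOPortalLLC2160Model()
ds = model.get_dataset(varnames=['Theta', 'SALT'],
                       iter_start=92160, iter_stop=92160 + 365*1920,
                       iter_step=1920, k_chunksize=45)

# placeholder (face, j, i) indices, one triplet per Argo profile location
faces = xr.DataArray(np.array([7, 7, 10]), dims='profile')
jj = xr.DataArray(np.array([320, 350, 40]), dims='profile')
ii = xr.DataArray(np.array([1600, 1650, 120]), dims='profile')

# vectorized pointwise extraction: result has dims (time, k, profile)
theta_pts = ds.Theta.isel(face=faces, j=jj, i=ii)

# placeholder vertical weights standing in for the mode structure;
# reduce the vertical dimension lazily, then load only the small result
weights = xr.DataArray(np.ones(ds.sizes['k']), dims='k')
theta_mode = theta_pts.weighted(weights).mean('k').values
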

timothyas commented 1 year ago

That makes sense to me. Please let us know how it goes!