MITgcm / xmitgcm

Read MITgcm mds binary files into xarray
http://xmitgcm.readthedocs.io
MIT License

Reading regional llc4320 files #253

Open kdrushka opened 3 years ago

kdrushka commented 3 years ago

Hi all,

I am trying to load regional llc4320 files (on Pleiades, or locally after downloading from the ECCO data portal). The files are binary but are not in MDS format.

I understand that xmitgcm cannot (yet) be used to read regional llc4320 files. I am wondering if anyone can point me to an example of code that does read these files, or if anyone is working on implementing this feature into xmitgcm.

Many thanks,

Kyla

rabernat commented 3 years ago

Hi @kdrushka, welcome to xmitgcm! 👋 (I swear there are other people on these forums, it's not just me talking to myself! 🤓 )

It sounds like your issue is very similar to #213. In that issue @isgiddy was trying to read the regional LLC files as well. We could try pinging her to see if she made any headway.

In the meantime, I'd like to dig into the details of the regional files a bit more. In order to read these files (with xmitgcm, plain Python, or Fortran, for that matter), we need to know what is inside! For the global LLC configurations, this information is basically hard-coded into xmitgcm:

https://github.com/MITgcm/xmitgcm/blob/ea0a18e1ef2470669e381ae9c51e2c0b052071e9/xmitgcm/llcreader/known_models.py#L62-L68

As you noted, without some additional information (like .meta files), it's nearly impossible to decode these raw binary files. I see like 30 different regions in the ECCO data portal. At minimum, we need to know the shape of the grid (e.g. Nx, Ny, Nz). Do you know if any documentation like this exists? If we can figure that part out, the rest should be straightforward.

kdrushka commented 3 years ago

Thanks @rabernat! I will follow up on #213.

> As you noted, without some additional information (like .meta files), it's nearly impossible to decode these raw binary files. I see like 30 different regions in the ECCO data portal. At minimum, we need to know the shape of the grid (e.g. Nx, Ny, Nz). Do you know if any documentation like this exists? If we can figure that part out, the rest should be straightforward.

An easy way to get the grid shape is from the hFacC* filename in the region/grid/ folder; e.g. in the CalSWOT2 region the file is called hFacC_768x774x87, so Nx=768, Ny=774, Nz=87. It should be straightforward to parse that filename to generalize reshaping the regional binary data.
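That filename-parsing step could look something like the following sketch (the function name and the default `nz=1` for 2-D fields are my own choices, not part of any existing API):

```python
import re

def grid_shape_from_filename(fname):
    """Parse (nx, ny, nz) from a grid filename like 'hFacC_768x774x87'.

    Assumes the trailing token follows the 'NXxNY' or 'NXxNYxNZ'
    convention described above; 2-D fields get nz=1.
    """
    match = re.search(r'_(\d+)x(\d+)(?:x(\d+))?$', fname)
    if match is None:
        raise ValueError(f"no shape suffix found in {fname!r}")
    nx, ny, nz = match.groups()
    return int(nx), int(ny), (int(nz) if nz else 1)

print(grid_shape_from_filename('hFacC_768x774x87'))  # (768, 774, 87)
print(grid_shape_from_filename('Depth_768x774'))     # (768, 774, 1)
```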

(Note that grabbing the timestamp for each data file is slightly trickier, as different regions use different naming conventions - e.g., for CalSWOT2 regional files, the timestamp appears as the prefix in the file names whereas for OKMC regional files the date appears at the end of each filename).
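Since the timestamp's position varies by region, one tolerant approach is to search for a long digit run anywhere in the filename rather than assuming prefix or suffix. A sketch with hypothetical filenames (the real names on the portal may differ):

```python
import re

def find_timestamp(fname):
    """Return the longest digit run in a filename, or None.

    Regions disagree on where the timestamp goes (prefix for CalSWOT2,
    suffix for OKMC), so we search the whole name instead of anchoring
    to either end. The example filenames below are hypothetical.
    """
    runs = re.findall(r'\d{6,}', fname)
    return max(runs, key=len) if runs else None

print(find_timestamp('U_0000010368.data'))  # '0000010368' (suffix style)
print(find_timestamp('0000010368_U.data'))  # '0000010368' (prefix style)
```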

rabernat commented 3 years ago

It looks like the shape information encoded in the filenames is all we need.

```python
import fsspec
import numpy as np
from matplotlib import pyplot as plt

# Shape comes from the filename: Depth_768x774
url = 'https://data.nas.nasa.gov/ecco/download_data.php?file=/eccodata/llc_4320/regions/CalSWOT2/grid/Depth_768x774'
with fsspec.open(url, mode='rb') as f:
    raw_data = f.read()

nx = 768
ny = 774
dtype = '>f4'  # big-endian 32-bit float, 4 bytes per value
assert len(raw_data) == ny * nx * 4

shape = (ny, nx)
data = np.frombuffer(raw_data, dtype=dtype).reshape(shape)

plt.pcolormesh(data)
```

[figure: pcolormesh plot of the CalSWOT2 Depth field]

You could get started right now with that code snippet and begin looking at the data. However, you will end up writing lots of boilerplate to put all of the individual data files together into a useful dataset.

If we can get this wired into xmitgcm, that will all be automatic, and you will get back a nice Xarray dataset. There are two tricky coding tasks involved:
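The kind of boilerplate involved might look like this sketch, which decodes two snapshots and stacks them into one Dataset (the synthetic in-memory bytes here stand in for the portal files, and the dimension names are my own guesses):

```python
import numpy as np
import xarray as xr

# Shape follows the hFacC_768x774x87 filename convention described above.
nx, ny = 768, 774

def decode_slab(raw_bytes):
    """Decode one big-endian float32 slab into a (ny, nx) array."""
    return np.frombuffer(raw_bytes, dtype='>f4').reshape(ny, nx)

# Stand-ins for the contents of two hypothetical snapshot files.
raw_files = [np.full((ny, nx), t, dtype='>f4').tobytes() for t in (0.0, 1.0)]

ds = xr.DataArray(
    np.stack([decode_slab(raw) for raw in raw_files]),
    dims=('time', 'j', 'i'),
    name='U',
).to_dataset()

print(ds['U'].shape)  # (2, 774, 768)
```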

kdrushka commented 3 years ago

Awesome, thanks @rabernat. I am happy to document the formatting for each region. I will also check in with @menemenlis since I believe he has created most (all?) of the regional extractions.

menemenlis commented 3 years ago

Kyla and Ryan, the files were extracted using Bron Nelson's extract program (pfe:~bcnelson/MITgcm/extract/latest/). I am not Python-literate, but I can do a number of things to help: I can add readme.txt files that describe format, content, and units. I can provide Matlab example code to read them. And I could create MITgcm meta files to describe their content. Let me know if any of the above would be useful.

kdrushka commented 3 years ago

Thanks, Dimitris. @rabernat, could you clarify: if @menemenlis can provide .meta files for the regional data, will it be possible to use llcreader as is, or would this require significant coding effort on your end? If the latter, I wonder if there will be enough use of the regional simulations to make the effort worthwhile.

For context: we have extracted llc4320 output in ~10 regions where field campaigns are planned as part of the SWOT Adopt-a-Crossover effort. The model output from these regions will be made available on PO.DAAC in netCDF format, and on the ECCO data portal in binary (same as the existing regional output) -- so, it would be great if xmitgcm could be used without a lot of extra coding effort to read the binary files, but no problem if not.

rabernat commented 3 years ago

Yes, .meta files would make this somewhat easier. But since the existing global LLC files don't use .meta files, it's not needed. More useful would simply be:

  • renaming the files to use a consistent naming convention that is identical to the global LLC data
  • providing a simple, machine-readable file specifying the grid size of each region, e.g. in CSV format

@kdrushka and @menemenlis: are you aware of the effort led by @lesommer with postdoc @roxyboy regarding model data for the SWOT AdAC effort?

https://github.com/roxyboy/SWOT-AdAC-ocean-model-intercomparison

They are planning on collecting regional simulation data from many models and hosting it in a cloud-based repository supported by Pangeo. It seems like a good idea to coordinate efforts here...

kdrushka commented 3 years ago
> • renaming the files to use a consistent naming convention that is identical to the global LLC data
> • providing a simple, machine-readable file specifying the grid size of each region, e.g. in CSV format

Yes, we can do that.

> @kdrushka and @menemenlis: are you aware of the effort led by @lesommer with postdoc @roxyboy regarding model data for the SWOT AdAC effort?
>
> https://github.com/roxyboy/SWOT-AdAC-ocean-model-intercomparison
>
> They are planning on collecting regional simulation data from many models and hosting it in a cloud-based repository supported by Pangeo. It seems like a good idea to coordinate efforts here...

Yes! Our two efforts are quite complementary and we have discussed avenues for coordination.

lily-dove commented 3 months ago

Hi @kdrushka, I was wondering if you've had success with this! I'm trying to read some regional LLC4320 data and would love to know what's worked for others. Cheers!

roxyboy commented 3 months ago

@lily-dove I've read regional sea-surface data from LLC4320 in the past by doing something like:

```python
from xmitgcm import llcreader

model = llcreader.ECCOPortalLLC4320Model()
ds = (
    model.get_dataset(varnames=['U', 'V'], k_levels=[0],
                      iters=iis, type='latlon')
    .sel(j=slice(9555, 10198), j_g=slice(9555, 10198),
         i=slice(15355, 15845), i_g=slice(15355, 15845))
    .isel(time=0, k=0, k_l=0, k_u=0, k_p1=0)
    .chunk({'j': 200, 'i': 200, 'j_g': 200, 'i_g': 200})
)
```

where `iis` is the list of iteration numbers I want.
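One way to build such a list is from the model's iteration spacing; a hedged sketch, where the start value and step (one model hour for LLC4320's 25 s timestep) are assumptions you should check against your llcreader version's model metadata, and the count of 24 snapshots is arbitrary:

```python
# Assumed first available iteration and spacing between hourly snapshots.
iter_start = 10368
iter_step = 144  # 3600 s / 25 s timestep
iis = [iter_start + n * iter_step for n in range(24)]
print(iis[:3])  # [10368, 10512, 10656]
```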