eurec4a / how_to_eurec4a

Code examples to get you started with EUREC⁴A data.
https://howto.eurec4a.eu
MIT License

access to (ICON model) output and input #96

Closed felix-mue closed 11 months ago

felix-mue commented 1 year ago

I am interested in data from the ICON model runs (in- and output), but this question could be generalised to other data as well.

  1. Is the output from the ICON model runs publicly available? The eurec4a intake repository folder for ICON contains yaml files with links to dkrz, which I suspect to be the location of the output data, but I do not understand how to access them.
  2. Are the corresponding input settings files publicly available as well? (They might be bundled directly with the output, of course.)

observingClouds commented 1 year ago

Hi @felix-mue, thank you very much for reaching out to us and raising your question here.

To 1) You can access most (if not all) datasets listed on the howto.eurec4a.eu page via our intake catalog, without even needing to know where they are stored 😎.

import eurec4a
cat = eurec4a.get_intake_catalog()
datasets = list(cat.simulations.ICON.LES_CampaignDomain_control)  # show all available entries of a catalog level
ds = cat.simulations.ICON.LES_CampaignDomain_control.surface_DOM01.to_dask()  # lazy loading of data

In addition, this will only download the data that you are actually using in your analysis (keyword: lazy loading). No need to download all the TB of output 🥳
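For example, selecting only a small part of a dataset and then computing it transfers just the chunks covering that selection. A minimal sketch, assuming the dataset has a time dimension (check ds.dims and ds.data_vars for the actual names):

import eurec4a
cat = eurec4a.get_intake_catalog()
ds = cat.simulations.ICON.LES_CampaignDomain_control.surface_DOM01.to_dask()  # lazy, no data transferred yet
subset = ds.isel(time=slice(0, 10))  # still lazy; 'time' is an assumed dimension name
data = subset.compute()  # only the chunks covering this selection are downloaded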

Please try it out! Does this answer your first question?

observingClouds commented 1 year ago

To 2) The run-scripts are available at the experiment repository. Please let me know if you have access to those.

felix-mue commented 1 year ago

Thanks for the quick reply! Yes, I have access to the other repository. I will work through the data handling and the files and get back to you when something comes up.

felix-mue commented 1 year ago

About accessing the data: While lazy loading is great in many situations, for me it would actually be helpful to have one big download of the data (maybe subset by variables). Is that available as well?

observingClouds commented 1 year ago

May I ask what your application is? The latency to access the files here should be fairly low and loading the data lazily ensures that you will always access the latest version.

At https://howto.eurec4a.eu/eurec4a_mip.html we show how you can download data with wget. You can find the paths in the eurec4a catalog files, e.g. here
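If you'd rather look those paths up programmatically than in the yaml files, a sketch like this should work (describe() is part of the intake API; the exact keys of the returned dictionary, such as "args" and "urlpath", may differ between intake versions, so treat them as an assumption):

import eurec4a
cat = eurec4a.get_intake_catalog()
entry = cat.simulations.ICON.LES_CampaignDomain_control.surface_DOM01
info = entry.describe()  # metadata of this catalog entry, including the driver arguments
print(info["args"]["urlpath"])  # storage location you could then pass to wget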

felix-mue commented 1 year ago

A simple barrier, sadly: our code runs in MATLAB, not Python. So I have to access the data from MATLAB, and I assumed that isn't possible with the Python package.

observingClouds commented 1 year ago

Sorry to hear that! Maybe it's time for a change 🥳 MATLAB supports yaml files, so you could read those files and grep the links. But honestly, it seems like you would be reinventing the wheel. MATLAB's Python support might also be something to look into, but I'd be surprised if it works well.

Another issue you might face with MATLAB is that the simulations are saved in the zarr format. It seems like MATLAB has no dedicated driver for this format yet. However, zarr is now, besides HDF5, also a supported backend of netCDF and is supported by the newer library versions. You should therefore be able to load the zarr files (after downloading them) through the netCDF library. The syntax is, however, a bit unusual.

So, here is an example of how you can download a zarr file from the catalog and read it with the netCDF library:

  1. Download the data with wget

    wget -r -H -N --cut-dirs=3 --include-directories="/v1/" "https://swiftbrowser.dkrz.de/public/dkrz_948e7d4bbfbb445fbff5315fc433e36a/EUREC4A_LES/experiment_2/meteograms/EUREC4A_ICON-LES_control_meteogram_DOM03_BCO.zarr/?show_all"

    Note the change of the prefix and ending of the URL compared to the one given in the catalog.

  2. Note that wget creates two directories (swift.dkrz.de, swiftbrowser.dkrz.de). The actual dataset is in swift.dkrz.de.

  3. Construct the path to the downloaded zarr file following the scheme file:///path/to/zarr/file.zarr#mode=xarray, i.e. prefix the absolute path with file:// and append #mode=xarray.

  4. You should be able to use this path with your favourite netCDF tool, e.g.

    ncdump -h "file:///path/to/swift.dkrz.de/experiment_2/meteograms/EUREC4A_ICON-LES_control_meteogram_DOM03_BCO.zarr#mode=xarray"

    Unfortunately, reading a variable from this dataset does not work on my end in this particular case. It might be that the compressor used is not supported (although it seems to be), or that the blosc library (we use lz4 as the compressor here) is not linked to the netCDF library.

ncdump -v time "file:///path/to/swift.dkrz.de/experiment_2/meteograms/EUREC4A_ICON-LES_control_meteogram_DOM03_BCO.zarr#mode=xarray"

returns the metadata and then

data:

NetCDF: Filter error: undefined filter encountered
Location: file ?; fcn ? line 478
 time = % 
observingClouds commented 1 year ago

@d70-t do you have an idea what is going on here? The filter in .zmetadata/.zarray is actually null and deleting it entirely does not help to solve the problem.

d70-t commented 1 year ago

.zmetadata is a zarr-python extension, which as far as I know isn't adopted yet by netCDF. But .zarray is used.

You probably need netCDF >= 4.9 and there are some steps required for setting up netCDF to run with filters.

d70-t commented 1 year ago

Download the data with wget

If you really, really want a download of a subset, I'd probably recommend just opening the data with intake / xarray and then doing something like ds[[vars...]].sel(...).to_netcdf(). But just as @observingClouds said, I haven't yet come across cases in which downloading would be so much better that it would justify the additional hassle involved.
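A minimal sketch of that subset-and-save approach; the variable name and time range below are placeholders, adapt them to what you actually need:

import eurec4a
cat = eurec4a.get_intake_catalog()
ds = cat.simulations.ICON.LES_CampaignDomain_control.surface_DOM01.to_dask()
subset = ds[["some_variable"]].sel(time=slice("2020-01-20", "2020-02-01"))  # placeholder variable and time range
subset.to_netcdf("icon_les_surface_subset.nc")  # a regular netCDF file that MATLAB can read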

felix-mue commented 1 year ago

Thanks a lot to both of you! I agree, of course I'd rather not download. I just didn't see a way to access it otherwise (within matlab).

I will try the cross-accessibility features @observingClouds mentioned, but I also don't have high hopes. I am also downloading some data in parallel to see whether that gets me further.

felix-mue commented 11 months ago

I ended up downloading the data with a Python script and saving them as netCDF files. This is of course unfortunate, because the pythonic way of accessing this data is much more comfortable! Thanks a lot again for your help and for providing the data in the first place!