Add/Implement data source mirrors

observingClouds commented 3 years ago

Hi guys, I'm just in the process of uploading a new version of the radiosonde dataset. This time, it is not a tar archive, but the level1 and level2 data can be directly accessed through the AERIS THREDDS server.

@leifdenby do you want to update your zarr files, switch to the AERIS THREDDS server (https://observations.ipsl.fr/thredds/catalog/EUREC4A/PRODUCTS/MERGED-MEASUREMENTS/RADIOSOUNDINGS/v3.0.0/level2/catalog.html), or, even better, add both sources for better availability in case one server is down?

I'll make an announcement in the data-channel when the upload is final. Cheers!

leifdenby commented 3 years ago

thanks @observingClouds! I was actually thinking that maybe I should remove my zarr-based mirrors from the main repository and we just use AERIS directly instead. What do you think? I'm happy to keep my zarr-based catalog available, but maybe I'll put that on a separate repository that we can link to from this main one? Maybe in mirrors.leifdenby_zarr or something like that? What do you think @d70-t?

observingClouds commented 3 years ago

Well, as long as you can keep the files up to date (and I don't expect to reprocess them soon) and/or make sure users can see which version they are using (DOI), it might actually be great to still have that resource in case AERIS is down. It would be great if one could have several possible resources in the catalog and intake switched between them (semi-)automatically, but I guess this is not yet implemented? You guys probably know more.
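To make the "(semi-)automatic switching" idea concrete, here is a minimal sketch in plain Python. It is not an intake feature: `open_first_available` and the two stand-in openers are hypothetical names used only for illustration. In practice the openers might wrap calls that open the AERIS data and the zarr mirror.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def open_first_available(
    sources: Iterable[Tuple[str, Callable[[], T]]]
) -> Tuple[str, T]:
    """Try each (label, opener) pair in order and return the first success.

    Hypothetical helper, not part of intake or eurec4a.
    """
    errors = []
    for label, opener in sources:
        try:
            return label, opener()
        except Exception as exc:  # a real version would catch narrower errors
            errors.append(f"{label}: {exc!r}")
    raise RuntimeError("no source could be opened: " + "; ".join(errors))


# Stand-in openers that simulate one server being down.
def aeris_down():
    raise OSError("server unavailable")


def zarr_mirror():
    return {"source": "zarr mirror"}


label, ds = open_first_available([("AERIS", aeris_down), ("mirror", zarr_mirror)])
print(label)
```

Returning the label alongside the dataset lets the caller record which source (and hence which version) was actually used.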

d70-t commented 3 years ago

I think references to Aeris should go into the catalog. However, having an active backup is also a very good idea. There is already some progress in intake/intake#557 on providing multiple locations for one dataset, but it is not done yet.

Having a mirror structure could be an addition, but I am not so sure we really want that. As a result, users would have to specify some form of path manually again, and most likely we'll end up with a couple of scripts being passed around which only access the "mirror" tree. This can become particularly problematic if the mirror is not complete, such that some datasets will effectively work only on the main tree while others will probably only work on the mirror tree...

leifdenby commented 3 years ago

So, in the meantime (before mirroring is available) we could just go ahead and replace the entry backed by my server with the data on AERIS? I think adding a data_mirrors for now might be quite nice to keep this "backup" available. Does that sound ok?

d70-t commented 3 years ago

Puh... I really find this one hard to decide.

- mirroring is absolutely something we should have. The OPeNDAP endpoint at Aeris had an uptime of 67% during the last two weeks.
- having more than one possible path to a dataset of which sometimes one and sometimes the other works kind of defeats the purpose of the catalog (which to my mind is saving the user from pasting in urls or custom root folders or the like)
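As a rough illustration of why a mirror helps, the chance that at least one source is reachable can be estimated under the assumption that outages are independent. The 67% figure is the one quoted above; the 90% uptime for a mirror is a purely hypothetical value, and `combined_availability` is an illustrative helper, not part of any library.

```python
from math import prod


def combined_availability(uptimes):
    """Probability that at least one source is up, assuming independent outages."""
    return 1.0 - prod(1.0 - u for u in uptimes)


# 0.67 is the quoted uptime; 0.90 is a hypothetical mirror uptime.
print(f"{combined_availability([0.67, 0.90]):.3f}")  # -> 0.967
```

If outages are correlated (for example, both servers share a network path), the real gain would be smaller.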

I have to 🤷 and hope that others have better arguments.

leifdenby commented 3 years ago

> having more than one possible path to a dataset of which sometimes one and sometimes the other works kind of defeats the purpose of the catalog (which to my mind is saving the user from pasting in urls or custom root folders or the like)

Ah yes, you're absolutely right. I hadn't thought of that. We could instead adopt a convention of adding {product}__mirror entries in the catalog? E.g. we'd have radiosondes/bco__mirror. It's not pretty, but at least it's "nearby" in the catalog tree, so should make it easier to find.
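A small sketch of how the `{product}__mirror` naming convention could be checked for completeness. It works on a plain list of entry names rather than a real catalog; `mirror_name`, `entries_without_mirror`, and the `meteor` entry are hypothetical and only serve the illustration.

```python
MIRROR_SUFFIX = "__mirror"  # suffix from the proposed convention


def mirror_name(entry: str) -> str:
    """Name the mirror entry would have under the proposed convention."""
    return entry + MIRROR_SUFFIX


def entries_without_mirror(names):
    """Return primary entries that have no matching '__mirror' sibling."""
    names = set(names)
    primaries = {n for n in names if not n.endswith(MIRROR_SUFFIX)}
    return sorted(n for n in primaries if mirror_name(n) not in names)


# Hypothetical listing of one catalog level; only "bco" has a mirror.
names = ["bco", "bco__mirror", "meteor"]
print(entries_without_mirror(names))  # -> ['meteor']
```

A check like this could flag an incomplete mirror before users run into an entry that only works from one source.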

d70-t commented 3 years ago

> We could instead adopt a convention of adding {product}__mirror entries in the catalog?

I don't know if this makes the situation better or worse... If we implement this, then a user would need to access the data using something like:

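import eurec4a

def reliable_to_dask(cat, entry):
    # Try the primary entry first; fall back to its "__mirror" sibling on failure.
    try:
        return cat[entry].to_dask()
    except Exception:  # not a bare except, which would also swallow KeyboardInterrupt
        return cat[f"{entry}__mirror"].to_dask()

cat = eurec4a.get_intake_catalog()
### some more code
ds = reliable_to_dask(cat.ATR, "track")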

This could avoid creating a ton of hard-coded cat = cat.mirror lines, but it is also not entirely beautiful. And if instead people start to sprinkle around things like ds = cat.ATR.track__mirror18 or the like, this will become horrible.

eurec4a / eurec4a-intake

Add/Implement data source mirrors #26