eurec4a / eurec4a-intake

Intake catalogue for EUREC4A field campaign datasets

Platform tracks #55

Closed observingClouds closed 3 years ago

observingClouds commented 3 years ago

As discussed in #38, the tracks of the platforms shall not be gathered in one folder but should rather be distributed into platform-specific folders. Since the platform tracks are now all available on AERIS, I wanted to make sure that this distribution is still common sense before I create several top-level folders.

For platforms that have twins, I imagine creating just one top-level folder and making the exact identifier a parameter. E.g. create one top-level folder SWIFT and have a main.yaml in there with content like

  track:
    args:
      urlpath: https://observations.ipsl.fr/thredds/dodsC/EUREC4A/PRODUCTS/TRACKS/EUREC4A_tracks_{platform}_v1.0.nc
      auth: null
      chunks: {}
      engine: netcdf4
    driver: opendap
    description: SWIFT tracks
    metadata:
      tags:
        - track
    parameters:
      platform:
        description: specific SWIFT platform
        type: str
        default: SWIFT16
        allowed: [SWIFT16, SWIFT17, SWIFT22, SWIFT23, SWIFT24, SWIFT25]
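To illustrate the user side of this draft: assuming the SWIFT entry above were merged, a user would open the catalog and pass the parameter, e.g. `cat.SWIFT.track(platform="SWIFT17").to_dask()`. The sketch below mimics what intake does internally with `parameters` (validate against `allowed`, substitute into the urlpath template); `resolve_track_url` is a hypothetical stand-in, not part of intake.

```python
# Values copied from the YAML draft above.
ALLOWED = ["SWIFT16", "SWIFT17", "SWIFT22", "SWIFT23", "SWIFT24", "SWIFT25"]
URLPATH = (
    "https://observations.ipsl.fr/thredds/dodsC/EUREC4A/PRODUCTS/TRACKS/"
    "EUREC4A_tracks_{platform}_v1.0.nc"
)


def resolve_track_url(platform="SWIFT16"):
    """Mimic intake's parameter handling: check `allowed`, fill the template."""
    if platform not in ALLOWED:
        raise ValueError(f"{platform!r} is not one of {ALLOWED}")
    return URLPATH.format(platform=platform)
```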

Any opinions @d70-t @RobertPincus ?

d70-t commented 3 years ago

Thanks @observingClouds for pushing this forward :+1:.

First, one thing which might be clear but is probably good to re-state: in my opinion, the file / folder structure in this repository doesn't really matter. It matters a bit, because an editor of this repository must somehow be able to find the definitions to edit, but it should not matter for any user, because I would expect that a user never really sees the content of this repository. The one most important thing is the structure a user sees when accessing the data via the catalog. If the file structure aligns with that, that is a (very nice) bonus.

The next aspect which I think is important is that things need to be unified where possible, in order to make life easier for users accessing the data (not for those creating it; hopefully the data will be accessed more often than created). There are of course multiple possibilities to create a clever dataset hierarchy, and it is quite possible that all of them are bad in some sense. However, I really want to avoid mixing more than one hierarchical concept within one catalog as long as possible. As we have already opted for putting things which clearly belong to a platform below the platform, we should stick with this. Thus, it should be cat.<platform>.track and not cat.track.<platform>.

Based on the same reasoning, as a user, I would expect to obtain track data from any platform using the same method. According to the naming scheme above, I would expect that the following is possible for all platforms:

platform_id = get_platform_id_from_somewhere()
trackdata = cat[platform_id].track.to_dask()

This is (a) another reason why platform ids must be unique and (b) a reason against using the id as a parameter. Have a look at how the code would have to look if the twin solution were in place:

platform_id = get_platform_id_from_somewhere()
if has_twins(platform_id):
    trackdata = cat[group_of_twins(platform_id)].track(platform=platform_id).to_dask()
else:
    trackdata = cat[platform_id].track.to_dask()

This looks a lot more complicated, and most likely someone starting to analyze e.g. P3 data will not get it right before discovering that SWIFT16 data might be interesting as well.

So this is my opinion, I am happy to see more comments 😃

observingClouds commented 3 years ago

First one thing which might be clear, but it is probably good to re-state: in my opinion, the file / folder structure in this repository doesn't really matter. It matters a bit, because somehow an editor of this repository must be able to find the definitions to edit, but it should not matter for any user, because I would expect that a user never really sees the content of this repository. The one most important thing is the structure a user sees when accessing the data via the catalog. If the file structure aligns to that, that is a (very nice) bonus.

I'm just hesitant to create 59 (!) folders and think that, in this case, it might be beneficial (for the repository) to group platforms of a similar type together, e.g. seagliders, buoys, ...

d70-t commented 3 years ago

I don't see a fundamental problem in creating 59 folders, especially as usability counts more than create-ability. However, there may be a (small?) performance implication when many individual catalog files have to be requested from the server. One way out may be the nested yaml catalog plugin: using it, we could potentially specify a whole hierarchy within one catalog file. The downside would be pulling in another dependency.

If it is only about the folders within the repository, then a way out may be to group the files in subfolders which are different from the hierarchy presented by the intake catalog. I.e:

cat.SWIFT16 -> /SWIFT/SWIFT16.yaml
cat.SWIFT17 -> /SWIFT/SWIFT17.yaml
cat.SWIFT22 -> /SWIFT/SWIFT22.yaml
cat.HALO -> /HALO/main.yaml
...

This would reduce the number of folders at the top level (and probably also in general), but it probably makes editing the catalog less obvious.
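If the repository layout and the catalog hierarchy are decoupled like this, the glue could be an ordinary top-level catalog whose entries point into the subfolders. A hypothetical fragment (paths illustrative; `yaml_file_cat` is intake's driver name for a YAML file catalog, and `{{CATALOG_DIR}}` is intake's built-in placeholder for the directory of the current catalog file):

```yaml
sources:
  SWIFT16:
    driver: yaml_file_cat
    args:
      path: "{{CATALOG_DIR}}/SWIFT/SWIFT16.yaml"
  SWIFT17:
    driver: yaml_file_cat
    args:
      path: "{{CATALOG_DIR}}/SWIFT/SWIFT17.yaml"
  HALO:
    driver: yaml_file_cat
    args:
      path: "{{CATALOG_DIR}}/HALO/main.yaml"
```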

observingClouds commented 3 years ago
cat.SWIFT16 -> /SWIFT/SWIFT16.yaml
cat.SWIFT17 -> /SWIFT/SWIFT17.yaml
cat.SWIFT22 -> /SWIFT/SWIFT22.yaml
cat.HALO -> /HALO/main.yaml
...

That's exactly how I had imagined it 👍

RobertPincus commented 3 years ago

Not to be a stick in the mud, but I wonder if all the flight track data need to be carried separately. At least for the platforms for which I'm responsible, the position information is now carried in other files (e.g. in cat.P3.flight_level). What's the motivation for having it in yet another, separate place?

RobertPincus commented 3 years ago

Or might we just point the entries in the track level to the corresponding entries in e.g. flight_level?

d70-t commented 3 years ago

Well, that's probably a whole new discussion, but an interesting one. I think the intent of the track dataset is to provide a simple facility to quickly get an overview of all the platform positions, using code which is independent of the individual platform. So what we want is probably a function like (@observingClouds correct me if I am getting the intent wrong!):

get_track_data(platform_id, start=None, end=None) -> standardized Dataset

There are at least four possibilities to get to that point:

(a) has the advantage that it makes things really easy for users, and the disadvantage that it duplicates data and possibly removes a bunch of the fine details. This is what this issue suggests. A corresponding implementation would look like:

def get_track_data(platform_id, start=None, end=None):
    return cat[platform_id].track.to_dask().sel(time=slice(start, end))

(b) and (c) have the advantage that all the details can be kept and that data is not unnecessarily duplicated.

A corresponding implementation for (b) would probably look like:

import xarray as xr

def get_track_data(platform_id, start=None, end=None):
    # find_primary_location_datasets, open_dataset and
    # reformat_according_to_conventions are placeholders here.
    dataset_references = find_primary_location_datasets(platform_id, start=start, end=end)
    raw_concat_ds = xr.concat(
        [open_dataset(ref) for ref in dataset_references], dim="time"
    ).sel(time=slice(start, end))
    return reformat_according_to_conventions(raw_concat_ds)

This would require some proper metadata sets which uniformly describe how we can translate from the generic "dataset of this platform" to "this dataset link". I'd like to have this and am thinking about how it might be done properly, but there are many subtleties to trip over, e.g.: was the platform operating continuously or not? Are files split by leg, flight_id, day or the whole mission? Are the files associated with the platform or with a sensor on the platform?
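Purely as a strawman, such a per-platform metadata record (which does not exist yet; all keys below are hypothetical) might look like:

```yaml
# Hypothetical, not an existing convention: uniform per-platform metadata
# describing where the primary location data lives and how it is split.
platform_id: P3
primary_location_data:
  split_by: flight_id        # or: leg | day | mission
  attached_to: platform      # or: sensor
  datasets:
    - catalog_entry: P3.flight_level
```

A generic get_track_data() could then walk such records instead of hard-coding per-platform knowledge.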

A corresponding implementation for (c) would probably look like:

def load_and_format_P3_tracks(start, end):
    ...

TRACK_LOADER_METHODS = {
    "P3": load_and_format_P3_tracks,
    "HALO": load_and_format_HALO_tracks,
    # ...
}

def get_track_data(platform_id, start=None, end=None):
    return TRACK_LOADER_METHODS[platform_id](start, end)

This is the path we took with the flight-segmentation datasets. It requires a lot of user code which must be made available to everyone.

(d) will probably never work 😀


There are, however, more things to consider. If a user is only interested in getting an overview, chances are high that they don't want the full-resolution dataset, so there should probably be different levels of precision. As far as I have seen up to now, the track datasets are at a really low temporal resolution, so they would fit the overview use case, but users most likely don't want to use them for "real science"...
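Offering several levels of precision would mostly mean publishing the same track at more than one temporal resolution; the decimation itself is cheap. With xarray one would likely write something like ds.resample(time="10min").mean(); the following is a dependency-free sketch of the same binning idea (function name and interface are illustrative only):

```python
def downsample(times, values, bin_seconds):
    """Average (time, value) samples into coarse bins of bin_seconds.

    times: seconds since some epoch, ascending; values: same length.
    Plain-Python stand-in for xarray's ds.resample(time=...).mean().
    """
    bins = {}
    for t, v in zip(times, values):
        key = int(t // bin_seconds)
        bins.setdefault(key, []).append(v)
    # one averaged sample per occupied bin, labelled with the bin start time
    return [(k * bin_seconds, sum(vs) / len(vs)) for k, vs in sorted(bins.items())]
```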

Also, not all users use Python... While we heavily lean on the pythonic way of doing things, I'd like to keep in mind that it might become important to access our metadata structures from another programming language.


So my impression is that it is kind of nice to have this simple facility, based on reformatted datasets. We could probably argue about whether the name track should be used for artificially downsampled data, or whether such a prominent name should redirect to the best available data. But then, how do we cope with data which is split up into segments?

observingClouds commented 3 years ago

From a user experience point of view, you probably would like to access all the data of a platform just by cat.<platform> and then select which geo-temporal slice and variables you would like to look at. In this case, all the data is on the same geo-temporal grid and the user does not have to worry about locating the data. So from that point of view, it would be great if all the other instruments onboard the P3, e.g. the radar, were within the same file and on the same geo-temporal coordinates.

However, since the datasets are not released at the same time and sometimes might not even be geo-located, the separation into variable sets might lead to more consistent access patterns. The P3 flight_level entries contain e.g. temperature, probably because this is regarded as auxiliary data provided by the onboard sensors. Should a glider that potentially measures several temperatures of the air and ocean include those in its flight_level entry? Those data might need quality checks or further processing that would delay the publication of the whole dataset, so smaller entities would be more beneficial to foster analysis and would lead to more informative accessors, e.g.

cat.<platform>.track
cat.<platform>.meteorological_state
cat.<platform>.oceanographical_state

We could (should?) open a discussion on the different accessors (@d70-t point d 😬 ). However, I would also hope that we might go even beyond the actual catalog structure and just be able to search for the datasets we're interested in, which of course only works if consistent tags are used. Currently we don't have many tags though, and cat.search() doesn't seem to take the tags in the catalogs into account.

This works though :)

>>> list(cat.search("track"))
['ATR.track', 'HALO.track']
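Until cat.search() honours tags, a small client-side helper could filter on the metadata instead. This is a sketch: it assumes one first collects each entry's metadata dict into a plain mapping (e.g. by walking the catalog and reading entry.describe()["metadata"]; the collection step is not shown because the exact API surface varies between intake versions), and find_by_tag is a hypothetical name:

```python
def find_by_tag(entry_metadata, tag):
    """Return dotted entry names whose metadata lists `tag`.

    entry_metadata: mapping of dotted entry name -> metadata dict,
    e.g. {"HALO.track": {"tags": ["track"]}, ...}; entries without
    metadata may map to None.
    """
    return sorted(
        name
        for name, meta in entry_metadata.items()
        if tag in (meta or {}).get("tags", [])
    )
```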

Anyway, I think the track data as I want to include it is at least valuable for quick evaluations, but to geo-locate some measurements especially those from fast moving platforms, additional data is necessary. So the tracks are a product ;)

RobertPincus commented 3 years ago

So many things to think about...

  1. As this is the repository for the intake catalog, isn't it ok to be Python-centric?
  2. For both the P3 and the Ron Brown there are multiple sets of files. Often these have different time coordinates according to the measurement frequency. We've decided that each will contain geolocation information (I didn't much like this, but I see why it's the simplest option).
  3. I agree with the idea of tracks as a product, and then it's sensible to have them as separate files with their own catalog entry.
d70-t commented 3 years ago

So as far as I can see, currently the only really Python-centric thing in this repository is the way string interpolation is used to format parameters into parts of a URL. All the other pieces of the catalog seem to be very generic and could in principle be implemented easily in other languages as well. I thought I had seen some posts about people doing this, but I couldn't find them anymore. Also, apparently people are thinking of moving interpolation into an optional extension, which emphasizes that this is not the "core" of intake.

I think it would be very desirable to have a language-agnostic catalog in the long run (be it intake or not), as this would for example facilitate building some JavaScript which lets you look directly into a dataset without the need to prepare it in a special way.

Apart from this, I agree as well :+1: