eurec4a / eurec4a-intake

Intake catalogue for EUREC4A field campaign datasets

Name collision for SWIFT buoys #62

Closed RobertPincus closed 3 years ago

RobertPincus commented 3 years ago

There are currently two sources of information for the six SWIFT buoys. I originally introduced a directory for the complete data store, so that the data would be accessed e.g. through cat.swifts.SWIFT16. Commit 574fdd2 introduced track data (from Bjorn?) for many (?) platforms; these are available as cat.SWIFT16.

Seems like it would be better to homogenize these in some way. Looking ahead we may want to plan for distinguishing the track data from more complete datasets.

RobertPincus commented 3 years ago

@observingClouds I would be interested in what you think.

d70-t commented 3 years ago

The problem here has been discussed a bit in #59. With the previous scheme, where "all" SWIFT data are accessed through cat.swifts.SWIFT16, it is not possible to add more datasets to the SWIFT16 platform. If the data were instead accessed as cat.SWIFT16.all, this would be compatible with cat.SWIFT16.track and the like, and thus more open to future additions to the data collection. That is what I currently have in mind.
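
As a rough illustration of the two layouts, here is a minimal sketch that uses plain nested dicts as stand-ins for catalog entries. It does not use intake itself, and the dataset labels and the `some_future_dataset` key are hypothetical placeholders.

```python
# Plain dicts standing in for catalog entries (not real intake objects).
# Grouped layout: cat.swifts.SWIFT16 points directly at the complete data.
grouped = {"swifts": {"SWIFT16": "complete data"}}

# Per-platform layout: cat.SWIFT16.all and cat.SWIFT16.track sit side by side.
per_platform = {"SWIFT16": {"all": "complete data", "track": "track data"}}

# In the grouped layout SWIFT16 is a leaf, so a second dataset has nowhere to go.
grouped_is_leaf = not isinstance(grouped["swifts"]["SWIFT16"], dict)

# In the per-platform layout a further dataset is just another key
# (the key name here is a hypothetical placeholder).
per_platform["SWIFT16"]["some_future_dataset"] = "placeholder"
datasets = sorted(per_platform["SWIFT16"])
```

In this sketch, extending the per-platform layout leaves the existing `all` and `track` keys untouched.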

d70-t commented 3 years ago

I am not sure if the intake catalog hierarchy is a good place to distinguish between things. The hierarchy on its own carries very little information and it will always be subjective (at least partially) as there is no way of defining the one true best hierarchy.

RobertPincus commented 3 years ago

@d70-t Agreed that the intake catalog hierarchy doesn't carry much information, but we want to make it easy for users to predict where data are likely to be found.

For the SWIFTS I am happy to adopt the idea of track and all. I might add the measurement frequency if this also distinguishes them.

A second question would be whether to treat each SWIFT as an independent platform, as the tracks currently do, or to treat them as a collection, as the complete data currently do. I don't have a strong opinion (maybe @observingClouds or @tmieslinger do) and will adopt whichever idea makes sense - but adopting either convention will require a change.

d70-t commented 3 years ago

The current idea is that there is a platform hierarchy level in which all platforms sit side by side. In my mind, anything that can move independently qualifies as a platform, which would be a hint to keep it as it stands now.


However, I am feeling less and less confident that it is valuable to formally distinguish between the concepts of platform and instrument at the data-structure level, and to state that each instrument belongs to a platform. This creates a couple of problems which are not visible at first sight, but become more and more troublesome as we go on. Some examples would be:

My current thinking is that we should probably only have things instead of platforms and instruments. Those things could be more platform-ish or more instrument-ish or both, e.g. by defining kinds in a similar way as we've done for the flight segments. And things could relate to other things. I'd assume that those relations would need to have attributes as well, e.g.:

One question would be whether these ideas are a more sensible choice.


The issue would be whether and how this kind of thinking could be mapped onto an intake catalog. As the catalog is inherently a hierarchy, I have some doubts whether that would be possible. The question would then be whether we should (in the long run) generate a complex hierarchy from the suggested graph-like structure, or try to keep the intake catalog flatter, such that one could eventually just extract the interesting identifiers from the graph and use them to query a (relatively) flat catalog.
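
To make the second option a bit more tangible, here is a minimal sketch of "extract identifiers from a graph, then query a flat catalog". Everything in it is an assumption for illustration: the thing names `example_sensor` and `example_ship`, the `mounted_on` role, the `related_ids` helper, and the dict standing in for the catalog are all hypothetical and not part of any existing EUREC4A metadata.

```python
# "Things" carry kinds rather than being forced into platform or instrument.
things = {
    "SWIFT16": {"kinds": {"platform"}},
    "example_sensor": {"kinds": {"instrument"}},             # hypothetical
    "example_ship": {"kinds": {"platform", "instrument"}},   # hypothetical
}

# Relations between things, each with its own attributes (here only a role).
relations = [
    {"source": "example_sensor", "target": "SWIFT16", "role": "mounted_on"},
]

# A flat catalog: one top-level entry per identifier, no grouping level.
flat_catalog = {
    "SWIFT16": "SWIFT16 datasets",
    "example_sensor": "example_sensor datasets",
}


def related_ids(target, role):
    """Return identifiers of things that relate to `target` with `role`."""
    return [
        r["source"]
        for r in relations
        if r["target"] == target and r["role"] == role
    ]


# Walk the graph to collect identifiers, then look them up in the flat catalog.
ids = ["SWIFT16", *related_ids("SWIFT16", "mounted_on")]
selected = {i: flat_catalog[i] for i in ids if i in flat_catalog}
```

In this sketch the graph decides what belongs together, and the catalog only has to resolve identifiers to data.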

RobertPincus commented 3 years ago

@d70-t I'm not sure how careful we need to be in defining the intake catalog hierarchy. Its main value, it seems to me, is in making it possible to access data without concern for the details of the server, data format, naming convention, etc. Most users won't mind whether Poldirad is a platform or an instrument. Grouping the SWIFTS together has the advantage of reducing the top-level entries by five (6 vs 1) but either approach is sensible and the number of possible entries is countably small.

For the moment I propose to keep each SWIFT as a single entity at the top level and make two entries for each, one for the track and one for the complete set of data. Sound ok?

d70-t commented 3 years ago

Well, that's kind of contrary to what we did in #59. Of course, that's a bad argument, but maybe this one is a better one: A flat hierarchy is simpler than a hierarchy which uses somewhat arbitrary categories. If we group the SWIFTs together, we probably should do the same with drifters, saildrones, boreals etc... as well. Adding the extra level makes things like:

platforms = get_interesting_platforms_from_somewhere()
for p in platforms:
    plot_track(cat[p].track.to_dask())

quite a bit harder...
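
A minimal sketch of why the extra level gets in the way, again with nested dicts standing in for the catalog. The track strings and the `group_of` mapping are hypothetical placeholders, not anything the real catalog provides.

```python
# Hypothetical catalogs as nested dicts; the strings stand in for track datasets.
flat = {
    "SWIFT16": {"track": "track-SWIFT16"},
    "SWIFT17": {"track": "track-SWIFT17"},
}
grouped = {
    "swifts": {
        "SWIFT16": {"track": "track-SWIFT16"},
        "SWIFT17": {"track": "track-SWIFT17"},
    },
}

platforms = ["SWIFT16", "SWIFT17"]

# Flat layout: the platform name is the only key the caller needs.
flat_tracks = [flat[p]["track"] for p in platforms]

# Grouped layout: the caller also needs a platform -> group mapping,
# which is extra knowledge that has to be maintained somewhere.
group_of = {"SWIFT16": "swifts", "SWIFT17": "swifts"}
grouped_tracks = [grouped[group_of[p]][p]["track"] for p in platforms]
```

Both loops reach the same tracks, but only the grouped one depends on knowing which category each platform was filed under.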

I'd also argue that the list of interesting platforms probably should not come from the intake catalog, as the catalog doesn't really know what a platform is.

That said, @RobertPincus I'd consider you to be way more an expert for the SWIFTs and their use cases, so I'll take your word as the final answer.

RobertPincus commented 3 years ago

@d70-t I think I'm proposing the same approach. "Each SWIFT" means there will be top-level entries for "SWIFT22", "SWIFT17" and so on. Each of these entries will have two elements, "track" and "all" (or something more descriptive). Is this not consistent with #59? Am I missing something?

d70-t commented 3 years ago

Ah, of course... I mixed the two paragraphs in my head. Sorry for the confusion...

RobertPincus commented 3 years ago

Resolved, for the moment, by #63