eurec4a / eurec4a-intake

Intake catalogue for EUREC4A field campaign datasets

Dropsonde reorganisation #20

Closed tmieslinger closed 3 years ago

tmieslinger commented 3 years ago

I rearranged the dropsonde dataset links such that the datasets can be accessed through P3 and HALO respectively. Also, I updated the urlpath to the current version v0.7.0 available via IPFS. I hope that the data access through platform_id.dropsondes.JOANNE.levelx is intuitive and can easily be extended with further dataset levels or other dropsonde data products.

d70-t commented 3 years ago

I like the idea.

The failing tests seem to be Aeris again. I'll re-run them tomorrow when the servers are hopefully back online again.

RobertPincus commented 3 years ago

@tmieslinger This looks cool. I don't understand the access method - can you explain for my information? Also, do I understand that dropsondes will be accessed as platform_id.dropsondes.JOANNE.levelx and never by dropsondes.JOANNE.levelx? Is there a reason we need to pick one approach or the other? Seems like both could be valuable.

tmieslinger commented 3 years ago

Hi Robert, thanks for having a look at it! Tobi and Geet linked the JOANNE level3 dataset via IPFS. So by levelx I mean level3, as currently only this dataset is linked. However, it could be possible to have level2 and others as well. I am thinking of heating rates or other derived products that might be available at some point from a different source - next to JOANNE - that could still be found by going through the platform_id.dropsondes.new_product keys. You are absolutely right that the platform_id is not strictly necessary, as JOANNE is a separate dataset. The idea behind the proposed setup is that the dropsondes are instruments from the P3 and HALO platforms and should be findable through the platform_id. But I agree, this is debatable. What do you think about it after reading this? :)

RobertPincus commented 3 years ago

@tmieslinger I like the idea of finding the dropsondes and possibly other products through the platforms. I wonder if it's more flexible or more confusing (or both) to also have the dropsondes available directly, i.e. as dropsondes which one could then sort by platform. I bet @d70-t has thoughts on this.

RobertPincus commented 3 years ago

@tmieslinger PS as part of the PR can you please update the README so it no longer refers to denby.io, which hosted the dropsondes?

d70-t commented 3 years ago

By access method, do you mean IPFS?

That's a not-anymore-so-new-but-not-very-widely-known protocol for sharing data (ipfs.io). We are experimenting with it a little bit and the newest JOANNE dataset was our first candidate for it :-)

The great plus of that method is that the hash-links refer to the content and not to the storage location. This is done in a fully decentralized manner, so in principle, as long as anyone has the dataset you'll be able to get it, no matter if the primary server is offline or not. Also if multiple others have the dataset, you'll be able to fetch it simultaneously from multiple sources and you should be able to get some performance improvements by fetching the data from nearby servers. Thus, I think that zarr-over-IPFS could be a very promising way of accessing data.
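The content-addressing idea described above can be sketched in a few lines. This is only an illustration: a plain SHA-256 digest stands in for an IPFS content identifier (real IPFS CIDs are constructed differently), and the `content_address` and `fetch` helpers as well as the two in-memory "servers" are hypothetical.

```python
import hashlib


def content_address(data):
    """Return a hex digest that identifies the bytes, not where they are stored.

    Assumption: plain SHA-256 stands in for a real IPFS CID here.
    """
    return hashlib.sha256(data).hexdigest()


def fetch(address, sources):
    """Return the content from the first source that holds it, verifying the hash."""
    for source in sources:
        data = source.get(address)
        if data is not None and content_address(data) == address:
            return data
    raise KeyError(address)


# Two hypothetical storage locations; only the mirror still holds the data.
payload = b"example dropsonde data"
address = content_address(payload)
offline_primary = {}
mirror = {address: payload}

# The address stays valid even though the primary location has nothing to offer.
result = fetch(address, [offline_primary, mirror])
```

Because the address is derived from the bytes themselves, any holder of the data can serve it and the receiver can verify what it got.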

That said, native IPFS support is not yet available for Python, but there's the possibility of running an HTTP-to-IPFS gateway locally on your machine. The current implementation of my little ipfsspec Python library just tries to contact the local gateway and, if it doesn't find it, falls back to one of several public ones. That of course defeats a lot of the promised performance benefits, so having a local (or close) gateway is preferred.
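As a rough sketch of the "local gateway first, public gateway as fallback" behaviour described above (this is not the actual ipfsspec code): the function names, the gateway addresses and the reachability probe below are all assumptions made for illustration.

```python
# Assumptions: these gateway addresses are illustrative, not taken from ipfsspec.
LOCAL_GATEWAY = "http://127.0.0.1:8080"
PUBLIC_GATEWAYS = ["https://ipfs.io", "https://dweb.link"]


def choose_gateway(is_reachable, local=LOCAL_GATEWAY, public=PUBLIC_GATEWAYS):
    """Prefer the local gateway; otherwise use the first reachable public one."""
    for gateway in [local, *public]:
        if is_reachable(gateway):
            return gateway
    raise ConnectionError("no IPFS gateway reachable")


def gateway_url(gateway, cid, path=""):
    """Build a path-style gateway URL of the form <gateway>/ipfs/<cid>/<path>."""
    url = f"{gateway}/ipfs/{cid}"
    return f"{url}/{path}" if path else url


# Simulate a machine without a local gateway; a real probe would issue an HTTP request.
chosen = choose_gateway(lambda gateway: gateway != LOCAL_GATEWAY)
# "<cid>" is a placeholder for a real content identifier.
url = gateway_url(chosen, "<cid>", ".zmetadata")
```

Passing the probe in as a function keeps the selection logic testable without any network access.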

Another word of caution would be that while I am quite surprised how stable the current IPFS releases seem to be, they have not yet released something like version 1.

RobertPincus commented 3 years ago

@d70-t I guess the use of IPFS introduces more dependencies? Is it anything more complicated than conda install ipfs or similar?

RobertPincus commented 3 years ago

@d70-t Because it seems heavy and indirect to ask someone to run a gateway service to access data...

d70-t commented 3 years ago

I think that having all instruments of a platform organized within that platform is a very natural choice.

But JOANNE is kind of special, as it is a multi-platform dataset. So maybe having a direct access method would be interesting as well. But I couldn't yet think of the perfect place to put it.

d70-t commented 3 years ago

@RobertPincus So currently the only thing you would have to do is pip install ipfsspec, which would then access the data via a public gateway. From the user's perspective, that is just a bunch of HTTP requests to a zarr dataset.

If you additionally run a gateway locally, those requests will stay on your own host and will be forwarded directly via IPFS from there.

RobertPincus commented 3 years ago

@d70-t The use of ipfs seems just fine, then.

RobertPincus commented 3 years ago

@tmieslinger @d70-t JOANNE and the radiosondes (and data derived from them, like the radiative profiles) are similar in being cross-platform.

I propose "both" - that is, the data should be available both at the top level (cat.dropsondes) and via the platform (cat.HALO.dropsondes) where the latter would ideally point only to the dropsondes from the platform.
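To make the "both" proposal concrete, here is a minimal sketch that uses nested dictionaries as a stand-in for the catalog. It does not use the intake API; `resolve`, the dictionary layout and the entry contents are hypothetical.

```python
# Hypothetical stand-in for a catalog entry; a real entry would describe a data source.
joanne_level3 = {"description": "JOANNE level3", "platforms": ["HALO", "P3"]}

# The same entry is referenced at the top level and under each platform.
catalog = {
    "dropsondes": {"JOANNE": {"level3": joanne_level3}},
    "HALO": {"dropsondes": {"JOANNE": {"level3": joanne_level3}}},
    "P3": {"dropsondes": {"JOANNE": {"level3": joanne_level3}}},
}


def resolve(cat, dotted_path):
    """Walk a dotted path such as 'dropsondes.JOANNE.level3' through nested dicts."""
    node = cat
    for key in dotted_path.split("."):
        node = node[key]
    return node


top_level = resolve(catalog, "dropsondes.JOANNE.level3")
via_halo = resolve(catalog, "HALO.dropsondes.JOANNE.level3")

# Both paths reach the same full dataset, so the HALO path is not HALO-only.
same_entry = top_level is via_halo
```

With plain aliases like these, the per-platform path still returns soundings from every platform, which is exactly the gap between this sketch and the "ideally" part of the proposal.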

d70-t commented 3 years ago

I would like to see the catalog converging to something which has a more regular (or predictable) structure, which currently in my head says that there should be one level which contains all platforms. Thus, I am wondering if we'll create a mess if we put all cross-platform products next to the platforms. But on the other hand, organizing everything strictly might not really fit what we need.

To properly do this organization, we might have to put all of those into a hierarchy:

But for many datasets it is sensible to skip some of those things. Skipping those things makes the paths more unpredictable but forcing dropsonde or AXBT data to be split into flights seems to be quite silly... Thus :man_shrugging:

On the other hand, if we'd say we like to stick to that order but can skip pieces in between, then cat.dropsondes.JOANNE would be a natural choice as well.

tmieslinger commented 3 years ago

@RobertPincus I added a note on IPFS in the README but did not delete the minio server by @leifdenby as I am unsure whether he might use or want to use it for other datasets.

@RobertPincus @d70-t would it harm to keep the cat.dropsondes.JOANNE access in parallel to platform.dropsondes.JOANNE? And Robert, would you agree to having JOANNE explicitly stated to separate it from further possible dropsonde datasets?

d70-t commented 3 years ago

Pointing to sondes only from a platform seems to be the more logical choice if the data is referenced from within the platform. But then, this would also require duplicating the dropsonde dataset... And requesting only parts of the dataset should be possible if the data is accessed via opendap or zarr... But still, it is kind of weird if one would have to write something like

ds = cat.HALO.dropsondes.JOANNE.level3.to_dask()
ds = ds.isel(sounding=ds.platform=="HALO")

d70-t commented 3 years ago

@tmieslinger I think, keeping Leif's server in there is better. Also I am really not sure yet if zarr-over-IPFS is here to stay :grimacing: :laughing:

RobertPincus commented 3 years ago

@d70-t I agree that predictability is a virtue. The discussion has persuaded me that multi-platform datasets belong next to the platforms, especially since our examples can then be sorted by platform. As there is not an easy way to point e.g. cat.HALO.dropsondes.JOANNE.levelx only to the HALO dropsondes, I propose that they belong in cat.dropsondes.JOANNE.levelx, which I guess is inconsistent with @tmieslinger's reorganization.

@tmieslinger For the moment I find using cat.dropsondes.JOANNE to be a bit redundant. On the other hand I can see that another dropsonde dataset might be produced. What if we imagine that datasets might be multi-platform, multi-instrument, or both? Then perhaps the instrument level of the hierarchy is redundant and we would use cat.JOANNE, so the product and level elements of the hierarchy.

Or am I simply going backwards?

d70-t commented 3 years ago

I think the first rule which we should use as guidance is that the names should be chosen in a way which makes it really unlikely that a move is required later on, because that requires changing client-side scripts, which potentially makes a lot of people unhappy.

For JOANNE, this at least requires that we have the level, because there are multiple of those. This alone necessitates at least one change in naming, because the level3 data should not be named cat.dropsondes if there will be L1, L2 and L4, which may only be available in the future. So it is at least cat.JOANNE.level3, which I wouldn't consider a step backwards.

I can't really make a strong case for cat.dropsondes.JOANNE.level3, but I have two feelings about it:

The biggest question is, how can we anticipate if a path is chosen too coarse to be future-proof? And how do we decide that as we are integrating more and more datasets? -- For JOANNE specifically, I'd say it is really unlikely that we'll be getting another thing which has the same name though.

For cat.HALO.dropsondes.JOANNE.level3 I am inclined to think it doesn't hurt. But based on the above, there's also an argument against it, which would be the case that indeed at some point there might be a dataset which only includes HALO's sondes, which then should take exactly that name.

I'd specifically exclude versioning from this discussion as git already covers that case and people can link to older versions of the catalog already.

RobertPincus commented 3 years ago

@d70-t If we adopt the principle that instruments are included in the hierarchy when applicable, then we arrive at cat.dropsondes.JOANNE.level3 with room to grow both JOANNE and dropsonde categories. This is also consistent with the existing cat.radiosondes.bco etc where we extend your idea of flight/trip to include launch_location.

We would pass on @tmieslinger's suggestion of putting the dropsondes under the platforms until/unless these catalog entries could point to platform-specific subsets of the data. Theresa, this is ok with you?

Worth thinking about, but not requiring resolution at this point, is whether multi-instrument, multi-platform data sets get an explicit level in the hierarchy or whether we skip the instrument level for such data sets.

Also worth thinking about is how to organize derived data sets - here I'm thinking explicitly of the radiative flux profiles computed from the radiosondes and dropsondes plus assumptions and a model.

tmieslinger commented 3 years ago

Hi again, @RobertPincus if we will have a catalog entry pointing to platform-specific subsets of the JOANNE dataset, I would think that this subset has a different name, i.e. not JOANNE, and therefore we won't have ambiguity in cat.HALO.dropsondes.JOANNE.level3. I hope I understood that correctly though.

@d70-t @RobertPincus it seems to me that the discussion is converging to having the dropsonde dataset JOANNE reachable through cat.platform_id.dropsondes.JOANNE and cat.dropsondes.JOANNE. In addition, further (derived) datasets handling dropsondes could be on the same hierarchy level as JOANNE. Is that ok with you?

RobertPincus commented 3 years ago

@tmieslinger I would say rather that we decided against having the dropsondes accessible through the individual platforms until/unless this would point to only the subset associated with the platform. But this is just my opinion.

tmieslinger commented 3 years ago

oh ja, sorry @RobertPincus. I was slightly stuck with the idea of having the access to the JOANNE dataset similar to, for example, Heike Konow's UNIFIED dataset, which includes slightly different dropsonde datasets for HALO. So here, I would like to access the data via cat.HALO.dropsondes.UNIFIED and intuitively I thought that it would be nice to have JOANNE data accessible via cat.HALO.dropsondes.JOANNE. So this is the platform perspective, while I can also see the benefits of having a direct link to JOANNE via cat.dropsondes.JOANNE. Do you think that having both references to JOANNE could be confusing or is it simply bad practice?

d70-t commented 3 years ago

So what I get from the above is that we all agree that cat.dropsondes.JOANNE.level3 should exist. The two entries cat.HALO.dropsondes.JOANNE.level3 and cat.P3.dropsondes.JOANNE.level3 are still up for debate. They should for sure exist if a subset of JOANNE for each platform would be made, but I don't see that coming. They could exist as well if there is a very good reason to have them. But they should not exist if they are useless, misleading or likely to change behaviour in the future.

As established references should be maintained, it may be easier not to create additional references. Thus, to make some progress, I'd propose that we do only cat.dropsondes.JOANNE.level3 within this PR, such that we'll get a likely stable reference to the JOANNE dataset for the future. If later on we find an example for which the access via cat.dropsondes.JOANNE.level3 is really inelegant and accessing through the platform solves that, we would create another PR which would implement cat.HALO.dropsondes.JOANNE.level3 and cat.P3.dropsondes.JOANNE.level3.
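One way to guard the "established references should be maintained" principle is a small check that every path already handed out to users still resolves. The sketch below is an assumption-laden illustration: it models the catalog as nested dictionaries rather than using intake, and `resolve`, `missing_paths` and the `published` list are hypothetical names.

```python
def resolve(cat, dotted_path):
    """Walk a dotted path such as 'dropsondes.JOANNE.level3' through nested dicts."""
    node = cat
    for key in dotted_path.split("."):
        node = node[key]
    return node


def missing_paths(cat, published_paths):
    """Return the published paths that no longer resolve in the catalog."""
    missing = []
    for path in published_paths:
        try:
            resolve(cat, path)
        except (KeyError, TypeError):
            missing.append(path)
    return missing


# Paths that client-side scripts are assumed to rely on.
published = ["dropsondes.JOANNE.level3"]

catalog = {"dropsondes": {"JOANNE": {"level3": "entry"}}}
still_fine = missing_paths(catalog, published)

# Simulate a later reorganisation that moves the entry under a platform only.
reorganised = {"HALO": {"dropsondes": {"JOANNE": {"level3": "entry"}}}}
broken = missing_paths(reorganised, published)
```

A check like this could run in the test suite, so that a move of an established entry shows up as a failing test instead of as broken user scripts.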

RobertPincus commented 3 years ago

I concur with @d70-t's thinking.

tmieslinger commented 3 years ago

Me too :) I added the changes accordingly. Sorry for the lengthy discussion, I learned a lot from it.