atmdrops / pydropsonde

too many derived variables #64

Open bjorn-stevens opened 1 week ago

bjorn-stevens commented 1 week ago

Geet and I went back and forth on this point. I'm not sure where I ended up, but at the moment I find myself skeptical of the utility of too many derived variables in the level-3 data set, especially when there is some ambiguity as to what the values mean, which is the case for most.

For instance, is IWV or theta-e the value computed from the primary (non vertically interpolated) data or the value from the interpolated data? More generally, I find the inflation of variables in the level-3 data cumbersome.

I would favor a single quality assessment, i.e., good or questionable, and the amount of postprocessing should be small enough to allow this assessment. It seems that BAHAMAS data is included in this dataset; if so, that doesn't seem like a good idea. For interpolation I would use $q$ and $\theta$ regardless of what thermodynamic coordinates you decide to use in the end.

I would thus prefer that level2 and level3 differ only in the use of a common grid for the latter, and that both be identified with physical dimensions as suggested in https://github.com/observingClouds/pysonde/issues/40 . Level4 could be the derived variables.

hgloeckner commented 1 week ago

At the moment, IWV and theta_e are calculated on the interpolated grid, but that is easy to change and I wasn't sure which way is better.
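
Roughly, the IWV computation is just a vertical integral of q over pressure, so it can be run on either grid. A simplified sketch (not the exact code in pydropsonde; the function name and the surface-to-top ordering of the inputs are just for illustration):

import numpy as np

GRAVITY = 9.81  # m s^-2

def iwv_from_profile(q, p):
    # Integrated water vapour (kg m^-2): IWV = -(1/g) * integral of q dp,
    # with specific humidity q (kg/kg) and pressure p (Pa) ordered surface to top.
    # Sketch only; the actual implementation may differ.
    return -np.trapz(q, p) / GRAVITY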

I know Geet had a similar point before... I don't really see why it is a problem to have more variables. The dataset is still very small, and it's zarr and chunked, so if you don't need some variables you simply don't look at them, or you make a list and select just the variables you want at the start:

my_vars = ['q', 'theta']
ds = ds[my_vars]
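
Since the store is zarr, opening it is lazy anyway, so variables you never select are never read from disk. Roughly (the path here is just a placeholder):

import xarray as xr

ds = xr.open_zarr('path/to/level3.zarr')  # lazy open, nothing is read yet
ds = ds[['q', 'theta']]                   # unselected variables stay on disk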

Regarding ambiguity, that's a temporary state: this dataset is not yet finished, and I agree that in the end the attributes of the variables have to unambiguously describe what's in there. For now, other things seemed more important.

In the end there will only be good, bad (not usable), and ugly (partly usable), but right now this is not completely done and will come alongside the flight segmentation, I guess. On the other hand, I think it is useful to know what the quality-control tests said for each sonde, also to decide whether that is useful for the specific use case at hand.

Your wishes for physical coordinates and no BAHAMAS data are not feasible together. If we want to add launch-lat, launch-lon and launch-height as coordinates, we need to know where the plane was (this is taken from the a-files of the sondes, which probably ultimately come from the BAHAMAS stream; Geet probably knows that better). Theta and q are the variables that are interpolated; RH and T are re-calculated later.
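
For illustration, the regridding step looks roughly like this (the dimension name and the 10 m example grid are simplifications, not necessarily what the code uses):

import numpy as np
import xarray as xr

ds_l2 = xr.open_dataset('some_sonde_level2.nc')  # placeholder for one level2 sonde file
alt_grid = np.arange(0, 14000 + 10, 10)          # example common vertical grid, 10 m spacing

# interpolate only theta and q; T and RH are re-derived from these afterwards
ds_l3 = ds_l2[['theta', 'q']].interp(alt=alt_grid)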

Moving the derived variables to level4 is no problem; from my understanding, the sole change between level3 and level4 should be the addition of the circle products, but that can be changed.

bjorn-stevens commented 1 week ago

Thanks, these help me understand the thinking. Some follow-up:

The main point I see is that the more clearly you can separate the data from the data products, and the less you mingle choices made in creating derived quantities with the presentation of the data itself, the better. Of course there will be edge cases; we should just try to minimize them. Anyway, based on your last comment I sense that there is agreement there.

With that in mind, did you ever consider:

level1 - sonde data (1a: aspen, 1b: cleaned aspen; what is presently called level 2)
level2 - regridded level 1b (what is presently called level 3)
level3a,b,c,d - various products (what is presently called level 4)

Identifying level3 with products might be worth considering, and there can be a variety of products: (a) spatial analysis; (b) derived variables (try to do them as correctly as possible, which means one shouldn't use metpy); (c) radiant energy transport.

As a meta point, transparency is the ultimate goal. Consistency can be a nice way to achieve it, but is nothing more.

bjorn-stevens commented 1 week ago

One other point I failed to respond to above: the first thing I do when opening a data set is try to understand what it presents. The more it presents, the harder it is to understand. If on top of that it presents lots of redundant information, it gets even more difficult to understand.