atmdrops / pydropsonde


Circle means for variables #50

Open hgloeckner opened 1 month ago

hgloeckner commented 1 month ago

I guess it would be useful to store the circle mean data for the variables in Level 4 as well. However, if we do, it would make sense to somehow weight the sondes depending on their distance to other sondes (or something similar).

Maybe we can brainstorm a bit about how best to do that.

Geet-George commented 1 month ago

Hi Helene, I agree. The circle mean data are indeed very useful. I haven't yet looked in detail at how L4 is being adopted from JOANNE, but there it is already included, and the weighting is inherent in the regression. At every altitude we have an overdetermined system (more sondes than unknowns), so a least-squares fit gives an estimate of the mean and of the gradients along x and y. Because of how the regression works, this already accounts for the differences in distances between sondes.
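The regression described above can be sketched as follows. This is a minimal illustration on synthetic data, not the JOANNE or pydropsonde implementation: the function name `fit_circle`, the use of sonde offsets relative to the circle centre, and the test field are all assumptions made for the example.

```python
import numpy as np


def fit_circle(x, y, values):
    """Fit values ~= mean + dvdx * x + dvdy * y at a single altitude.

    x, y   : sonde offsets from the circle centre (hypothetical inputs, metres)
    values : the measured variable at each sonde; NaNs are skipped
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    values = np.asarray(values, dtype=float)
    valid = np.isfinite(values)
    # One row per sonde, three unknowns: overdetermined once there are > 3 sondes.
    design = np.column_stack([np.ones(valid.sum()), x[valid], y[valid]])
    coeffs, *_ = np.linalg.lstsq(design, values[valid], rcond=None)
    mean, dvdx, dvdy = coeffs
    return mean, dvdx, dvdy


# Synthetic circle: 12 sondes on a 100 km radius, linear temperature field.
angles = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
x = 100e3 * np.cos(angles)
y = 100e3 * np.sin(angles)
field = 290.0 + 1e-5 * x - 2e-5 * y

mean, dvdx, dvdy = fit_circle(x, y, field)
```

Because the sonde positions enter the design matrix directly, unevenly spaced sondes are handled by the fit itself, with no separate distance weighting step.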

hgloeckner commented 1 month ago

Ah nice. We don't have a real level 4 yet; I just do some calculations (code is here), but I have not done the regression yet.

Also, in this I changed the structure a bit, because to me it does not really make sense to change the dimensions for everything; i.e. right now, level4 is just an extension of level3 with circle data added (and two new dimensions for the variables in circles being flight_id and position; which probably should be changed to circle_name). So currently there is ta[sonde_id, alt] as well as omega[flight_id, position, alt] in the same dataset. There are also two added variables, flight[sonde_id] and c_name[sonde_id], which record, for each sonde, which flight and circle it belongs to.

To me that made sense, because sonde_id is still a unique identifier and a circle should definitely be defined by the circle name and flight_id. All other information can be derived as well (and the regressed mean variables would also get dimensions flight_id and position).

ninarobbins commented 1 month ago

@hgloeckner the way I would do it is to have only (circle, alt) as dimensions in the level 4 data, since one still has access to the level3 which contains the sonde information, and the flight segmentation can be used to find out which sondes belong to which circle. That way the level 4 contains only the circle products, and if someone wants the sonde-by-sonde information they would have a look at the level 3.

Geet-George commented 1 month ago

> it does not really make sense to change the dimensions for everything; i.e. right now, level4 is just an extension of level3 with circle data added (and two new dimensions for the variables in circles being flight_id and position; which probably should be changed to circle_name).

The idea behind L4 is that it contains only area-averaged properties (and relevant ancillary variables). Therefore, the dimension would have to change from sonde_id to circle (or something else). In fact, the dimensions always change going from L2 to L3 to L4 (see Tables 7, 8 and 9 in JOANNE): for L2 it is the original independent dimension, i.e. time; for the gridded L3 it is sonde_id and alt; and for L4 it is circle and alt. I would also try to keep the dimension names the same as in EUREC4A unless there is really a need to change them. It is best to keep datasets as consistent as possible.

Geet-George commented 1 month ago

> @hgloeckner the way I would do it is to have only (circle, alt) as dimensions in the level 4 data, since one still has access to the level3 which contains the sonde information, and the flight segmentation can be used to find out which sondes belong to which circle. That way the level 4 contains only the circle products, and if someone wants the sonde-by-sonde information they would have a look at the level 3.

I agree with this. In fact, one can simplify it even further by adding a variable in L4 itself, so that the user does not have to go back to the flight segmentation files. This is what we did in JOANNE. Quoting from the JOANNE paper: "The list of sonde IDs included in every circle is included as a variable along dimension sonde_id, making it easier to retrieve data [from L3] for the individual soundings in the circle."

hgloeckner commented 1 month ago

Hm, I am not so sure. A circle is uniquely defined by its flight id and name. Merging those into one dimension seems a bit odd for this campaign, as one major feature is having those center, south and north circles in many flights.

Also, from talking with Julia it is a common use case to select all individual sondes for a circle after looking at the mean. I also don't really see a disadvantage in having the level 3 data in the same dataset (plus a flight_id and c_name variable, which could be combined into a single circle-id). The dataset is small enough that it would be feasible.

Geet-George commented 1 month ago

I am not completely sure I understand what merging the circles to one dimension means. All the center, south and north circles will be part of Level-4. Each circle will be a coordinate in the circle dimension. Therefore every circle will have its area-averaged (by regression) values with its coordinate being the circle-id. Could you outline the structure of how you envision Level-4?

Geet-George commented 1 month ago

> Also, from talking with Julia it is a common use case to select all individual sondes for a circle after looking at the mean.

Yes, of course, that is exactly why having a variable that just lists all sonde_ids in a circle is useful, so that it is easy to access from L3, with something like:

# look up the sonde IDs stored in L4 for one circle
sondes_from_my_circle = l4.sel(circle=my_circle_id).sonde_id
# then pull those sondes from L3
l3.sel(sonde_id=sondes_from_my_circle)

One could also think of having the L3 data in L4 itself, but that is just redundant data and creates potential points of error during development. I don't see the use of duplicating L3 in L4.
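A slightly fuller sketch of that lookup, with L3 and L4 kept as separate datasets. Everything here is toy data: the variable name `sondes_in_circle`, the dimension `sonde_slot`, the circle and sonde IDs, and the helper `sondes_for_circle` are hypothetical and are not taken from JOANNE or pydropsonde.

```python
import numpy as np
import xarray as xr

alt = np.arange(0, 50, 10)  # toy altitude grid

# Level 3: one profile per sonde.
l3 = xr.Dataset(
    {"ta": (("sonde_id", "alt"), 290.0 + np.arange(4.0)[:, None] + np.zeros((4, alt.size)))},
    coords={"sonde_id": ["s1", "s2", "s3", "s4"], "alt": alt},
)

# Level 4: one profile per circle, plus a variable listing each circle's sondes.
# Unused slots would be padded with "" if circles had different sonde counts.
l4 = xr.Dataset(
    {
        "ta_mean": (("circle", "alt"), np.full((2, alt.size), 291.0)),
        "sondes_in_circle": (("circle", "sonde_slot"), [["s1", "s2"], ["s3", "s4"]]),
    },
    coords={"circle": ["f1_south", "f1_north"], "alt": alt},
)


def sondes_for_circle(l3, l4, circle_id):
    """Return the L3 profiles of all sondes listed in L4 for one circle."""
    ids = l4["sondes_in_circle"].sel(circle=circle_id).values
    ids = [str(s) for s in ids if s != ""]  # drop padding entries
    return l3.sel(sonde_id=ids)


profiles = sondes_for_circle(l3, l4, "f1_south")
```

With this layout L4 stays purely (circle, alt) for the products, and the membership variable is the only link a user needs to get back to the individual soundings.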

hgloeckner commented 1 month ago

> > Also, from talking with Julia it is a common use case to select all individual sondes for a circle after looking at the mean.
>
> Yes, of course, that is exactly why having a variable that just lists all sonde_ids in a circle is useful, so that it is easy to access from L3, with something like:
>
> sondes_from_my_circle = l4.sel(circle=my_circle_id).sonde_id
> l3.sel(sonde_id=sondes_from_my_circle)
>
> One could also think of having the L3 data in L4 itself, but that just is redundancy of data and creates potential points of error during development. I don't see the use of duplicating L3 in L4.

I think from a user's perspective it's nice to have all data together, especially the circle products and individual sondes. And using ds = ds.where((ds.c_name==<name>) & (ds.flight==<flight>)) is an easier way to select the sondes for a circle than opening another dataset and combining them (it also enables people to get all sondes for the circles labeled 'south', which might not have been a use case in EUREC4A). We sometimes forget that other people are not deeply into the data structure and what happens where. Judging from the questions here, people open level4, look at the circle products and wonder what the individual sondes look like.

I would weigh the benefits of easier usability higher than the risk of errors (especially since we are only adding circle data and not meddling with the original sonde data between lev3 and lev4). The redundancy would be a concern if our circle+sonde data were bigger than 100MB, but as it is I think this duplication is not too bad.
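For comparison, a minimal sketch of selection in a combined dataset that holds both sonde and circle variables. The data and the helper `select_sondes` are hypothetical, and a single `circle` dimension is used for the circle products only to keep the example short. The sketch indexes along sonde_id with a boolean mask instead of calling `where`, so it does not rely on how `where` treats variables that lack the sonde_id dimension.

```python
import numpy as np
import xarray as xr

alt = np.arange(0, 30, 10)  # toy altitude grid

ds = xr.Dataset(
    {
        "ta": (("sonde_id", "alt"), np.arange(12.0).reshape(4, 3)),
        "flight": ("sonde_id", ["f1", "f1", "f1", "f2"]),
        "c_name": ("sonde_id", ["south", "south", "north", "south"]),
        "omega": (("circle", "alt"), np.zeros((3, 3))),
    },
    coords={
        "sonde_id": ["s1", "s2", "s3", "s4"],
        "alt": alt,
        "circle": ["f1_south", "f1_north", "f2_south"],
    },
)


def select_sondes(ds, c_name, flight=None):
    """Keep only sondes of the given circle name (and optionally flight)."""
    mask = ds["c_name"] == c_name
    if flight is not None:
        mask = mask & (ds["flight"] == flight)
    # Integer indexing along sonde_id leaves circle variables untouched.
    return ds.isel(sonde_id=np.flatnonzero(mask.values))


one_circle = select_sondes(ds, "south", flight="f1")
all_south = select_sondes(ds, "south")
```

The second call covers the "all circles labeled 'south'" use case across flights in one line.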

hgloeckner commented 1 month ago

> I am not completely sure I understand what merging the circles to one dimension means. All the center, south and north circles will be part of Level-4. Each circle will be a coordinate in the circle dimension. Therefore every circle will have its area-averaged (by regression) values with its coordinate being the circle-id. Could you outline the structure of how you envision Level-4?

In my first lev4 I had two circle dimensions (flight_id and position), but you are right: just having a circle_id, and maybe having the other two as variables, is better.
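A minimal sketch of that layout, assuming toy IDs and storing flight_id and position as non-index coordinates along circle_id (one possible reading of "as variables"; plain data variables would work the same way for the selection shown):

```python
import numpy as np
import xarray as xr

alt = np.arange(0, 30, 10)  # toy altitude grid

l4 = xr.Dataset(
    {"omega": (("circle_id", "alt"), np.arange(9.0).reshape(3, 3))},
    coords={
        "circle_id": ["f1_south", "f1_north", "f2_south"],
        "alt": alt,
        # Descriptive labels carried along the single circle dimension.
        "flight_id": ("circle_id", ["f1", "f1", "f2"]),
        "position": ("circle_id", ["south", "north", "south"]),
    },
)

# One circle by its unique ID.
one = l4.sel(circle_id="f1_north")

# All 'south' circles across flights, via a boolean mask on the label.
is_south = (l4["position"] == "south").values
south = l4.isel(circle_id=np.flatnonzero(is_south))
```

A single circle_id keeps the products two-dimensional, while the labels still allow grouping by flight or by position.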