Cloud-Drift / clouddrift

CloudDrift accelerates the use of Lagrangian data for atmospheric, oceanic, and climate sciences.
https://clouddrift.org/

Range-aware subset #214

Open · milancurcic opened this issue 1 year ago

milancurcic commented 1 year ago

As discussed today with @selipot, who proposed this idea.

The current implementation of subset is cloud-optimized for criteria that have the traj dimension, for example subsetting by ID:

subset(ds, {"ID": [2578, 2582, 2583]})

However, subsetting by criteria that have the obs dimension, for example subsetting by region or time:

subset(ds, {"lat": (21, 31), "lon": (-98, -78)})

requires downloading in full every variable that appears in the criteria so that the comparison can be made locally.

If the range (min and max) of these variables were known ahead of time, subset could subset by ID under the hood, effectively performing the obs-dimension subset in a cloud-optimized way.

clouddrift could propose the following requirement for cloud-optimized ragged arrays: every numeric variable <var> with the obs dimension is accompanied by the variables <var>_min and <var>_max with the traj dimension.

If the expected range variables are not found in the dataset, subset could fall back to carrying out the comparison as in the current implementation.
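
For concreteness, here is a minimal sketch of what a ragged array following this convention could look like, and how subset might check for the range variables before falling back. The layout, the toy values, and the has_range_variables helper are illustrative assumptions, not existing clouddrift API:

```python
import numpy as np
import xarray as xr

# Toy ragged array: two trajectories of lengths 3 and 2 laid end to end along
# the obs dimension. Under the proposed convention, each obs-dimensioned numeric
# variable (here lon and lat) has traj-dimensioned <var>_min/<var>_max companions.
ds = xr.Dataset(
    {
        "ID": ("traj", np.array([2578, 2582])),
        "rowsize": ("traj", np.array([3, 2])),
        "lon": ("obs", np.array([-90.0, -85.0, -80.0, -60.0, -55.0])),
        "lat": ("obs", np.array([25.0, 26.0, 27.0, 10.0, 12.0])),
        "lon_min": ("traj", np.array([-90.0, -60.0])),
        "lon_max": ("traj", np.array([-80.0, -55.0])),
        "lat_min": ("traj", np.array([25.0, 10.0])),
        "lat_max": ("traj", np.array([27.0, 12.0])),
    }
)

def has_range_variables(ds: xr.Dataset, var: str) -> bool:
    """True if the traj-dimensioned range companions of `var` are present."""
    return f"{var}_min" in ds.variables and f"{var}_max" in ds.variables

# subset could branch on this check: use the cheap traj-dimensioned ranges when
# available, and otherwise fall back to the current obs-dimensioned comparison.
print(has_range_variables(ds, "lon"), has_range_variables(ds, "sst"))  # True False
```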

selipot commented 1 year ago

@philippemiron I am curious to hear whether you think subset could be made range-aware as explained above?

selipot commented 1 year ago

Should we create a function that generates ranges for lat, lon, time, and possibly other variables? How would we deal with longitude wrapping?
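
One possible sketch of such a generator, assuming the ragged array carries a rowsize variable giving the length of each trajectory along obs. The function name and signature are hypothetical, and longitude wrapping is deliberately not handled (a naive min/max on a trajectory crossing the antimeridian would produce a bounding box spanning nearly all longitudes):

```python
import numpy as np
import xarray as xr

def add_range_variables(ds: xr.Dataset, variables=("lon", "lat", "time")) -> xr.Dataset:
    """Sketch: attach traj-dimensioned <var>_min and <var>_max companions,
    computed per trajectory from the rowsize variable of a ragged array.
    Longitude wrapping across the antimeridian is not handled here."""
    # Indices where one trajectory ends and the next begins along obs.
    split_points = np.cumsum(ds.rowsize.values)[:-1]
    for var in variables:
        if var not in ds.variables:
            continue
        per_traj = np.split(ds[var].values, split_points)
        ds[f"{var}_min"] = ("traj", np.array([chunk.min() for chunk in per_traj]))
        ds[f"{var}_max"] = ("traj", np.array([chunk.max() for chunk in per_traj]))
    return ds

# Toy ragged array with two trajectories of lengths 3 and 2:
ds = xr.Dataset(
    {
        "rowsize": ("traj", np.array([3, 2])),
        "lon": ("obs", np.array([-90.0, -85.0, -80.0, -60.0, -55.0])),
        "lat": ("obs", np.array([25.0, 26.0, 27.0, 10.0, 12.0])),
    }
)
print(add_range_variables(ds).lon_min.values)  # [-90. -60.]
```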

philippemiron commented 1 year ago

I'm not sure how that would work, since most data are not sorted. So even if we somehow know the ranges, you would still have to download the complete ragged array for the variable. Or, quite possibly, I'm missing something.

milancurcic commented 1 year ago

Say, for example, the remote dataset has lon_min, lon_max, lat_min, and lat_max variables. You want to subset the remote dataset to get only the drifters inside the Gulf of Mexico (GoM). The internal steps would be:

  1. Read lon_min, lon_max, lat_min, and lat_max (cheap, as these are traj-dimensioned variables);
  2. Extract ids for which the longitude and latitude bounds are within the search range (cheap, as id is traj-dimensioned);
  3. Subset by id (cheap; this step is already cloud-optimized).

The key requirement for this to work is that the *_min and *_max variables must be present on the remote dataset. Do you see any issues with this logic?
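
A sketch of these three steps in code; the range variable names follow the convention proposed above, the ids_within_box helper and the toy values are hypothetical, and step 3 simply defers to the existing ID-based path:

```python
import numpy as np
import xarray as xr

def ids_within_box(ds: xr.Dataset, lon_range, lat_range) -> np.ndarray:
    """Steps 1-2: read the traj-dimensioned bounds (cheap) and keep the IDs of
    trajectories whose bounding box lies entirely inside the search range."""
    inside = (
        (ds.lon_min >= lon_range[0]) & (ds.lon_max <= lon_range[1])
        & (ds.lat_min >= lat_range[0]) & (ds.lat_max <= lat_range[1])
    )
    return ds.ID.values[inside.values]

# Toy traj-dimensioned bounds for three drifters (values are placeholders):
ds = xr.Dataset(
    {
        "ID": ("traj", np.array([2578, 2582, 2583])),
        "lon_min": ("traj", np.array([-95.0, -60.0, -120.0])),
        "lon_max": ("traj", np.array([-80.0, -55.0, -110.0])),
        "lat_min": ("traj", np.array([22.0, 10.0, 40.0])),
        "lat_max": ("traj", np.array([30.0, 12.0, 45.0])),
    }
)
print(ids_within_box(ds, (-98, -78), (21, 31)))  # [2578]
# Step 3 then reuses the existing cloud-optimized path:
#   subset(ds, {"ID": ids_within_box(ds, (-98, -78), (21, 31))})
```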

philippemiron commented 1 year ago

I posted this on Slack but I'll share it here too. If you have these two trajectories, which have very similar lon_min, lon_max, lat_min, and lat_max, one goes into the box and one doesn't... it's not super clear how we could extract everything going through a subregion from those bounding values.

[image: two trajectories with nearly identical bounding boxes, one passing through the subregion and one not]

milancurcic commented 1 year ago

Yes, the proposal wouldn't handle trajectories that are only partially inside (or partially outside) the bounding box; it would only find trajectories completely contained by it. I can see an argument that that's not actually what's needed in most cases.

philippemiron commented 11 months ago

Can we close this? I'm not sure there is a solution for this, and downloading 1-2 variables is not a big deal even for the GDP.

milancurcic commented 11 months ago

I recommend keeping unresolved issues like this open, because closing them hides the discussion and makes it more difficult for newcomers to the library to discover. We can mark it as #wontfix until we're ready to pursue it further.

But specifically for this issue, even though range attributes are not sufficient in many situations, like the one you illustrated above, they can be useful in many others.

Let's take for example the above 2 trajectories with similar bounding boxes, where one goes through the region of interest and the other doesn't. Let's also suppose that there are 300 other trajectories in the dataset (e.g. GLAD) that are completely outside of the search box. If I try to subset the dataset for this region, it will first download the full lon and lat arrays to make the query. If, however, we first query by bounding boxes, we can quickly eliminate those ~300 trajectories and reduce the search space to only 2. Now when you query the region, you only download lat/lon for those 2 trajectories. It's a drastic example, but it illustrates the usefulness of this approach.

For GLAD, of course, it's not a problem because it's a small dataset. But with the GDP, for example, I can't subset even a small region on my 8 GB laptop without the Python process being killed for running out of memory.

milancurcic commented 11 months ago

I think the issue was that this step

  2. Extract ids for which the longitude and latitude bounds are within the search range (cheap, as id is traj-dimensioned);

that I proposed above doesn't work, for the reason that @philippemiron explained. However, I think it could be modified to:

  2. Exclude ids whose longitude and latitude bounds are entirely outside of the search range (cheap)
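
In code, the modified step becomes a bounding-box intersection test rather than a containment test: a trajectory is excluded only if its bounds cannot overlap the search range, and the survivors are candidates whose lon/lat still need the point-by-point comparison (since, as illustrated above, a bounding box can overlap the region without the trajectory passing through it). As before, the variable names, helper, and toy values are assumptions for illustration:

```python
import numpy as np
import xarray as xr

def candidate_ids(ds: xr.Dataset, lon_range, lat_range) -> np.ndarray:
    """Exclude trajectories whose bounding box lies entirely outside the search
    range; the remaining IDs are candidates for the full lon/lat comparison."""
    outside = (
        (ds.lon_max < lon_range[0]) | (ds.lon_min > lon_range[1])
        | (ds.lat_max < lat_range[0]) | (ds.lat_min > lat_range[1])
    )
    return ds.ID.values[~outside.values]

# Two trajectories with near-identical bounding boxes overlapping the search box
# (both kept as candidates) and one far away (excluded):
ds = xr.Dataset(
    {
        "ID": ("traj", np.array([1, 2, 3])),
        "lon_min": ("traj", np.array([-96.0, -96.5, -60.0])),
        "lon_max": ("traj", np.array([-76.0, -76.5, -55.0])),
        "lat_min": ("traj", np.array([19.0, 19.5, 10.0])),
        "lat_max": ("traj", np.array([33.0, 33.5, 12.0])),
    }
)
print(candidate_ids(ds, (-98, -78), (21, 31)))  # [1 2]
```

Only the surviving candidates' lon/lat would then need to be downloaded for the exact comparison, which is where the savings described in the GLAD/GDP example above come from.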