Open milancurcic opened 1 year ago
@philippemiron I am curious to hear if you think `subset` could be made range-aware as explained above?
Should we create a function that generates ranges for lat, lon, time, and possibly other variables? How should we deal with longitude wrapping?
I'm not sure how that would work since most data is not sorted. So even if we know the ranges somehow, you will still have to download the complete ragged array for the variable. Or, quite possibly, I'm missing something.
Say, for example, the remote dataset has `lon_min`, `lon_max`, `lat_min`, and `lat_max` variables. You want to `subset` the remote dataset to get only the drifters inside GoM. The internal steps would be:
- Download `lon_min`, `lon_max`, `lat_min`, and `lat_max` (cheap, as these are `traj`-dimensioned variables);
- Extract `id`s for which the longitude and latitude bounds are within search range (cheap, as `id` is `traj`-dimensioned);
- Subset by `id` (cheap; this step is already cloud-optimized).

The key requirement for this to work is that the `*_min` and `*_max` variables must be present on the remote dataset. Do you see any issues with this logic?
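The three steps above can be sketched in plain Python. This is only a minimal sketch, not clouddrift's implementation: the lists stand in for the `traj`-dimensioned variables of a remote dataset, and the helper name `ids_within_box`, the sample values, and the rough Gulf of Mexico box are all assumptions made for illustration. Longitude wrapping is ignored.

```python
# Hypothetical traj-dimensioned variables: one value per trajectory.
ids = [101, 102, 103, 104]
lon_min = [-95.0, -60.0, -90.0, -99.0]
lon_max = [-85.0, -40.0, -70.0, -97.5]
lat_min = [20.0, 30.0, 22.0, 10.0]
lat_max = [28.0, 45.0, 29.0, 17.0]


def ids_within_box(ids, lon_min, lon_max, lat_min, lat_max, lon_range, lat_range):
    """Return ids of trajectories whose bounding box lies entirely inside the search box."""
    selected = []
    for i, traj_id in enumerate(ids):
        lon_ok = lon_range[0] <= lon_min[i] and lon_max[i] <= lon_range[1]
        lat_ok = lat_range[0] <= lat_min[i] and lat_max[i] <= lat_range[1]
        if lon_ok and lat_ok:
            selected.append(traj_id)
    return selected


# Illustrative search box (assumed values, roughly the Gulf of Mexico).
gom_ids = ids_within_box(
    ids, lon_min, lon_max, lat_min, lat_max, (-98.0, -80.0), (18.0, 31.0)
)
```

The resulting ids would then be handed to an ID-based subset, which is the step described above as already cloud-optimized.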
I posted this on Slack but I'll share here too. If you have those two trajectories, which have very similar `lon_min`, `lon_max`, `lat_min`, and `lat_max`, one goes in the box, one doesn't... it's not super clear how we could extract everything going through a subregion from those bounding values.
Yes, the proposal wouldn't work for either partially in or partially out of the bounding box, but only for trajectories completely contained by the bounding box. I can see an argument that that's not actually what's needed in most cases.
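The distinction between trajectories completely contained by the box and those only partially inside it can be made explicit by comparing bounding boxes. The sketch below is an illustration under assumptions: `relation_to_box` is a hypothetical helper, boxes are plain `(lon_min, lon_max, lat_min, lat_max)` tuples, and longitude wrapping is ignored.

```python
def relation_to_box(traj_bounds, search_box):
    """Compare a trajectory's bounding box with the search box.

    Both arguments are (lon_min, lon_max, lat_min, lat_max) tuples.
    Returns "contained", "overlapping", or "disjoint".
    """
    t_lon0, t_lon1, t_lat0, t_lat1 = traj_bounds
    s_lon0, s_lon1, s_lat0, s_lat1 = search_box
    if t_lon1 < s_lon0 or t_lon0 > s_lon1 or t_lat1 < s_lat0 or t_lat0 > s_lat1:
        return "disjoint"  # no observation can be inside the search box
    if s_lon0 <= t_lon0 and t_lon1 <= s_lon1 and s_lat0 <= t_lat0 and t_lat1 <= s_lat1:
        return "contained"  # every observation is inside the search box
    return "overlapping"  # undecided: the observations themselves must be checked
```

Only the "overlapping" case cannot be settled from the range variables alone, which is the situation illustrated by the two similar trajectories.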
Can we close this? Not sure there is a solution for this and downloading 1-2 variables is not a big deal even for the GDP.
I recommend keeping unresolved issues like this open because closing them hides the discussion and makes it more difficult to discover by newcomers to the library. We can mark it as #wontfix until we're ready to pursue it further.
But specifically for this issue, even though range attributes are not sufficient in many situations, like the one you illustrated above, they can be useful in many others.
Let's take for example the above 2 trajectories with similar bounding boxes, but one goes through the region of interest and the other doesn't. Let's also suppose that there are 300 other trajectories in the dataset (e.g. GLAD) that are completely outside of the bounding boxes. If I try to `subset` the dataset for this region, it will download all `lon`, `lat` arrays first to make the query. If however we first query by bounding boxes, we can quickly eliminate ~300 trajectories and reduce the search space to only 2. And now when you query the region, you only download lat/lon for the 2 trajectories. Drastic example but it serves to illustrate the usefulness of this approach.
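A two-stage query of this kind could look roughly like the sketch below. It is a toy illustration, not clouddrift code: plain lists stand in for the remote ragged array, `subset_region` and its arguments are hypothetical names, and the `inspected` counter stands in for the number of observations that would actually need to be downloaded.

```python
def subset_region(rowsize, lon, lat, bounds, box):
    """Two-stage region query over a ragged array held in plain lists.

    rowsize: observations per trajectory (traj dimension).
    lon, lat: flattened observations (obs dimension).
    bounds: per-trajectory (lon_min, lon_max, lat_min, lat_max) tuples.
    box: search region as (lon_min, lon_max, lat_min, lat_max).
    Returns (indices of trajectories with an observation in box, observations inspected).
    """
    lon0, lon1, lat0, lat1 = box
    hits, inspected, start = [], 0, 0
    for i, n in enumerate(rowsize):
        b_lon0, b_lon1, b_lat0, b_lat1 = bounds[i]
        # Stage 1: cheap bounding-box test on traj-dimensioned values only.
        overlaps = b_lon0 <= lon1 and b_lon1 >= lon0 and b_lat0 <= lat1 and b_lat1 >= lat0
        if overlaps:
            # Stage 2: only now touch the observations of this trajectory.
            inspected += n
            xs, ys = lon[start:start + n], lat[start:start + n]
            if any(lon0 <= x <= lon1 and lat0 <= y <= lat1 for x, y in zip(xs, ys)):
                hits.append(i)
        start += n
    return hits, inspected


# Trajectory 0 crosses the box, trajectory 1 has a similar bounding box but
# skirts the box, trajectory 2 is far away (all values are made up).
rowsize = [3, 2, 4]
lon = [-5, 5, 15, -5, 5, 50, 51, 52, 53]
lat = [5, 5, 5, 5, 15, 50, 51, 52, 53]
bounds = [(-5, 15, 5, 5), (-5, 5, 5, 15), (50, 53, 50, 53)]
hits, inspected = subset_region(rowsize, lon, lat, bounds, (0, 10, 0, 10))
```

In this toy case only 5 of the 9 observations are inspected, and the far-away trajectory is eliminated without its `lon` and `lat` values being read.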
For GLAD, of course, it's not a problem because it's a small dataset. But with GDP, for example, I can't subset a small region on my 8 GB laptop without the Python process being killed for running out of memory.
I think the issue was that this step
- Extract ids for which the longitude and latitude bounds are within search range (cheap, as id is traj-dimensioned);
that I proposed above doesn't work for the reason that @philippemiron explained. However, I think it could be modified to
As discussed with @selipot today who proposed this idea.
The current implementation of `subset` is cloud-optimized for criteria that have a `traj` dimension, for example, subsetting by ID. However, subsetting by criteria that have an `obs` dimension, for example, subsetting by region or time, requires downloading the entire variables that appear in the criteria to make the comparison locally.
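The difference between the two cases can be illustrated with a toy ragged array. This is a sketch under assumptions: the variable names, the values, and the helper `obs_slice` are hypothetical, and plain lists stand in for remote arrays.

```python
# Hypothetical ragged array: 3 trajectories stored back to back.
ids = [7, 8, 9]      # traj dimension
rowsize = [2, 3, 1]  # traj dimension: observations per trajectory
lon = [-90.0, -89.5, 10.0, 10.5, 11.0, -88.0]  # obs dimension


def obs_slice(rowsize, traj_index):
    """Return the (start, stop) range of observations belonging to one trajectory."""
    start = sum(rowsize[:traj_index])
    return start, start + rowsize[traj_index]


# Criterion on a traj-dimensioned variable: only ids and rowsize are read,
# then just the matching slice of lon is fetched.
start, stop = obs_slice(rowsize, ids.index(8))
lon_for_id_8 = lon[start:stop]

# Criterion on an obs-dimensioned variable: every element of lon is compared.
obs_in_range = [i for i, x in enumerate(lon) if -95.0 <= x <= -85.0]
```

The first query reads two short arrays plus one slice, while the second has to scan the whole `lon` array before anything can be discarded.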
However, if the range (min and max) of these variables were known, `subset` could subset by ID under the hood, thus effectively doing the subset by `obs` dimension in a cloud-optimized way.
`clouddrift` could propose the following requirement for cloud-optimized ragged arrays: Every numeric variable `<var>` with the `obs` dimension will be accompanied by the variables `<var_min>` and `<var_max>` with the `traj` dimension.
If the expected range variables are still not found in the dataset, `subset` could proceed to carry out the comparison as is in the current implementation.
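As a rough sketch of how such range variables might be produced and detected, assuming a ragged array held in plain lists with at least one observation per trajectory: the function names `range_variables` and `has_range_variables` are hypothetical and are not part of clouddrift.

```python
def range_variables(rowsize, obs_vars):
    """Compute per-trajectory minima and maxima for obs-dimensioned variables.

    rowsize: observations per trajectory (traj dimension), each assumed >= 1.
    obs_vars: dict mapping a variable name to its flattened obs-dimensioned values.
    Returns a dict with "<var>_min" and "<var>_max" lists of length len(rowsize).
    """
    ranges = {}
    for name, values in obs_vars.items():
        mins, maxs, start = [], [], 0
        for n in rowsize:
            chunk = values[start:start + n]
            mins.append(min(chunk))
            maxs.append(max(chunk))
            start += n
        ranges[f"{name}_min"] = mins
        ranges[f"{name}_max"] = maxs
    return ranges


def has_range_variables(dataset, name):
    """Tell whether both range variables for `name` exist, so a caller can fall back if not."""
    return f"{name}_min" in dataset and f"{name}_max" in dataset


# Made-up example: two trajectories with 2 and 3 observations.
ranges = range_variables([2, 3], {"lon": [1.0, 3.0, -2.0, 0.0, 5.0]})
```

A caller could check `has_range_variables` first and, when it returns `False`, carry out the comparison on the full variable as in the current implementation.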