selipot opened this issue 4 days ago
Hi @selipot,
I see what you want to accomplish, but I'm not sure how this could be automated, especially when dealing with `xr.Dataset`. It's not impossible, of course, but it quickly becomes complicated when adding coordinates with different dimensions.
For example, when you do:

```python
segment_size = cd.ragged.segment(ds["time"], 3600, ds["rowsize"])
```

the way the function is set up right now, it is not simple to link this new dimension of `len(segment_size) = 100235` back to the previous `traj` dimension of length 19396.
Plus, when you get to the last step, `print(np.sum(segment120_size))` gives 195462963, which is not equal to the total length of the `obs` dimension in the dataset. Again, it is not easy right now to remove the ~2M `obs` that are no longer included.
For this to work, we would need to recreate a new dataset at each step to adjust the `obs` dimension and assign the correct coordinates. As I said, it's feasible, but it would require rethinking the design of those functions, which were made to operate on numpy arrays.
As a workaround, you can always create a dataset with your new variables and then use `subset`.
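To make the missing link concrete, here is a minimal numpy sketch (toy sizes, not the GDP data, and not part of the clouddrift API) of how each new segment could be mapped back to the `traj` row it came from, using cumulative sums of the two rowsize arrays:

```python
import numpy as np

# toy ragged layout: 2 trajectories with 5 and 3 observations
rowsize = np.array([5, 3])

# suppose segmenting split them into 4 segments (3+2 obs, then 2+1 obs)
segment_size = np.array([3, 2, 2, 1])

# the cumulative start index of each segment falls inside exactly one traj row
seg_starts = np.concatenate(([0], np.cumsum(segment_size)[:-1]))
traj_bounds = np.cumsum(rowsize)
seg_to_traj = np.searchsorted(traj_bounds, seg_starts, side="right")
print(seg_to_traj)  # [0 0 1 1]
```

This only works because segmenting preserves the total observation count; after `prune` drops observations, the cumulative sums no longer line up, which is the complication described above.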
For example:
```python
import clouddrift as cd
import numpy as np
import xarray as xr

ds = xr.load_dataset("gdp-v2.01.1.zarr", engine="zarr", decode_times=False)
segment_size = cd.ragged.segment(ds["time"], 3600, ds["rowsize"])

# we now want to keep only data that are at least 5 days (5*24 = 120 hours or points) long
# the variables we want to work with are lon, lat, time, ve, vn
min_length = 120
lon, segment120_size = cd.ragged.prune(ds["lon"], segment_size, min_length)
lat, _ = cd.ragged.prune(ds["lat"], segment_size, min_length)
time, _ = cd.ragged.prune(ds["time"], segment_size, min_length)
ve, _ = cd.ragged.prune(ds["ve"], segment_size, min_length)
vn, _ = cd.ragged.prune(ds["vn"], segment_size, min_length)

print(len(segment120_size))     # 48058: the number of segments at least 120 hours long
print(np.sum(segment120_size))  # 195462963: the number of observations in those segments

# build a dataset with only the seg/obs dimensions
ds_segmented = xr.Dataset(
    {
        "lon": ("obs", lon),
        "lat": ("obs", lat),
        "ve": ("obs", ve),
        "vn": ("obs", vn),
        "segmentsize": ("seg", segment120_size),
    },
    coords={
        "time": ("obs", time),
        "segment": ("seg", np.arange(len(segment120_size))),
    },
)

# then you can use subset
cd.ragged.subset(
    ds_segmented,
    {"segmentsize": (120, np.inf)},
    row_dim_name="seg",
    id_var_name="segment",
    rowsize_var_name="segmentsize",
)
```
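Conceptually, that `subset` call keeps the segments whose `segmentsize` falls in the given range, along with the observations belonging to them. A numpy-only sketch of the same per-segment filtering (toy numbers; this is not the clouddrift implementation):

```python
import numpy as np

# toy segmented layout: 3 segments with the sizes below
segment_size = np.array([120, 50, 200])
obs_values = np.arange(segment_size.sum())  # 370 toy observations

# keep segments with at least min_length observations
min_length = 120
keep = segment_size >= min_length

# expand the per-segment mask to the obs dimension and filter both levels
obs_mask = np.repeat(keep, segment_size)
filtered_obs = obs_values[obs_mask]
filtered_sizes = segment_size[keep]
print(filtered_sizes)     # [120 200]
print(len(filtered_obs))  # 320
```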
Thanks @philippemiron. Yes, for sure, your proposed solution is what I ended up doing. But of course we lose the information from the variables with dimension `traj`, such as `location_type`, which would indicate which segments come from GPS-tracked and Argos-tracked drifters. A solution is to first subset based on the `traj` variables.
But even if the segments were in the same dataset, there is currently no link back to `traj` with how the `ragged.segment` and `ragged.prune` functions work.
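If a per-trajectory segment count were available (call it `segments_per_traj`; a hypothetical quantity, not something the current functions return), the `traj`-level variables could be carried over by repeating each value once per segment. A minimal numpy sketch with toy values:

```python
import numpy as np

# toy example: 3 trajectories that were split into 4, 1, and 2 segments
segments_per_traj = np.array([4, 1, 2])

# a traj-level variable, e.g. a per-drifter tracker type (illustrative values)
location_type = np.array(["gps", "argos", "gps"])

# broadcast it onto the new segment dimension
segment_location_type = np.repeat(location_type, segments_per_traj)
print(segment_location_type)
# ['gps' 'gps' 'gps' 'gps' 'argos' 'gps' 'gps']
```

Computing `segments_per_traj` in the first place is exactly the segment-to-trajectory link that `ragged.segment` and `ragged.prune` do not currently provide.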
I am wondering if it would not be worth extending some of the functionalities of the `ragged` module to operate not only on ragged arrays, such as xarray DataArrays, but also on xarray Datasets.

As an example, imagine we want to use the `segment` function to "split" the trajectories of a ragged xarray dataset `ds` with dimensions `traj` and `obs` and a row size variable `rowsize` of dimension `traj`. The `segment` function might be applied to a ragged array variable of `ds` of dimension `obs`, such as `ds["time"]`, and return an array which is a new rowsize variable, call it `new_rowsize`, that segments/divides the input array into new rows (more rows than previously). Then, what if we want to substitute that `new_rowsize` into the original xarray dataset `ds` and work from there? In other words, we would need to transform the entire xarray dataset to change the dimension `traj` to match `len(new_rowsize)`. This would also include splitting accordingly all the variables of dimension `traj` to map them onto the new dimension of length `len(new_rowsize)`.

Or maybe this type of functionality should be folded into `subset`? @philippemiron I would love to hear what you think.

Here is what I tried which in the end does not work: