Cloud-Drift / clouddrift

CloudDrift accelerates the use of Lagrangian data for atmospheric, oceanic, and climate sciences.
https://clouddrift.org/
MIT License

⭐ (feature) ragged operation on ragged xarray *datasets* #527

Open · selipot opened this issue 4 days ago

selipot commented 4 days ago

I am wondering whether it would be worth extending some of the functionality of the ragged module to operate not only on ragged arrays, such as xarray DataArrays, but also on xarray Datasets.

As an example, imagine we want to use the segment function to "split" the trajectories of a ragged xarray dataset ds with dimensions traj and obs and a row size variable rowsize of dimension traj. The segment function can be applied to a ragged array variable of ds with dimension obs, such as ds["time"], and returns a new row size variable, call it new_rowsize, that segments/divides the input array into new rows (more rows than before). Then, what if we want to substitute that new_rowsize into the original dataset ds and work from there? In other words, we would need to transform the entire dataset so that the dimension traj matches len(new_rowsize), which includes splitting all the variables of dimension traj accordingly so that they map onto the new dimension of length len(new_rowsize).
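For concreteness, here is a rough sketch of the kind of transformation I have in mind, assuming each new segment simply inherits the traj-dimension values (metadata, IDs, etc.) of the trajectory it starts in; expand_traj_vars and seg_to_traj are hypothetical names, not existing clouddrift functionality:

import numpy as np
import xarray as xr

def expand_traj_vars(ds: xr.Dataset, new_rowsize: np.ndarray) -> xr.Dataset:
    # starting offset (in obs) of each new segment
    seg_start = np.insert(np.cumsum(new_rowsize)[:-1], 0, 0)
    # ending offset (exclusive, in obs) of each original trajectory
    traj_end = np.cumsum(ds["rowsize"].values)
    # index of the trajectory each new segment starts in
    seg_to_traj = np.searchsorted(traj_end, seg_start, side="right")
    # repeat every traj-dimension variable so that traj matches len(new_rowsize);
    # obs-dimension variables are untouched since segmenting preserves total obs
    ds_new = ds.isel(traj=seg_to_traj)
    ds_new["rowsize"] = ("traj", np.asarray(new_rowsize))
    return ds_new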

Or maybe this type of functionality should be folded into subset? @philippemiron I would love to hear what you think.

Here is what I tried, which in the end does not work:

import clouddrift as cd
import numpy as np
import xarray as xr

# open the global drifter hourly dataset locally; alternatively it can be opened from the cloud:
# ds = cd.datasets.gdp1h()
# probably faster not to decode times
ds = xr.open_dataset("/Users/selipot/Data/awss3/latest/gdp-v2.01.1.zarr", engine="zarr", decode_times=False)
print(np.sum(ds["rowsize"]))  # 197214787

# %% need to segment first; no gaps in time larger than 3600 seconds
segment_size = cd.ragged.segment(ds["time"], 3600, ds["rowsize"])
print(len(segment_size)) # 100235
print(np.sum(segment_size)) # 197214787 # this keeps the same number of obs

# %% we now want to keep only segments that are at least 5 days (5*24 = 120 hours, i.e. 120 points) long
# variables we want to work with are lon, lat, time, ve, vn
min_length = 120
lon, segment120_size = cd.ragged.prune(ds["lon"], segment_size, min_length)

print(len(segment120_size)) # 48058 this is the number of segments that are at least 120 hours long
print(np.sum(segment120_size)) # 195462963 this is the number of observations in the segments that are at least 120 hours long

# %% now trying to subset the data to only keep the segments that are at least 120 hours long
# add to ds a new variable segmentsize along a new dimension segment of length len(segment120_size)
ds = ds.assign_coords(segment=("segment", np.arange(len(segment120_size))))
ds["segmentsize"] = (("segment",), segment120_size)

print(ds)

# %%
ds2 = cd.ragged.subset(ds, {"segmentsize": (120, np.inf)}, row_dim_name="segment", id_var_name="id")
ds2
philippemiron commented 1 day ago

Hi @selipot ,

I see what you want to accomplish, but I'm not sure how this could be automated, especially when dealing with an xr.Dataset. Of course, it's not impossible, but it quickly becomes complicated when adding coordinates with different dimensions.

For example, when you do:

segment_size = cd.ragged.segment(ds["time"], 3600, ds["rowsize"])

the way the function is set up right now, it is not simple to link this new dimension of len(segment_size) = 100235 with the previous dimension traj = 19396.
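To illustrate the bookkeeping that would be required, here is a toy example in plain numpy; nothing in the current API tracks this mapping for you:

import numpy as np

rowsize = np.array([5, 3])               # 2 trajectories
segment_size = np.array([2, 3, 1, 2])    # 4 segments after splitting
# compare each segment's starting offset against trajectory ending offsets
seg_start = np.insert(np.cumsum(segment_size)[:-1], 0, 0)  # [0 2 5 6]
seg_to_traj = np.searchsorted(np.cumsum(rowsize), seg_start, side="right")
print(seg_to_traj)  # [0 0 1 1]: segments 0-1 come from traj 0, segments 2-3 from traj 1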

Plus, when you get to the last step, np.sum(segment120_size), which is 195462963, is no longer equal to the total number of obs in the dataset (197214787 - 195462963 = 1751824 obs were pruned). Again, it is not easy right now to remove those ~1.8M obs that are no longer included.

For this to work, we would need to recreate a new dataset at each step to adjust the dimension of obs and assign the correct coordinates. As I said, it's feasible, but it would require rethinking the design of those functions, which were made to operate on numpy arrays.

As a workaround, you can always create datasets with your new variables and then use subset.

For example:

import clouddrift as cd
import numpy as np
import xarray as xr

ds = xr.load_dataset("gdp-v2.01.1.zarr", engine="zarr", decode_times=False)
segment_size = cd.ragged.segment(ds["time"], 3600, ds["rowsize"])

# we now want to keep only segments that are at least 5 days (5*24 = 120 hours, i.e. 120 points) long
# variables we want to work with are lon, lat, time, ve, vn
min_length = 120
lon, segment120_size = cd.ragged.prune(ds["lon"], segment_size, min_length)
lat, _ = cd.ragged.prune(ds["lat"], segment_size, min_length)
time, _ = cd.ragged.prune(ds["time"], segment_size, min_length)
ve, _ = cd.ragged.prune(ds["ve"], segment_size, min_length)
vn, _ = cd.ragged.prune(ds["vn"], segment_size, min_length)

print(len(segment120_size)) # 48058 this is the number of segments that are at least 120 hours long
print(np.sum(segment120_size)) # 195462963 this is the number of observations in the segments that are at least 120 hours long

# only seg/obs dimensions
ds_segmented = xr.Dataset(
    {
        "lon": ("obs", lon),
        "lat": ("obs", lon),
        "ve": ("obs", ve),
        "vn": ("obs", vn),
        "segmentsize": ("seg", segment120_size),
    },
    coords={
        "time": ("obs", time), 
        "segment": ("seg", np.arange(len(segment120_size)))
    },
)

# then you can use subset; note that prune already removed segments shorter than
# min_length, so this particular criterion should not drop any additional rows
ds_sub = cd.ragged.subset(
    ds_segmented,
    {"segmentsize": (120, np.inf)},
    row_dim_name="seg",
    id_var_name="segment",
    rowsize_var_name="segmentsize",
)
selipot commented 8 hours ago

Thanks @philippemiron. Yes, for sure, your proposed solution is what I ended up doing. But of course we then lose the information from the variables with dimension traj, such as location_type, which would indicate which segments come from GPS-tracked versus Argos-tracked drifters. The solution is to first subset based on the traj variables, as sketched below.
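For example, something along these lines (assuming, as in the clouddrift documentation, that location_type is True for GPS-tracked drifters):

# subset to GPS-tracked drifters first, while the traj dimension still exists
ds_gps = cd.ragged.subset(ds, {"location_type": True}, row_dim_name="traj")
# then segment and prune as before, knowing every segment is GPS-tracked
segment_size = cd.ragged.segment(ds_gps["time"], 3600, ds_gps["rowsize"])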

philippemiron commented 3 hours ago

But even if the segments were in the same dataset, there would be no link back to traj right now, given how the ragged.segment and ragged.prune functions work.