Cloud-Drift / clouddrift

CloudDrift accelerates the use of Lagrangian data for atmospheric, oceanic, and climate sciences.
https://clouddrift.org/
MIT License
37 stars 8 forks

`velocity_from_position`: Handle ragged array #70

Closed milancurcic closed 1 year ago

milancurcic commented 1 year ago

Previous discussion in #68.

@selipot suggested that velocity_from_position should also handle ragged arrays as input. Let's discuss here what these ragged arrays look like. I.e. is the ragged array in the form of an xarray Dataset as generated by clouddrift or something else?

selipot commented 1 year ago

I would say it can be an xarray DataArray, an Awkward Array, or a NumPy array? In cases other than the nested Awkward Array, we need to think about how the function is made aware of trajectory breaks, perhaps with an extra argument such as the 'id' or 'rowsize' variable? This method will generalize to any other function we write. We can go back to the EarthCube notebook to see how this was handled.

selipot commented 1 year ago

Could our analysis functions take optional arguments specifying the underlying structure of the data? I.e. rowsize=rowsize for ragged arrays, and dim='obs' or axis=n for structured arrays such as xarray DataArrays?

milancurcic commented 1 year ago

I like the rowsize approach.

If rowsize (array-like of ints, optional, default None) is absent, the computation defaults to the N-d structured array implementation.

If rowsize is provided, require that x, y, and time are 1-d arrays and apply boundary conditions at the start and end of each segment.

The dim argument doesn't apply since our function takes array-likes (I think this is a good design choice), and we already have time_axis to specify the axis along which to differentiate.

If rowsize is provided, we can raise a warning if time_axis is also provided, and proceed with the computation (if we want to be more lax), or raise an error if we want to be more strict. Either is OK, it's a style choice.
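To make the proposal concrete, here is a minimal sketch of the rowsize dispatch, not clouddrift's actual implementation. The function name `rate_of_change`, its signature, and the central-difference derivative are assumptions chosen for illustration; it stands in for velocity_from_position and takes the more lax option of warning when both rowsize and time_axis are given.

```python
import warnings

import numpy as np


def _central_diff(x, time, axis):
    # Central differences in the interior, one-sided at the two ends.
    return np.gradient(x, axis=axis) / np.gradient(time, axis=axis)


def rate_of_change(x, time, rowsize=None, time_axis=None):
    """Hypothetical stand-in for velocity_from_position: returns dx/dt."""
    x = np.asarray(x, dtype=float)
    time = np.asarray(time, dtype=float)

    if rowsize is None:
        # Structured N-d path: differentiate along time_axis (default: last).
        axis = -1 if time_axis is None else time_axis
        return _central_diff(x, time, axis)

    # Ragged path: 1-d inputs, boundary conditions applied per segment.
    rowsize = np.asarray(rowsize, dtype=int)
    if x.ndim != 1 or time.ndim != 1:
        raise ValueError("x and time must be 1-d when rowsize is provided")
    if rowsize.sum() != x.size:
        raise ValueError("sum(rowsize) must equal the length of x")
    if time_axis is not None:
        warnings.warn("time_axis is ignored when rowsize is provided")

    out = np.empty_like(x)
    start = 0
    for n in rowsize:
        stop = start + n
        if n < 2:
            # A derivative is undefined for fewer than two points.
            out[start:stop] = np.nan
        else:
            out[start:stop] = _central_diff(x[start:stop], time[start:stop], 0)
        start = stop
    return out
```

With two trajectories of lengths 3 and 2 stored back to back, passing rowsize=[3, 2] keeps the differences from crossing the trajectory break, whereas omitting rowsize would difference the last point of the first trajectory against the first point of the second.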

milancurcic commented 1 year ago

@selipot and I discussed jLab's splitcell (IIRC) function which splits a ragged array into a list of varying-length arrays.

That's easy for us to do as well given that we have rowsize for housekeeping. This made me think of an alternative approach to handling ragged arrays in velocity_from_position and similar functions that need to be aware of trajectory bounds:

The function could accept, in addition to array-likes, lists of array-likes. If the arguments are lists of array-likes, their elements are assumed to be contiguous arrays (trajectories), and the function would recursively run itself on each element and return lists of array-likes as result.

The downside of this approach is that (I think) it would require a copy of the data in the process (ragged array DataArray -> list[DataArray]). Another downside is that it would place the opportunity to parallelize inside the function, moving the parallelization responsibility from the user to the library. The upside is that the implementation would be very easy.

Another approach is to not touch the existing function but implement a splitcell with a nice syntax and let the user do velocity_from_position(splitcell(x), splitcell(y), splitcell(time)) and similar. (Never mind, this is the same as above.)
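A rough sketch of the list-of-arrays idea above, for plain NumPy inputs only. Both `split_ragged` (a splitcell-like helper) and `rate_of_change_nested` (a toy per-trajectory function) are hypothetical names, not part of clouddrift. Note that for NumPy arrays `np.split` returns views rather than copies; whether the same holds for DataArrays is not checked here.

```python
import numpy as np


def split_ragged(x, rowsize):
    """Hypothetical splitcell-like helper: one sub-array per trajectory."""
    x = np.asarray(x)
    # Split points are the cumulative row sizes, excluding the final total.
    return np.split(x, np.cumsum(rowsize)[:-1])


def rate_of_change_nested(x, time):
    """Toy dx/dt that accepts either array-likes or lists of array-likes."""
    if isinstance(x, list):
        # Each list element is assumed to be one contiguous trajectory,
        # so recurse on the elements and return a list of results.
        return [rate_of_change_nested(xi, ti) for xi, ti in zip(x, time)]
    x = np.asarray(x, dtype=float)
    time = np.asarray(time, dtype=float)
    return np.gradient(x) / np.gradient(time)
```

Usage would then look like `rate_of_change_nested(split_ragged(x, rowsize), split_ragged(time, rowsize))`, with `np.concatenate` on the result to get back to a ragged 1-d array.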

milancurcic commented 1 year ago

This is done with apply_ragged.