TGSAI / mdio-cpp

C++, Cloud native, scalable storage engine for various types of energy data.
Apache License 2.0

Add support for dimension-based slicing #85

Open BrianMichell opened 3 months ago

BrianMichell commented 3 months ago

Problem

Datasets may contain both Coordinates and Dimension Coordinates. Currently we only support slicing datasets along the Dimension Coordinates via isel.

Solution

Implement NumPy-style sel semantics for MDIO.
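For context, the distinction between the existing positional slicing (isel) and the requested label-based slicing (sel) can be sketched as follows. This is a minimal illustration, not MDIO's actual API; the helper names `isel` and `sel` here are hypothetical, and the sketch assumes a sorted 1D dimension coordinate:

```python
import numpy as np

# "inline" is a 1D dimension coordinate; data is indexed positionally.
inline = np.array([10, 11, 12, 13, 14])   # dimension coordinate values
data = np.arange(5) * 100                 # one value per inline position

def isel(data, index):
    """Positional selection (what MDIO supports today)."""
    return data[index]

def sel(data, coord, label):
    """Label-based selection: map the label to a position, then reuse isel."""
    pos = np.searchsorted(coord, label)   # assumes coord is sorted
    if pos >= len(coord) or coord[pos] != label:
        raise KeyError(label)
    return isel(data, pos)

print(sel(data, inline, 12))  # -> 200 (the value at positional index 2)
```

The point of the sketch is that sel reduces to a coordinate-to-position lookup followed by the existing isel path, which is why 1D dimension coordinates are the easy case.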

Considerations

markspec commented 2 months ago

For querying large datasets we need to know the values in the coordinates. The minimal case would be a post-stack volume that was output with dimensions: worker_id, task_id, trace, sample.

With coordinates inline and xline, both will have dimensions worker_id, task_id, trace. In order to slice on inline/xline we first need to find the unique values and sort them. This will obviously have severe performance implications for accessing the file this way. I propose we update the standard to require a 1D coordinate holding the possible values of inline and xline. This would greatly simplify coordinate-based access.
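The cost difference being described can be sketched in NumPy. This is an illustration under assumed shapes (worker_id, task_id, trace), not MDIO code: without a stored 1D coordinate, every reader must scan and sort the full multi-dimensional coordinate array; with the proposed 1D coordinate written once at write time, the same information is a direct read.

```python
import numpy as np

rng = np.random.default_rng(0)

# inline values scattered across a (worker_id, task_id, trace) coordinate,
# as produced by a distributed write (hypothetical shapes and values).
inline_3d = rng.permutation(np.repeat(np.arange(100, 110), 6)).reshape(3, 4, 5)

# Without a 1D coordinate: every reader re-derives the unique sorted values,
# an O(N log N) scan over all traces on each access.
derived = np.unique(inline_3d)

# With the proposed standard change: the 1D coordinate is stored at write
# time, so readers just load it.
inline_1d = np.arange(100, 110)

assert np.array_equal(derived, inline_1d)
```

Both give the same answer; the proposal moves the work from every read to a single write.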

tasansal commented 2 months ago

I think we need to start simple with the .sel and .loc methods. The minimal case is slicing with 1D dimension coordinates. Xarray doesn't even support complex out-of-order slicing like this; it's up to the downstream application to figure it out. I.e., sorting and retrieving data stored as a/b/c/d based on coordinates x/y requires some involved operations even in Xarray.

For this dataset (screenshot omitted):

You have to do all this to get it right:

mask = (ds.inline == 11) | (ds.inline == 12)
ds_final = (
    ds.where(mask, drop=True)  # get our AOI
    .stack(combined_traces=["worker", "task", "trace"], create_index=False)  # combine worker/task/trace
    .dropna("combined_traces", how="all")  # drop any leftover NaN from .where
    .set_xindex(coord_names=["inline", "crossline"])  # index IL/XL coordinates
    .unstack("combined_traces")  # Go to data domain
    .transpose("inline", "crossline", "time")  # order dimensions correctly
)
ds_final

As I mentioned before, I highly recommend first implementing simple dimension coordinate indexing, and only then focusing on edge cases like this to make it possible for the end user to build a flow like the one above. The above example has the same performance as long as traces are sorted properly within each worker/task combination.
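For contrast with the multi-step flow above, the "simple dimension coordinate" case being recommended as a first step is a one-line label lookup in xarray. A minimal sketch with made-up coordinate values:

```python
import numpy as np
import xarray as xr

# A toy post-stack dataset where inline/crossline are 1D dimension
# coordinates (the easy case), not multi-dimensional coordinates.
ds = xr.Dataset(
    {"amplitude": (("inline", "crossline", "time"), np.zeros((3, 4, 5)))},
    coords={
        "inline": [10, 11, 12],
        "crossline": [100, 101, 102, 103],
        "time": np.arange(5),
    },
)

# Label-based selection is a direct lookup on the 1D coordinates; .sel
# slices are inclusive of both endpoints in xarray.
subset = ds.sel(inline=11, crossline=slice(100, 102))
assert subset["amplitude"].shape == (3, 5)  # 3 crosslines x 5 time samples
```

This is the baseline that MDIO's sel could match first; the where/stack/unstack dance is only needed when the coordinates of interest are multi-dimensional.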

markspec commented 2 months ago

@tasansal If we do not generate this data at write time, the same code will have to be implemented and run independently each time the data is accessed. This will make the implementation a lot more complex, be a performance bottleneck, and require code duplication and rewrites in the near future.

The 1D coordinate is still required to define the mask, and it does not impact the code you posted.