TGSAI / mdio-cpp

C++, Cloud native, scalable storage engine for various types of energy data.
Apache License 2.0

Add support for dimension-based slicing #85

Open BrianMichell opened 3 months ago

BrianMichell commented 3 months ago

Problem

Datasets may contain both Coordinates and Dimension Coordinates. Currently we only support slicing datasets along the Dimension Coordinates via isel.

Solution

Implement NumPy-style sel semantics for MDIO.
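For context, the distinction between the existing positional slicing (isel) and the requested label-based slicing (sel) can be sketched as follows. This is a minimal illustration, not MDIO's actual API; the helper names `isel` and `sel` here are hypothetical, and the sketch assumes a sorted 1D dimension coordinate:

```python
import numpy as np

# "inline" is a 1D dimension coordinate; data is indexed positionally.
inline = np.array([10, 11, 12, 13, 14])   # dimension coordinate values
data = np.arange(5) * 100                 # one value per inline position

def isel(data, index):
    """Positional selection (what MDIO supports today)."""
    return data[index]

def sel(data, coord, label):
    """Label-based selection: map the label to a position, then reuse isel."""
    pos = np.searchsorted(coord, label)   # assumes coord is sorted
    if pos >= len(coord) or coord[pos] != label:
        raise KeyError(label)
    return isel(data, pos)

print(sel(data, inline, 12))  # -> 200 (the value at positional index 2)
```

The point of the sketch is that sel reduces to a coordinate-to-position lookup followed by the existing isel path, which is why 1D dimension coordinates are the easy case.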

Considerations

markspec commented 2 months ago

For querying large datasets we need to know the values in the coordinates. The minimal case would be a post-stack volume that was output with dimensions: worker_id, task_id, trace, sample.

With coordinates inline and xline, both will have dimensions worker_id, task_id, trace. In order to slice on inline/xline we first need to find the unique values and sort them. This will obviously have severe performance implications for accessing the file this way. I propose we update the standard to require a 1D coordinate holding the possible values of inline and xline. This would greatly simplify coordinate-based access.
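The cost difference being described can be sketched in NumPy. This is an illustration under assumed shapes (worker_id, task_id, trace), not MDIO code: without a stored 1D coordinate, every reader must scan and sort the full multi-dimensional coordinate array; with the proposed 1D coordinate written once at write time, the same information is a direct read.

```python
import numpy as np

rng = np.random.default_rng(0)

# inline values scattered across a (worker_id, task_id, trace) coordinate,
# as produced by a distributed write (hypothetical shapes and values).
inline_3d = rng.permutation(np.repeat(np.arange(100, 110), 6)).reshape(3, 4, 5)

# Without a 1D coordinate: every reader re-derives the unique sorted values,
# an O(N log N) scan over all traces on each access.
derived = np.unique(inline_3d)

# With the proposed standard change: the 1D coordinate is stored at write
# time, so readers just load it.
inline_1d = np.arange(100, 110)

assert np.array_equal(derived, inline_1d)
```

Both give the same answer; the proposal moves the work from every read to a single write.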

tasansal commented 2 months ago

I think we need to start simple with the .sel and .loc methods. The minimal case is slicing with 1D dimension coordinates. Xarray doesn't even support complex out-of-order slicing like this; it's up to the downstream application to figure it out. I.e., sorting and retrieving data stored as a/b/c/d based on coordinates x/y requires some involved operations even in Xarray.

For this dataset (screenshot omitted):

You have to do all this to get it right:

mask = (ds.inline == 11) | (ds.inline == 12)
ds_final = (
    ds.where(mask, drop=True)  # get our AOI
    .stack(combined_traces=["worker", "task", "trace"], create_index=False)  # combine worker/task/trace
    .dropna("combined_traces", how="all")  # drop any leftover NaN from .where
    .set_xindex(coord_names=["inline", "crossline"])  # index IL/XL coordinates
    .unstack("combined_traces")  # Go to data domain
    .transpose("inline", "crossline", "time")  # order dimensions correctly
)
ds_final

As I mentioned before, I highly recommend first implementing simple dimension coordinate indexing, and only then focusing on edge cases like this to make it possible for the end user to build a flow like the one above. The above example has the same performance as long as traces are sorted properly within each worker/task combination.
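For contrast with the multi-step flow above, the "simple dimension coordinate" case being recommended as a first step is a one-line label lookup in xarray. A minimal sketch with made-up coordinate values:

```python
import numpy as np
import xarray as xr

# A toy post-stack dataset where inline/crossline are 1D dimension
# coordinates (the easy case), not multi-dimensional coordinates.
ds = xr.Dataset(
    {"amplitude": (("inline", "crossline", "time"), np.zeros((3, 4, 5)))},
    coords={
        "inline": [10, 11, 12],
        "crossline": [100, 101, 102, 103],
        "time": np.arange(5),
    },
)

# Label-based selection is a direct lookup on the 1D coordinates; .sel
# slices are inclusive of both endpoints in xarray.
subset = ds.sel(inline=11, crossline=slice(100, 102))
assert subset["amplitude"].shape == (3, 5)  # 3 crosslines x 5 time samples
```

This is the baseline that MDIO's sel could match first; the where/stack/unstack dance is only needed when the coordinates of interest are multi-dimensional.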

markspec commented 2 months ago

@tasansal If we do not generate this data at write time, the same code will have to be implemented and run independently each time the data is accessed. This will make the implementation a lot more complex, be a performance bottleneck, and require code duplication and rewrites in the near future.

The 1D coordinate is still required to define the mask, and it does not impact the code you posted.