Open BrianMichell opened 3 months ago
For querying large datasets we need to know the values of the coordinates. The minimal case would be a post-stack volume that was output with dimensions worker_id, task_id, trace, sample and coordinates inline, xline, where inline and xline have dimensions worker_id, task_id, trace. In order to make a slice on inline/xline we first need to find the unique values and sort them. This will obviously have severe performance implications for accessing the file this way. I propose we update the standard to require 1D coordinates holding the possible values of inline and xline. This would greatly simplify access based on coordinates.
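For illustration, here is a minimal sketch of what such 1D coordinates could look like in an xarray-style dataset; the sizes, values, and the `inline_values`/`xline_values` names are made up for the example and are not part of the current standard:

```python
import numpy as np
import xarray as xr

# Tiny stand-in for the post-stack case: dims (worker_id, task_id, trace, sample)
# with trace-wise inline/xline coordinates over (worker_id, task_id, trace).
rng = np.random.default_rng(0)
data = rng.normal(size=(2, 2, 4, 8))
inline = np.array([[[10, 10, 11, 11], [12, 12, 13, 13]],
                   [[10, 10, 11, 11], [12, 12, 13, 13]]])
xline = np.array([[[1, 2, 1, 2], [1, 2, 1, 2]],
                  [[3, 4, 3, 4], [3, 4, 3, 4]]])

ds = xr.Dataset(
    {"amplitude": (("worker_id", "task_id", "trace", "sample"), data)},
    coords={
        "inline": (("worker_id", "task_id", "trace"), inline),
        "xline": (("worker_id", "task_id", "trace"), xline),
    },
)

# The proposal: the writer also stores the sorted unique values as small 1D
# coordinates, so a reader never has to scan every trace to discover them.
ds = ds.assign_coords(
    inline_values=("inline_values", np.unique(inline)),
    xline_values=("xline_values", np.unique(xline)),
)
print(ds.inline_values.values)  # [10 11 12 13]
```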
I think we need to start simple with the `.sel` and `.loc` methods. The minimal case is slicing with 1D dimension coordinates. Xarray doesn't even support complex out-of-order slicing like this, and it's up to the downstream application to figure it out; i.e. sorting and getting the data from something stored as a/b/c/d based on coordinates x/y requires some fairly involved operations even in Xarray.
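To make that minimal case concrete, here is roughly what 1D dimension coordinate slicing looks like in xarray (illustrative names and values, not MDIO's current API):

```python
import numpy as np
import xarray as xr

# A dataset whose dimensions carry 1D dimension coordinates.
ds = xr.Dataset(
    {"amplitude": (("inline", "crossline", "time"), np.zeros((3, 4, 5)))},
    coords={
        "inline": [10, 11, 12],
        "crossline": [100, 101, 102, 103],
        "time": np.arange(5) * 4.0,
    },
)

# Label-based selection is then a one-liner:
sub = ds.sel(inline=11, crossline=slice(100, 102))
# Equivalent .loc syntax on the DataArray:
sub_da = ds["amplitude"].loc[11, 100:102, :]
```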
For this dataset:
You have to do all this to get it right:
```python
mask = (ds.inline == 11) | (ds.inline == 12)

ds_final = (
    ds.where(mask, drop=True)  # get our AOI
    .stack(combined_traces=["worker", "task", "trace"], create_index=False)  # combine worker/task/trace
    .dropna("combined_traces", how="all")  # drop any leftover NaN from .where
    .set_xindex(coord_names=["inline", "crossline"])  # index IL/XL coordinates
    .unstack("combined_traces")  # go to data domain
    .transpose("inline", "crossline", "time")  # order dimensions correctly
)
ds_final
```
As I mentioned before, I highly recommend first implementing simple dimension coordinate indexing and then focusing on edge cases like this, so that an end user can put together a flow like the one above. The example above has the same performance as long as traces are sorted properly within each worker/task combination.
@tasansal If we do not generate this data during the write, the same code will have to be implemented and run independently each time the data is accessed. This will make implementing the write a lot more complex, be a performance bottleneck, require code duplication, and force a code rewrite in the near future.
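One way the writer could accumulate those values once is sketched below; the function name and the per-chunk header structure are assumptions for illustration, not MDIO code:

```python
import numpy as np

def collect_coordinate_values(trace_headers_per_chunk):
    """Accumulate sorted unique inline/xline values while chunks are written.

    trace_headers_per_chunk: iterable of (inline_array, xline_array), one pair
    per worker/task chunk as it is written.
    """
    inlines, xlines = set(), set()
    for il, xl in trace_headers_per_chunk:
        inlines.update(np.unique(il).tolist())
        xlines.update(np.unique(xl).tolist())
    return np.array(sorted(inlines)), np.array(sorted(xlines))

# The two small sorted arrays would be stored as 1D coordinates alongside the
# data, so readers can build masks and indexes without a full scan.
```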
The proposed 1D coordinate is required to define the mask, and it does not impact the code you posted.
Problem
Datasets may contain both Coordinates and Dimension Coordinates. Currently we only support slicing datasets along the Dimension Coordinates via `isel`.

Solution

Implement numpy-style `sel` semantics for MDIO.
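For reference, the distinction in xarray terms, which a hypothetical MDIO `sel` could mirror (the dataset and names below are illustrative, not the MDIO API):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"amplitude": (("inline", "crossline"), np.zeros((3, 2)))},
    coords={"inline": [10, 11, 12], "crossline": [100, 101]},
)

ds.isel(inline=1)  # positional: the second inline, whatever its label is
ds.sel(inline=11)  # label-based: the inline whose coordinate value is 11
```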
Considerations