TGSAI / mdio-cpp

C++, Cloud native, scalable storage engine for various types of energy data.
Apache License 2.0

Create "index" for coordinates #103

Open markspec opened 2 months ago

markspec commented 2 months ago

Problem

For HPC applications, it may be necessary to write Datasets with Dimension Coordinates that are not geophysically meaningful. These volumes will be accessed via Coordinates. Downstream applications would benefit greatly from extra metadata containing the unique values of those Coordinates (e.g. inline_values).

Solution

Add a 1D *_values entry to the metadata for each Coordinate that has more than one dimension. It will contain the unique values of that multi-dimensional Coordinate.
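As a rough illustration of what such a *_values entry would hold, the sketch below reduces a flattened multi-dimensional coordinate to its sorted unique values. `UniqueCoordinateValues` is a hypothetical helper, not part of the mdio-cpp API, and the `int32` element type is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an mdio-cpp function): returns the sorted unique
// values of a multi-dimensional coordinate given as a flattened buffer.
// The buffer is taken by value so the caller's data is left untouched.
std::vector<std::int32_t> UniqueCoordinateValues(std::vector<std::int32_t> flat) {
  std::sort(flat.begin(), flat.end());
  // std::unique moves duplicates to the tail; erase trims them off.
  flat.erase(std::unique(flat.begin(), flat.end()), flat.end());
  return flat;
}
```

For a 3x3 inline coordinate holding 10, 12 and 14 three times each, this would yield the three values 10, 12, 14, which is what inline_values would store.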

tasansal commented 2 months ago

Can you elaborate on this requirement? Putting coordinates with more than one dimension in JSON may cause problems when opening the Zarr files.

Is this application-specific metadata? In that case it shouldn't be part of MDIO; it should live in the MDIO application and be written via the API as a user attribute.

markspec commented 2 months ago

It is not application-specific. To access an MDIO file based on coordinates, the values for those coordinates need to be known (think of the options for inline/xline on a post-stack volume, or shot/shotline in a viewing utility). If these are not dimensions but multi-dimensional coordinates, the only option is to scan the entire coordinate and find the unique values before any downstream application can use them. I do not have a strong opinion on where such information should live, but it absolutely should be part of MDIO; otherwise coordinate-based indexing implementations will be inconsistent and slow (strictly speaking, this is not about accessing by coordinates but about providing the range of possible values to index on).

tasansal commented 2 months ago

I understand the need, but I still think core MDIO should not require this. It is very specific to downstream applications. Multi-dimensional coordinates may not necessarily be unique and/or need to be indexed. For instance, consider X/Y coordinates in 2D: it doesn't make sense to store ALL values in JSON, and we don't always want to index them. Having the values in two places will also cause synchronization concerns and may confuse users. It may make sense to create some indexing utilities for this, though. The multi-dimensional coordinates are all stored separately and they're small. They would be pretty cheap to read, cache, and index on the fly.
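The read, cache, and index on-the-fly idea could look roughly like the sketch below. `LazyCoordinateIndex` and its `Loader` callback are hypothetical names, not mdio-cpp types; the callback stands in for whatever actually reads the coordinate Variable from storage, and the `int32` element type is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical lazy index (not an mdio-cpp type). Nothing is read until the
// first query; after that, the index is served from memory.
class LazyCoordinateIndex {
 public:
  using Loader = std::function<std::vector<std::int32_t>()>;

  explicit LazyCoordinateIndex(Loader loader) : loader_(std::move(loader)) {}

  // Flat positions at which `value` occurs; empty if the value is absent.
  const std::vector<std::size_t>& Positions(std::int32_t value) {
    EnsureBuilt();
    static const std::vector<std::size_t> kEmpty;
    const auto it = index_.find(value);
    return it == index_.end() ? kEmpty : it->second;
  }

  // Sorted unique values, i.e. the options a viewer could offer.
  std::vector<std::int32_t> UniqueValues() {
    EnsureBuilt();
    std::vector<std::int32_t> out;
    out.reserve(index_.size());
    for (const auto& entry : index_) out.push_back(entry.first);
    return out;
  }

 private:
  // Calls the loader once and groups flat positions by coordinate value.
  void EnsureBuilt() {
    if (built_) return;
    const std::vector<std::int32_t> flat = loader_();
    for (std::size_t i = 0; i < flat.size(); ++i) index_[flat[i]].push_back(i);
    built_ = true;
  }

  Loader loader_;
  bool built_ = false;
  std::map<std::int32_t, std::vector<std::size_t>> index_;
};
```

The cost of the first query is one full read of the coordinate, so this sketch does not by itself address how long that read takes on large pre-stack volumes.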

markspec commented 2 months ago

A simple post-stack test indicated this will introduce a ~3s overhead on file open. For gathers/pre-stack this could start getting significant and will present an observable performance hit for visualization.

BrianMichell commented 2 months ago

Had a chat with @markspec offline. We will use a combination of Coordinate Variable SummaryStatistics and user-defined attributes to help with this bottleneck.

tasansal commented 2 months ago

I still think the application can just fetch it as needed and cache it in memory. What was the decision and what are we going to implement?

BrianMichell commented 2 months ago

The plan is to use the summary stats min and max to get ranges, and to add two custom attributes, isRegular: bool and step: int/float, to each coordinate Variable. Irregular steps are beyond the scope of the current use case.
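A minimal sketch of how a downstream application might consume these attributes, assuming they have already been read into a plain struct. `CoordinateSummary`, `ExpandRegularCoordinate`, and `IndexOfValue` are hypothetical names for illustration, not mdio-cpp types or functions, and the `1e-9` off-grid tolerance is an arbitrary choice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical holder for the metadata described above: min/max from the
// summary statistics plus the proposed isRegular and step attributes.
struct CoordinateSummary {
  double min;
  double max;
  bool is_regular;
  double step;
};

// Expands min/max/step into the explicit list of coordinate values.
// Returns std::nullopt for irregular or inconsistent metadata.
std::optional<std::vector<double>> ExpandRegularCoordinate(const CoordinateSummary& s) {
  if (!s.is_regular || s.step <= 0.0 || s.max < s.min) return std::nullopt;
  const auto count =
      static_cast<std::size_t>(std::llround((s.max - s.min) / s.step)) + 1;
  std::vector<double> values(count);
  for (std::size_t i = 0; i < count; ++i) {
    values[i] = s.min + static_cast<double>(i) * s.step;
  }
  return values;
}

// Maps a coordinate value to its position along the regular axis.
// Returns std::nullopt if the value is out of range or falls between steps.
std::optional<std::size_t> IndexOfValue(const CoordinateSummary& s, double value) {
  if (!s.is_regular || s.step <= 0.0 || value < s.min || value > s.max) {
    return std::nullopt;
  }
  const double offset = (value - s.min) / s.step;
  const double rounded = std::round(offset);
  if (std::abs(offset - rounded) > 1e-9) return std::nullopt;
  return static_cast<std::size_t>(rounded);
}
```

With min 100, max 110, and step 2, the expansion gives six values (100 through 110), and 106 maps to position 3. No scan of the coordinate array is needed in the regular case.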

The actual indexing of coordinates will be handled by #102 with user specification of load and cache policies.