Open itsgifnotjiff opened 8 months ago
Hi @itsgifnotjiff, thanks for your feature request! I'm not familiar with TileDB, @mathleur will check out the docs and see if its doable.
We already have an xarray backend for polytope, is it possible to use xarray as a layer to read from a TileDB? I've seen this https://github.com/TileDB-Inc/TileDB-CF-Py but no experience with it.
Thank you for your work and consideration Mr. @jameshawkes.
Yes they have integrated with xarray by specifying xr.open_dataset(uri, engine="tiledb")
. But I am really curious to see if the query can not be pushed down to their C++ backend itself.
I think Mrs. Julia Dark @jp-dark , Mr. Isaiah Norton @ihnorton or Mr. Nick Kules can confirm if it is doable.
Hi @itsgifnotjiff ,
@mathleur has made a minimal example showing use of tileDB though the xarray engine: https://gist.github.com/mathleur/7314e497d4fdd0224a3a3a8d92ba58d2
Would be good to understand if there are disadvantages to this xarray engine.
Mr. @jameshawkes and Mrs. @mathleur , this is very promissing and I will try and run some benchmarks on a 4Tb NWP Model Outputs TileDB Array. Is there any special querries you would suggest to stress test Polytope?
I would start simple with some point-profiles (e.g. time-series, vertical profiles) and some area extractions (e.g. boxes, polygons). We know that the xarray backend is not highly optimised, so I expect we will see some scaling issues at first, but we are actively working on this.
Hi @itsgifnotjiff and @jameshawkes, happy to help answer any questions regarding TileDB. You may see some decreases in performance compared to accessing TileDB directly since there are a few more steps going through the xarray backend but it should be completely doable workflow wise. The example from @mathleur looks good.
Yordan if you're using some of the datasets you've already ingested into TileDB it would just be using everything after opening the dataset in xarray (line 44 onward in the example). If you do notice that performance is outside what you'd find acceptable we're happy to help investigate to figure out where the bottleneck might be.
@jameshawkes and @mathleur sorry to bother you but i have a TileDB Array with a couple of TB of data in RottatedPole that looks like the image below and I am trying to construct some querries but understandably Polytope documentation does not give a lot of examples of how to build different queries.
If it is not too much to ask can you help me build a
and any other query you think would be pertinent to test please?
or even better let's use this cube
slicer = HullSlicer()
polytope_API = Polytope(datacube=ds.TT, engine=slicer)
canada_bbox = [-141, -50, 39, 83],
box1 = Box(["lon", "lat"], [-141, -50], [39, 83])
request = Request(box1)
result = polytope_API.retrieve(request)
print([leaf.result for leaf in result.leaves])
but I get
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[8], line 2
1 slicer = HullSlicer()
----> 2 polytope_API = Polytope(datacube=ds.TT, engine=slicer)
4 canada_bbox = [-141, -50, 39, 83],
6 box1 = Box(["lon", "lat"], [-141, -50], [39, 83])
File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/polytope.py:49, in Polytope.__init__(self, datacube, engine, axis_options, datacube_options)
46 if datacube_options is None:
47 datacube_options = {}
---> 49 self.datacube = Datacube.create(datacube, axis_options)
50 self.engine = engine if engine is not None else Engine.default()
File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/datacube/backends/datacube.py:156, in Datacube.create(datacube, axis_options, datacube_options)
153 if isinstance(datacube, (xr.core.dataarray.DataArray, xr.core.dataset.Dataset)):
154 from .xarray import XArrayDatacube
--> 156 xadatacube = XArrayDatacube(datacube, axis_options, datacube_options)
157 return xadatacube
158 else:
File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/datacube/backends/xarray.py:37, in XArrayDatacube.__init__(self, dataarray, axis_options, datacube_options)
35 if self.dataarray[name].dims == ():
36 options = axis_options.get(name, None)
---> 37 self._check_and_add_axes(options, name, values)
38 treated_axes.append(name)
39 for name in dataarray.dims:
File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/datacube/backends/datacube.py:76, in Datacube._check_and_add_axes(self, options, name, values)
74 DatacubeAxis.create_standard(name, values, self)
75 elif name not in self._axes.keys():
---> 76 DatacubeAxis.create_standard(name, values, self)
File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/datacube/datacube_axis.py:108, in DatacubeAxis.create_standard(name, values, datacube)
105 @staticmethod
106 def create_standard(name, values, datacube):
107 values = np.array(values)
--> 108 DatacubeAxis.check_axis_type(name, values)
109 if datacube._axes is None:
110 datacube._axes = {name: deepcopy(_type_to_axis_lookup[values.dtype.type])}
File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/datacube/datacube_axis.py:120, in DatacubeAxis.check_axis_type(name, values)
116 @staticmethod
117 def check_axis_type(name, values):
118 # NOTE: The values here need to be a numpy array which has a dtype attribute
119 if values.dtype.type not in _type_to_axis_lookup:
--> 120 raise ValueError(f"Could not create a mapper for index type {values.dtype.type} for axis {name}")
ValueError: Could not create a mapper for index type <class 'numpy.float32'> for axis pres
Hi @itsgifnotjiff, the main point for Polytope shapes is to define shapes on all dimensions of the datacube. In your second datacube, this would be forecast, lat and lon. So to define:
a flight path: Request( Path([“lat”, “lon”], Box[“lat”, “lon”], [0,0], [1,1]), flight_points), Select(“forecast”, [0]))
(Note that this will be static in time and altitude because of the provided datacube, but when there are time and altitude dimensions available, a flight path should be a 4D path, which is a sweep of a 4D box along 4D flight points)
Canada: Request(Polygon([“lat”, “lon”], canada_points), Select(“forecast”, [0]))
a bounding box: Request(Box([“lat”, “lon”], [0,0], [1,1]), Select(“forecast”, [0]))
a point: Request(Point([“lat”, “lon”], [[0,0]], method=“surrounding”), Select(“forecast”, [0))
(Note that this will take all the surrounding points to the requested point available on the datacube)
a vertical profile: This is difficult to achieve with the provided datacube, but it would probably be a Point
in the lat/lon dimensions “augmented” by a Select
in the “forecast” dimension (like in all the other examples above) and a Span
in an altitude dimension.
Note also that the datacube needs to be defined as an Xarray DataArray instead of a Dataset currently.
Finally, to solve the shown ValueError, we've updated the develop branch, so it should now work! For any similar ValueErrors, each type of datacube dimension value needs to have an associated DatacubeAxis defined in the _type_to_axis_lookup
dictionary (at the bottom of the /polytope/datacube/datacube_axis.py file). In this case, for example, the type np.float32 had be added as a FloatDatacubeAxis().
I hope this helps but I would be happy to answer more questions and will update the documentation to make these steps clearer!
Is your feature request related to a problem? Please describe.
I wonder if we can use something else as the data source Polytope queries. Zarrs, Kerchunked files are fine, but TileDB seems to be a pretty strong alternative.
Describe the solution you'd like
Ideally I would be able to query TileDB Arrays using Polytope's API.
Describe alternatives you've considered
I know you can use GDAl, Rasterio, Fiona, PDAL, Geopandas, Xarray and others to query data extracted from a TileDB Array, but I would love to be able to query the Array itself using Polytope.
Additional context
The GRIB Jump base is awesome but almost every Meteorological Center around the world has their own format that they need to either cast to GRIB or implement the logic you've implemented. Also from what I understand Polytope works on the GRIBs but TileDB can host Raster, Tabular, Vector, Imagery and Observational dense and sparse data in the same Array so I imagine being able to ask for the Model Outputs Raster, Warning Zones Vector, Station and Satelite Observational and Satelite Imagery data within a 4D Polygon (Flight path) and get all relevant information.
Organisation
Environment and Climate Change Canada