TileDB-Inc / TileDB-Py

Python interface to the TileDB storage engine
MIT License

Documentation about multi_index and query #347

Open michael-imbeault opened 4 years ago

michael-imbeault commented 4 years ago

I can't find any mention of multi_index or the query() method in the official docs. I've been using multi_index, but it outputs a lot more information than I need (positions in the array, then the values themselves). Is there a parameter to output just a list of results containing only the values, following the order of the slices? And what is the purpose of .query? Is there more to it than being another way to read results instead of using A[:]?

Is there any efficiency benefit to using multi_index vs. looping through intervals using query or the normal [] read mode?

ihnorton commented 4 years ago

Hi @michael-imbeault, I will be taking a pass through the API docs this week to add some missing items, as well as fix a rendering issue preventing some docstrings from displaying. We also have documentation of multi_index specifically at https://docs.tiledb.com/main/api-usage/reading-arrays/multi-range-subarrays

Here is a summary for multi_index and query:

multi_index:

.query:

Is there any efficiency benefit to using multi_index vs. looping through intervals using query or the normal [] read mode?

For large multi-ranged queries, there can be a significant benefit to using multi_index, because TileDB is designed to efficiently fulfill such a query even for a very large number of ranges (parallelizing operations across multiple threads; storing range bounding boxes for tiles to optimize retrieval; selectively decompressing tiles; and other optimizations).

There can also be an efficiency benefit to using .query if you know that some attribute results will not be needed: core TileDB will not retrieve data for those attributes at all, which reduces I/O and memory usage.

michael-imbeault commented 4 years ago

OK, that's helpful. I did find https://docs.tiledb.com/main/api-usage/reading-arrays/multi-range-subarrays but it's a little bare-bones at the moment, and there is no mention of either multi_index or query in https://tiledb-inc-tiledb-py.readthedocs-hosted.com/en/stable/python-api.html.

I'll be using multi_index. My initial expectation was that it would return a list of numpy arrays corresponding to the slices, not a dict with a single array encompassing all the slices, which I have to parse using the coordinate arrays. Are there plans to include a simple, already parsed output? The current approach makes sense for sparse arrays but seems suboptimal for dense arrays: creating those (potentially very large) coordinate arrays and keeping them in memory seems wasteful for some use cases.
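Until something like that exists, the per-slice output can be recovered without the coordinate arrays in the simple 1-D dense case. The helper below is a hypothetical sketch, not part of TileDB-Py. It assumes the flat result holds the values for each requested range back to back, in the order the ranges were given, and that the ranges are inclusive `(start, end)` pairs as in multi_index.

```python
from itertools import accumulate


def split_by_ranges(values, ranges):
    """Split a flat 1-D result into one chunk per inclusive (start, end) range.

    Hypothetical helper: assumes `values` is the concatenation of the
    per-range results, in the same order as `ranges`.
    """
    lengths = [end - start + 1 for start, end in ranges]
    offsets = [0, *accumulate(lengths)]  # chunk boundaries in `values`
    if offsets[-1] != len(values):
        raise ValueError("values length does not match the total size of ranges")
    return [values[lo:hi] for lo, hi in zip(offsets, offsets[1:])]


# Stand-in for a flat attribute result covering ranges 1..3 and 7..9.
flat = [10, 11, 12, 70, 80, 90]
chunks = split_by_ranges(flat, [(1, 3), (7, 9)])
```

Because the helper only uses `len` and slicing, it works the same way on a 1-D numpy array, where each chunk is a view rather than a copy. It does not cover multi-dimensional arrays or sparse results, where the coordinates are still needed.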