TileDB-Inc / TileDB

The Universal Storage Engine
https://tiledb.com
MIT License
1.82k stars 181 forks

Range based dense data ingestion #1835

Open aosterthun opened 3 years ago

aosterthun commented 3 years ago

Hi,

I want to ingest large datasets into TileDB that do not fit into main memory at once. I therefore write my buffers to TileDB whenever they exceed a certain memory threshold. While generating the buffers, I iterate over the whole array domain.

Let's assume I have an array with 3 dimensions and 1 attribute, where every dimension has a domain of [0,100], and both the dimensions and the attribute are of type int.

I generate the data while iterating over the array domain recursively, which gives me a matrix index for each cell of the array, e.g. [0][0][0], [0][0][1], ..., [100][100][100]. From these indices I derive the range I would like to ingest into. With a threshold of 1 GB, for example, I would get a range of [0][0][0]-[2][23][59]. Since such a range crosses dimension boundaries, it can't be a normal sub-array. Is there a way to use such an index range directly to ingest my data into TileDB?

  1. Addition: Would the add_range() member function of tiledb::Query work for this?
    query.add_range(0,0,2);
    query.add_range(1,0,23);
    query.add_range(2,0,59);

Or would that just result in the sub-array [0,2,0,23,0,59]?

ihnorton commented 3 years ago

Hi @aosterthun, you can do this as follows: set a coordinate buffer for all dimensions, with coordinate counts matching the number of data cells you are writing. Here is a quick demo: https://gist.github.com/ihnorton/00480fb03266f975478e7dfe645744a2 (creates a 3D array of 1s, then overwrites only two cells).

A few notes:

ihnorton commented 3 years ago

Hi @aosterthun -- @stavrospapadopoulos pointed out that I probably misread your question (you just need to write slices rather than individual points).

In that case, yes, using add_range for each dimension slice will work fine. You can write in row- or col-major order as long as you write a hyper-rectangle. For example, you could write [1:10][1:100][1:100] first, then [11:20][1:100][1:100], and so on. If you are careful about the space tile extent, there will be no padding.
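To illustrate the slab-by-slab approach, here is a minimal sketch in plain C++ (no TileDB calls) that plans the slices along the first dimension. The names `Slab` and `plan_slabs` are hypothetical, and the tile extent of 10 and the cell budget used below are assumptions chosen to reproduce the [1:10], [11:20], ... example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One hyper-rectangular slab: inclusive bounds on the first dimension.
// All other dimensions are assumed to span their full domain.
struct Slab {
    int64_t lo;
    int64_t hi;
};

// Split the inclusive domain [dom_lo, dom_hi] of the first dimension into
// slabs that hold at most `max_cells` cells each and whose boundaries fall
// on multiples of `tile_extent` (counted from dom_lo), so that no slab ends
// in the middle of a space tile.
// `cells_per_slice` is the number of cells in one unit-thick slice, i.e. the
// product of the lengths of all other dimensions.
std::vector<Slab> plan_slabs(int64_t dom_lo, int64_t dom_hi,
                             int64_t tile_extent, int64_t cells_per_slice,
                             int64_t max_cells) {
    assert(dom_lo <= dom_hi && tile_extent > 0 && cells_per_slice > 0);
    // Largest number of whole tiles (along the first dimension) per slab.
    int64_t tiles_per_slab = max_cells / (cells_per_slice * tile_extent);
    assert(tiles_per_slab >= 1 && "budget must hold at least one tile row");
    int64_t step = tiles_per_slab * tile_extent;

    std::vector<Slab> slabs;
    for (int64_t lo = dom_lo; lo <= dom_hi; lo += step) {
        // The last slab is clipped to the end of the domain.
        slabs.push_back({lo, std::min(lo + step - 1, dom_hi)});
    }
    return slabs;
}
```

With a domain of [1,100], a tile extent of 10, 100 x 100 cells per slice and a budget of 100000 cells, this yields the ten slabs [1,10], [11,20], ..., [91,100]. Each slab would then become one write, with add_range(0, lo, hi) on the first dimension and the full range on the others.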

stavrospapadopoulos commented 3 years ago

Some more information regarding padding: https://docs.tiledb.com/main/solutions/tiledb-embedded/internal-mechanics/writing

aosterthun commented 3 years ago

If I understand correctly, I'm required to provide the query with a sub-array in the form of a hyper-rectangle. In my case I would like to provide the query with a contiguous piece of data that is in TILEDB_GLOBAL_ORDER, but not necessarily in the form of a hyper-rectangle.

[Image: IMG_1504]

In the picture above I drew up a quick example showing a 4x3x2 array, with the cells to write to marked in green. To write the data in this range-based form, I would currently have to split the write into at least three queries: the first to write to [6-7], the second to write to [8-11], and the third to write to [12-14].
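The splitting into hyper-rectangles can be automated. Below is a minimal sketch in plain C++ (no TileDB calls), assuming a row-major cell order with the slowest-varying dimension listed first and 0-based cell numbers. `Box` and `decompose` are hypothetical names, not part of the TileDB API. Under one reading of the example (shape {2, 3, 4}, cells 6 to 14), it produces the same three pieces as above:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Inclusive [lo, hi] bounds per dimension, slowest-varying dimension first.
struct Box {
    std::vector<int64_t> lo;
    std::vector<int64_t> hi;
};

// Decompose the inclusive linear cell range [first, last] of a row-major
// array with the given shape into hyper-rectangles, in ascending cell order.
std::vector<Box> decompose(const std::vector<int64_t>& shape,
                           int64_t first, int64_t last) {
    const std::size_t n = shape.size();
    assert(n > 0);
    // stride[j] = number of cells spanned by one step along dimension j.
    std::vector<int64_t> stride(n, 1);
    for (std::size_t j = n - 1; j-- > 0;) stride[j] = stride[j + 1] * shape[j + 1];
    assert(0 <= first && first <= last && last < stride[0] * shape[0]);

    std::vector<Box> boxes;
    int64_t p = first;
    while (p <= last) {
        // Pick the slowest dimension j at which p starts a whole block
        // (all faster coordinates are 0) that still fits inside the range.
        std::size_t j = 0;
        while (j + 1 < n && (p % stride[j] != 0 || p + stride[j] - 1 > last)) ++j;
        // Extend along dimension j as far as the range and the domain allow.
        int64_t coord_j = (p / stride[j]) % shape[j];
        int64_t count = std::min((last - p + 1) / stride[j], shape[j] - coord_j);

        Box b{std::vector<int64_t>(n), std::vector<int64_t>(n)};
        for (std::size_t k = 0; k < n; ++k) {
            int64_t c = (p / stride[k]) % shape[k];
            if (k < j)       { b.lo[k] = c; b.hi[k] = c; }              // fixed
            else if (k == j) { b.lo[k] = c; b.hi[k] = c + count - 1; }  // partial
            else             { b.lo[k] = 0; b.hi[k] = shape[k] - 1; }   // full
        }
        boxes.push_back(b);
        p += count * stride[j];
    }
    return boxes;
}
```

Each resulting `Box` would correspond to one write query with one add_range call per dimension. For the 101x101x101 array from my first post, the range [0][0][0]-[2][23][59] also decomposes into three boxes: [0:1][0:100][0:100], [2][0:22][0:100] and [2][23][0:59].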

stavrospapadopoulos commented 3 years ago

Hi @aosterthun, you can achieve the above by setting the write layout to TILEDB_GLOBAL_ORDER and the subarray to the entire array domain (e.g., [1,4], [1,3], [1,2]), and then streaming the data into the array with the following operations (in pseudocode):

open array for writes
create write query
allocate buffers
set buffers to query
while you still have data to write:
    populate buffers with the next data (and update buffer size)
    submit query
finalize query (important)

In order for the above to work, your values must be given in global order. The global order layout has the benefit of streaming, but you need to be very careful to provide the values in exactly that order (i.e., populate tile by tile, following the tile layout and your space tile extents, and respect the row-/col-major cell layout within each tile).

For example, if you have a 4x4 array with a space tile extent of 2 in each dimension as follows:

 1  2 |  3  4 
 5  6 |  7  8
-------------
 9 10 | 11 12
13 14 | 15 16

and your cell and tile layout is row-major, then you need to provide the values in the buffers as follows:

1, 2, 5, 6, 3, 4, 7, 8, 9, 10, 13, 14, 11, 12, 15, 16
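The reordering can be sketched in a few lines of plain C++ (no TileDB calls). This assumes a 2D array held row-major in memory, row-major tile and cell layouts, and array dimensions that are exact multiples of the tile extents. `to_global_order` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Rearrange the cells of a rows x cols array (stored row-major in `cells`)
// into global order for space tiles of tile_rows x tile_cols, with row-major
// tile order and row-major cell order inside each tile.
std::vector<int> to_global_order(const std::vector<int>& cells,
                                 int64_t rows, int64_t cols,
                                 int64_t tile_rows, int64_t tile_cols) {
    assert(static_cast<int64_t>(cells.size()) == rows * cols);
    // Assumption: the array is an exact multiple of the tile extents.
    assert(rows % tile_rows == 0 && cols % tile_cols == 0);

    std::vector<int> out;
    out.reserve(cells.size());
    for (int64_t tr = 0; tr < rows; tr += tile_rows)            // tile rows
        for (int64_t tc = 0; tc < cols; tc += tile_cols)        // tile columns
            for (int64_t r = tr; r < tr + tile_rows; ++r)       // rows in tile
                for (int64_t c = tc; c < tc + tile_cols; ++c)   // cols in tile
                    out.push_back(cells[r * cols + c]);
    return out;
}
```

Calling it on the values 1 to 16 with rows = cols = 4 and tile extents of 2 gives the sequence listed above.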

You can also split that buffer across writes at any point, e.g.,

1, 2, 5, 6, 3, 4, 7, 8, 9, 10, 13      // first write
14, 11, 12, 15, 16                     // second write 
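As a rough sketch of that streaming loop in plain C++: the `submit` and `finalize` callbacks below are hypothetical stand-ins for "update the buffer size and submit the query" and "finalize the query" from the pseudocode, not TileDB calls, and `stream_in_chunks` is an illustrative name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Feed `values` (already in global order) to `submit` in pieces of at most
// `chunk_cells` cells, then call `finalize` exactly once.
// Both callbacks are hypothetical stand-ins for the query operations.
// Returns the number of submissions.
std::size_t stream_in_chunks(
    const std::vector<int>& values, std::size_t chunk_cells,
    const std::function<void(const int*, std::size_t)>& submit,
    const std::function<void()>& finalize) {
    assert(chunk_cells > 0);
    std::size_t submissions = 0;
    for (std::size_t off = 0; off < values.size(); off += chunk_cells) {
        // The last chunk may be shorter than chunk_cells.
        std::size_t n = std::min(chunk_cells, values.size() - off);
        submit(values.data() + off, n);  // chunk boundaries may split a tile
        ++submissions;
    }
    finalize();  // once, after the last chunk
    return submissions;
}
```

With the 16 values above and chunk_cells = 11, this makes two submissions of 11 and 5 cells, matching the two writes shown.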

I hope this helps.