TileDB-Inc / TileDB-Py

Python interface to the TileDB storage engine
MIT License

Comparison with HDF5(pytables) #346

Open graykode opened 4 years ago

graykode commented 4 years ago

I compared writing a numpy array of shape (200000, 784) using a dense array in TileDB and create_array in PyTables. However, TileDB's write I/O is significantly slower than PyTables'.

tiledb

import tiledb
import tables as tb
import numpy as np
import datetime as dt

n = 200000

dom = tiledb.Domain(tiledb.Dim(name="a", domain=(0, n-1), tile=n, dtype=np.int32),
                    tiledb.Dim(name="b", domain=(0, 784-1), tile=784, dtype=np.int32))
print(dom)

# Add a single attribute "a2", so each (i,j) cell stores one float32 value.
schema = tiledb.ArraySchema(domain=dom, 
                            sparse=False, 
                            attrs=[
                                tiledb.Attr(name="a2", dtype=np.float32)
                            ])

# Create the (empty) array on disk.
array_name = "dense_array"
tiledb.DenseArray.create(array_name, schema)

%%time
ran = np.random.rand(n, 784)
with tiledb.DenseArray(array_name, mode='w') as A:
    A[:,:] = ran

CPU times: user 3.4 s, sys: 3.09 s, total: 6.49 s
Wall time: 11.7 s

pytable

!rm -f data/array.h5

import tables as tb
import numpy as np
from os.path import expanduser

h5 = tb.open_file("data/array.h5", 'w') 

n = 200000

ear = h5.create_array(h5.root, 'sub',
                       atom=tb.Float64Atom(),
                       shape=(n, 784))

%%time
ran = np.random.rand(n, 784)

ear[:, :] = ran

CPU times: user 1.88 s, sys: 1.15 s, total: 3.03 s
Wall time: 5.66 s

I'd like to know the cause. I look forward to hearing some solutions. @ihnorton

stavrospapadopoulos commented 4 years ago

Hi @graykode, thanks for posting this issue!

Quick note: in your example above, you may want to make the following modifications:

  • use np.float64 (instead of np.float32) for the attribute "a2", so that it matches the Float64Atom you use in PyTables, and
  • move ran = np.random.rand(n, 784) out of the %%time cell, so that the random-data generation is not included in the write timing.

How TileDB performs writes

TileDB follows a general write algorithm that applies to both dense and sparse arrays, with arbitrary filters (e.g., compression, encryption, etc.) and all layouts. It works as follows:

  1. Receive the data to be written directly from the numpy internal buffer
  2. Prepare the tiles, potentially reshuffling the data if the layout to be written on disk differs from the layout inside the numpy buffers. This involves one full copy.
  3. Filter each tile, applying further chunking and running the filter pipeline. This involves another full copy.
  4. Write the filtered tiles to disk.

Everything is heavily parallelized internally with TBB, but the algorithm still needs to perform the two full copies I mention above.
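For illustration, here is a minimal sketch of a write where steps 2 and 3 actually do work: the first dimension uses a smaller tile extent, so the row-major numpy buffer has to be reorganized into tiles, and a Zstd compression filter runs over each tile before it is written. The array name dense_array_filtered, the tile extent of 10000 and the compression level are placeholders chosen only for this example.

import numpy as np
import tiledb

n = 200000

# Smaller tile extent on the first dimension: the array now consists of many
# tiles, so the numpy buffer must be reshuffled into tile order (step 2).
dom = tiledb.Domain(tiledb.Dim(name="a", domain=(0, n-1), tile=10000, dtype=np.int32),
                    tiledb.Dim(name="b", domain=(0, 784-1), tile=784, dtype=np.int32))

# A Zstd compression filter: each tile is run through the filter pipeline
# (step 3) before the filtered tiles are written to disk (step 4).
filters = tiledb.FilterList([tiledb.ZstdFilter(level=1)])

schema = tiledb.ArraySchema(domain=dom,
                            sparse=False,
                            attrs=[
                                tiledb.Attr(name="a2", dtype=np.float64, filters=filters)
                            ])

tiledb.DenseArray.create("dense_array_filtered", schema)

ran = np.random.rand(n, 784)
with tiledb.DenseArray("dense_array_filtered", mode='w') as A:
    A[:, :] = ran  # step 1: the data comes straight from the numpy buffer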

What happens in your example

Your example is a special scenario where you have a single tile and, therefore, the cells in the numpy buffer have the same layout as the one that will be written to disk. Moreover, you are not specifying any filter. Consequently, it is possible to avoid the two extra copies, which is what I assume HDF5 does. Hence the difference in performance.
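You can verify this by inspecting the schema from your snippet: each dimension's tile extent covers its whole domain (so there is exactly one tile), and the attribute carries no filter pipeline. A quick check, assuming the schema object from your code above:

# One tile per dimension: the tile extent equals the full domain extent.
print(schema.domain.dim(0).tile)       # 200000
print(schema.domain.dim(1).tile)       # 784

# No filters on the attribute.
print(len(schema.attr("a2").filters))  # 0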

The solution

We need to optimize for special scenarios like this. It is fairly easy, before executing the query, to determine whether it is a special scenario in which no cell shuffling or filtering is involved. In those cases we should write directly from the user buffers (e.g., numpy arrays) to disk, bypassing the tile preparation and filtering steps.
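To make this concrete, here is a rough Python-level sketch of what such a check could look like. It only illustrates the condition using the TileDB-Py schema API; it is not the actual core implementation, and the helper name write_can_bypass_tiling is made up:

def write_can_bypass_tiling(schema, subarray_shape):
    dom = schema.domain
    # The write must target exactly one tile, i.e. each tile extent
    # spans the full extent being written.
    single_tile = all(int(dom.dim(i).tile) == subarray_shape[i]
                      for i in range(dom.ndim))
    # No attribute may carry a filter pipeline (compression, encryption, ...).
    unfiltered = all(len(schema.attr(i).filters) == 0
                     for i in range(schema.nattr))
    # The in-memory (numpy) layout must already match the on-disk layout.
    same_layout = (schema.cell_order == "row-major" and
                   schema.tile_order == "row-major")
    return single_tile and unfiltered and same_layout

For the schema in this issue, write_can_bypass_tiling(schema, (200000, 784)) would return True, so the write could in principle stream the numpy buffer straight to disk.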

Thanks for putting this on our radar. We'll implement the optimizations and provide updates soon.

@ihnorton @joe-maley @Shelnutt2

graykode commented 4 years ago
  • np.float32

Thanks for your reply. First, I cannot create a schema dimension of float type: TileDBError: [TileDB::ArraySchema] Error: Cannot set domain; Dense arrays do not support dimension datatype 'FLOAT32 or 64'. Second, moving ran = np.random.rand(n, 784) out of the timings does not make a significant difference compared with the existing results.

I read your solution and understand it to simply mean saving the data directly from the buffer, like np.save(). Do you have example code for this scenario using TileDB?

Thanks @stavrospapadopoulos

stavrospapadopoulos commented 4 years ago

The float32 -> float64 suggestion is for the attribute a2, not the dimensions. You currently set the type to np.float32 (32-bit float), whereas in PyTables you use Float64Atom (64-bit float). In other words, you write less data in TileDB than in HDF5. Here is all you need to do:

schema = tiledb.ArraySchema(domain=dom, 
                            sparse=False, 
                            attrs=[
                                tiledb.Attr(name="a2", dtype=np.float64) # <- this is the change here
                            ])
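For reference, here is a minimal end-to-end sketch with that one change applied, using a new (placeholder) array name dense_array64 since the already-created array keeps its float32 schema; the random data is also generated outside the %%time cell so that only the write is measured:

dom = tiledb.Domain(tiledb.Dim(name="a", domain=(0, n-1), tile=n, dtype=np.int32),
                    tiledb.Dim(name="b", domain=(0, 784-1), tile=784, dtype=np.int32))

schema = tiledb.ArraySchema(domain=dom,
                            sparse=False,
                            attrs=[
                                tiledb.Attr(name="a2", dtype=np.float64)
                            ])

tiledb.DenseArray.create("dense_array64", schema)

ran = np.random.rand(n, 784)  # float64, matching Float64Atom in PyTables

%%time
with tiledb.DenseArray("dense_array64", mode='w') as A:
    A[:, :] = ran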

Regarding the solution, you do not need to do anything :-). We need to implement the solution I suggested inside TileDB core and once we merge a patch, you will just get the performance boost without changing your code.

I hope this helps.

graykode commented 4 years ago

Regarding the solution, you do not need to do anything :-). We need to implement the solution I suggested inside TileDB core and once we merge a patch, you will just get the performance boost without changing your code.

I understand what you're saying. Then I will wait for your patch. Thanks