TileDB-Inc / TileDB-Py

Python interface to the TileDB storage engine
MIT License
190 stars 33 forks source link

Indexing bugs #594

Open Hoeze opened 3 years ago

Hoeze commented 3 years ago

Hi @ihnorton, sorry to bother once more, but I think I found a couple of bugs in conjunction with indexing.

Setup:

```python # + import json import tiledb import numpy as np import pandas as pd import random # - import json test_df = pd.DataFrame.from_records(json.loads('{"chrom":{"0":"chr1","1":"chr1","2":"chr2","3":"chr2","4":"chr1","5":"chr1","8":"chr1","9":"chr1"},"log10_len":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":0,"9":0},"start":{"0":10108,"1":10108,"2":10108,"3":10108,"4":10108,"5":10108,"8":10143,"9":10143},"end":{"0":10114,"1":10114,"2":10114,"3":10114,"4":10114,"5":10114,"8":10144,"9":10144},"ref":{"0":"AACCCT","1":"AACCCT","2":"AACCCT","3":"AACCCT","4":"AACCCT","5":"AACCCT","8":"T","9":"T"},"alt":{"0":"A","1":"A","2":"A","3":"A","4":"A","5":"A","8":"C","9":"C"},"sample_id":{"0":"A","1":"B","2":"C","3":"D","4":"E","5":"F","8":"A","9":"B"},"GT":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":1,"9":1},"GQ":{"0":79,"2":60,"3":99,"4":26,"5":62,"8":22,"9":65},"DP":{"0":12,"1":9,"2":39,"3":26,"4":9,"5":9,"8":35,"9":34}}')) test_df output_path="test.tdb" ctx = tiledb.default_ctx() ctx # + genotype_domain = tiledb.Domain( tiledb.Dim(name="chrom", domain=(None,None), tile=1, dtype=np.bytes_, ctx=ctx), tiledb.Dim(name="log10_len", domain=(0, np.iinfo(np.int8).max), tile=1, dtype=np.int8, ctx=ctx), tiledb.Dim(name="start", domain=(0, np.iinfo(np.int32).max), tile=100000, dtype=np.int32, ctx=ctx), tiledb.Dim(name="alt", domain=(None,None), tile=None, dtype=np.bytes_, ctx=ctx), # tiledb.Dim(name="end", domain=(1, np.iinfo(np.int32).max), dtype=np.int32, ctx=ctx), tiledb.Dim(name="sample_id", domain=(None,None), tile=None, dtype=np.bytes_, ctx=ctx), ctx=ctx, ) string_filters = tiledb.FilterList([tiledb.ZstdFilter(level=-1),]) int_filters = tiledb.FilterList([tiledb.ByteShuffleFilter(), tiledb.ZstdFilter(level=-1),]) attrs = [ tiledb.Attr(name='end', dtype='int32', var=False, nullable=False, filters=int_filters), tiledb.Attr(name='ref', dtype='S', nullable=False, filters=string_filters), tiledb.Attr(name='GT', dtype='int8', var=False, nullable=False, filters=int_filters), tiledb.Attr(name='GQ', dtype='int32', var=False, nullable=True, filters=int_filters), tiledb.Attr(name='DP', dtype='int32', var=False, nullable=True, filters=int_filters), ] # - schema = tiledb.ArraySchema( domain=genotype_domain, attrs=attrs, sparse=True, cell_order="hilbert", # capacity=10000, ctx=ctx, ) schema if tiledb.array_exists(output_path): print("Deleting array at '%s'..." % output_path) import shutil shutil.rmtree(output_path) print("Creating array at '%s'..." % output_path) tiledb.array.SparseArray.create(output_path, schema, ctx=ctx) tiledb.from_dataframe( output_path, test_df.astype({ 'end': 'int32', 'ref': 'S', 'GT': 'int8', 'GQ': 'Int32', 'DP': 'Int32', }), sparse=True, mode="append" ) tiledb.open_dataframe(output_path, use_arrow=True) A = tiledb.open(output_path, ctx=ctx, mode='r') A ```

Now my trials:

# works
A.query(use_arrow=True, coords=True).df[:]

# works
A["chr2"]

# +
# kernel breaks with `realloc(): invalid pointer`
# A.df["chr2"]
# kernel breaks with `realloc(): invalid pointer`
# A.query(use_arrow=True, coords=True).df["chr2"]
# -

# empty result
A.multi_index["chr2"]

# works
A["chr1", 0]

# works
A.multi_index[:]

# empty result
A.multi_index[:, 0]

# empty result
A.multi_index["chr2", 0]

# ---------------------------------------------------------------------------
# IndexError                                Traceback (most recent call last)
# <ipython-input-22-20675d6deb0c> in <module>
#      12 
#      13 # IndexError: invalid index type: <class 'list'>
# ---> 14 A[[("chr1", 0), ("chr1", 1),]]
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__getitem__()
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.subarray()
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.index_domain_subarray()
#
# IndexError: invalid index type: <class 'list'>
A[[
    ("chr1", 0), 
    ("chr1", 1),
]]
ihnorton commented 3 years ago

Is there a way to keep the python kernel from crashing, even when having invalid input to tiledb?

kernel breaks with realloc(): invalid pointer

Not sure what is going on there, I will take a look and debug tomorrow.

How to do key-based indexing? (A[[("chr1", 0), ("chr1", 1),]])

I think what you want is: A.multi_index["chr1", [0,1]] which will select "chr1" on first dim, and select 0 and 1 on 2nd dim. [edit: use A.multi_index["chr1", [0,1], :, :, :]]

How to do multi_index indexing with DataFrame as return type?

A.df[<selection>] is for that -- same semantics as .multi_index (and shared implementation for the selection parsing).

ihnorton commented 3 years ago

For these:

# empty result
A.multi_index[:, 0]

# empty result
A.multi_index["chr2", 0]

# empty result
A.multi_index["chr2"]

It is a bug; I think we have a fix in progress in libtiledb core, but I will see if we can apply the substance of the following work around automatically in TileDB-Py.

RIght now you can work around it like this (place-holder : for each non-indexed dimension):

In [16]: A.multi_index["chr2", :, :, :, :]
retries: 0
Out[16]:
OrderedDict([('chrom', array([b'chr2', b'chr2'], dtype=object)),
             ('log10_len', array([1, 1], dtype=int8)),
             ('start', array([10108, 10108], dtype=int32)),
             ('alt', array([b'A', b'A'], dtype=object)),
             ('sample_id', array([b'C', b'D'], dtype=object)),
             ('end', array([10114, 10114], dtype=int32)),
             ('ref', array([b'AACCCT', b'AACCCT'], dtype=object)),
             ('GT', array([1, 1], dtype=int8)),
             ('GQ', array([60, 99], dtype=int32)),
             ('DP', array([39, 26], dtype=int32))])

Also, this works for .df:

In [11]: A.query(use_arrow=True, coords=True).df["chr2", :, :, :,:]
retries: 0
Out[11]:
  chrom  log10_len  start alt sample_id    end        ref  GT  GQ  DP
0  chr2          1  10108   A         C  10114  b'AACCCT'   1  60  39
1  chr2          1  10108   A         D  10114  b'AACCCT'   1  99  26

and this (for my suggestion):

In [14]: A.df["chr1", [0,1], :, :, :]
retries: 0
Out[14]:
  chrom  log10_len  start alt sample_id    end        ref  GT    GQ  DP
0  chr1          1  10108   A         A  10114  b'AACCCT'   1  79.0  12
1  chr1          1  10108   A         B  10114  b'AACCCT'   1   NaN   9
2  chr1          0  10143   C         B  10144       b'T'   1  65.0  34
3  chr1          0  10143   C         A  10144       b'T'   1  22.0  35
4  chr1          1  10108   A         E  10114  b'AACCCT'   1  26.0   9
5  chr1          1  10108   A         F  10114  b'AACCCT'   1  62.0   9

and this one (avoids the crash):

In [15]: A.query(use_arrow=True, coords=True).df["chr2", :, :, :, :]
retries: 0
Out[15]:
  chrom  log10_len  start alt sample_id    end        ref  GT  GQ  DP
0  chr2          1  10108   A         C  10114  b'AACCCT'   1  60  39
1  chr2          1  10108   A         D  10114  b'AACCCT'   1  99  26
Hoeze commented 3 years ago

Alright, thanks!

I also did another test for the key-value type access: grafik 30it/s is not too great, compared to RocksDB where we reach up to 40,000 it/s. Is there something I can do about?

ihnorton commented 3 years ago

Hi @hoeze, are you able to share any more details about how you are using RocksDB? I’m not very familiar with it yet, and we’d like to dig in to the comparison a bit more.

stavrospapadopoulos commented 3 years ago

Hi @Hoeze, a couple more notes on key-value queries:

  1. Not sure how you use RocksDB, but I would suggest you use a single string dimension in TileDB for such queries.
  2. You may want to tweak the tile capacity and compression for key-value queries (e.g., shrink the tile size, use no compression for local deployments).
  3. Issuing all key-value queries using multi_range will be significantly faster than single-point queries in a loop.
  4. We have some upcoming optimizations in the read algorithm that will benefit significantly even key-value queries.
  5. We have not optimized for (local) key-value performance yet, there are a couple of extra improvements we can do if this becomes critical (e.g., use mmap, optimize the tile format and read algorithm for key-value queries using hashing, etc.)

If you provided with more details, we could take a closer look and see if there are any low-hanging fruits for optimizations here.

Hoeze commented 3 years ago

Hi, I set up a RocksDB example for you. To run it, you need to download the database attached here.

```python from pathlib import Path import rocksdb class VariantDB: def __init__(self, path, rocksdb_options=None): if rocksdb_options is None: rocksdb_options = rocksdb.Options( create_if_missing=True, max_open_files=100, ) self.db = rocksdb.DB( path, rocksdb_options, read_only=True ) @staticmethod def _variant_to_byte(variant): return bytes(str(variant), 'utf-8') def _type(self, value): raise NotImplementedError() def _get(self, variant): if not variant.startswith('chr'): variant = 'chr%s' % variant return self.db.get(self._variant_to_byte(variant)) def __getitem__(self, variant): maf = self._get(variant) if maf: return self._type(maf) else: raise KeyError('This variant is not in the db') def __contains__(self, variant): return self._get(variant) is not None def get(self, variant, default=None): try: return self[variant] except KeyError: return default class VariantMafDB(VariantDB): def _type(self, value): return float(value) mafdb_path = "maf.db" mafdb = VariantMafDB(mafdb_path) variants = [ '1:10468:T>TAA', '1:10468:TCGCGG>T', "10:107494853:C>A", "10:107494857:C>A", "10:107494858:T>C", "10:107494873:C>T", "10:107494874:G>A", "10:107494905:GAGAA>G", "10:107494908:A>G", "10:107494929:T>C", "10:107494933:T>C", "10:107494935:G>A", "10:107494937:C>G", "10:107494941:CTTG>C", "10:107494942:T>A", "10:107494943:T>C", "10:107494960:G>T", "10:107494964:C>A", "10:107494979:G>A", "10:10749497:A>G", "10:107494988:T>C", "10:107494989:C>T", "10:10749498:C>T", "10:107494998:T>C", "10:10749499:G>A", "10:107495002:T>C", ] ```

Benchmark:

print(len(variants))
# 26
print([mafdb.get(v, 0) for v in variants])
# [0, 0, 0.00149653, 3.18451e-05, 3.18492e-05, 0.000223029, 0.014212, 3.18471e-05, 3.18573e-05, 3.18451e-05, 3.18634e-05, 3.18573e-05, 6.37349e-05, 3.18552e-05, 3.18431e-05, 0.000127502, 0.00350877, 3.18573e-05, 3.18471e-05, 3.18431e-05, 3.18492e-05, 0.0316323, 0.00012747, 3.18431e-05, 6.37105e-05, 3.18451e-05]
%timeit [mafdb.get(v, 0) for v in variants]
# 273 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Having a very fast TileDB solution with a single String dimension would already be a great improvement to us because:

However, I believe that TileDB should also be able to improve on RocksDB speed when it is provided with some well-defined dimensions.

I hope my benchmark is of some use for you :)

stavrospapadopoulos commented 3 years ago

Thanks for for the additional information @Hoeze, this is very valuable! We do have a lot of ideas on how to boost performance for this use case, as it is much simpler than the range queries we are currently performing and our current algorithms are an overkill. We will hopefully push them to the next couple of releases. Thanks again!

nguyenv commented 2 years ago

The bugs listed have been fixed as of 0.11.0 (switching out tiledb.from_dataframe for tiledb.from_pandas).

(tiledb-3.9) vivian@mangonada:~/TileDB-Py/tiledb$ python ~/tiledb-bugs/indexing_bugs.py
Deleting array at 'test.tdb'...
Creating array at 'test.tdb'...
OrderedDict([('end', array([10114, 10114, 10114, 10144, 10144, 10114, 10114, 10114],
      dtype=int32)), ('ref', array([b'AACCCT', b'AACCCT', b'AACCCT', b'T', b'T', b'AACCCT', b'AACCCT',
       b'AACCCT'], dtype=object)), ('GT', array([1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)), ('GQ', array([79,  0, 60, 65, 22, 99, 26, 62], dtype=int32)), ('DP', array([12,  9, 39, 34, 35, 26,  9,  9], dtype=int32)), ('chrom', array([b'chr1', b'chr1', b'chr2', b'chr1', b'chr1', b'chr2', b'chr1',
       b'chr1'], dtype=object)), ('log10_len', array([1, 1, 1, 0, 0, 1, 1, 1], dtype=int8)), ('start', array([10108, 10108, 10108, 10143, 10143, 10108, 10108, 10108],
      dtype=int32)), ('alt', array([b'A', b'A', b'A', b'C', b'C', b'A', b'A', b'A'], dtype=object)), ('sample_id', array([b'A', b'B', b'C', b'B', b'A', b'D', b'E', b'F'], dtype=object))])

We will be benchmarking the code soon.