Open Hoeze opened 3 years ago
Is there a way to keep the python kernel from crashing, even when having invalid input to tiledb?
kernel breaks with
realloc(): invalid pointer
Not sure what is going on there, I will take a look and debug tomorrow.
How to do key-based indexing? (A[[("chr1", 0), ("chr1", 1),]])
I think what you want is: `A.multi_index["chr1", [0,1]]`, which will select "chr1" on the first dim, and select 0 and 1 on the 2nd dim. [edit: use `A.multi_index["chr1", [0,1], :, :, :]`]
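To make the per-dimension semantics concrete, here is a minimal pure-Python model. It does not call TileDB: `matches`, `select`, and the sample cells are hypothetical, and the inclusive treatment of slice bounds is an assumption about how `multi_index` ranges behave. Each positional selector constrains one dimension, and `:` (that is, `slice(None)`) leaves a dimension unconstrained.

```python
def matches(value, selector):
    """Return True if one coordinate value satisfies one per-dimension selector."""
    if isinstance(selector, slice):
        # Assumption: bounds are inclusive; a missing bound means "unbounded".
        lower_ok = selector.start is None or value >= selector.start
        upper_ok = selector.stop is None or value <= selector.stop
        return lower_ok and upper_ok
    if isinstance(selector, list):
        # A list selects the union of its entries on that dimension.
        return any(matches(value, item) for item in selector)
    return value == selector


def select(cells, *selectors):
    """Keep the cells whose coordinates satisfy every selector (one per dimension)."""
    return [
        cell for cell in cells
        if all(matches(value, sel) for value, sel in zip(cell, selectors))
    ]


# Hypothetical coordinates in the order chrom, log10_len, start, alt, sample_id.
cells = [
    ("chr1", 1, 10108, "A", "A"),
    ("chr1", 0, 10143, "C", "B"),
    ("chr2", 1, 10108, "A", "C"),
]
everything = slice(None)  # what a bare ":" turns into inside [...]

chr1_cells = select(cells, "chr1", [0, 1], everything, everything, everything)
print(chr1_cells)
```

The model expects one selector per dimension, which mirrors the edited suggestion above.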
How to do multi_index indexing with DataFrame as return type?
`A.df[<selection>]` is for that -- same semantics as `.multi_index` (and shared implementation for the selection parsing).
For these:
# empty result
A.multi_index[:, 0]
# empty result
A.multi_index["chr2", 0]
# empty result
A.multi_index["chr2"]
It is a bug; I think we have a fix in progress in libtiledb core, but I will see if we can apply the substance of the following workaround automatically in TileDB-Py.
Right now you can work around it like this (placeholder `:` for each non-indexed dimension):
In [16]: A.multi_index["chr2", :, :, :, :]
retries: 0
Out[16]:
OrderedDict([('chrom', array([b'chr2', b'chr2'], dtype=object)),
('log10_len', array([1, 1], dtype=int8)),
('start', array([10108, 10108], dtype=int32)),
('alt', array([b'A', b'A'], dtype=object)),
('sample_id', array([b'C', b'D'], dtype=object)),
('end', array([10114, 10114], dtype=int32)),
('ref', array([b'AACCCT', b'AACCCT'], dtype=object)),
('GT', array([1, 1], dtype=int8)),
('GQ', array([60, 99], dtype=int32)),
('DP', array([39, 26], dtype=int32))])
Also, this works for `.df`:
In [11]: A.query(use_arrow=True, coords=True).df["chr2", :, :, :,:]
retries: 0
Out[11]:
chrom log10_len start alt sample_id end ref GT GQ DP
0 chr2 1 10108 A C 10114 b'AACCCT' 1 60 39
1 chr2 1 10108 A D 10114 b'AACCCT' 1 99 26
and this (for my suggestion):
In [14]: A.df["chr1", [0,1], :, :, :]
retries: 0
Out[14]:
chrom log10_len start alt sample_id end ref GT GQ DP
0 chr1 1 10108 A A 10114 b'AACCCT' 1 79.0 12
1 chr1 1 10108 A B 10114 b'AACCCT' 1 NaN 9
2 chr1 0 10143 C B 10144 b'T' 1 65.0 34
3 chr1 0 10143 C A 10144 b'T' 1 22.0 35
4 chr1 1 10108 A E 10114 b'AACCCT' 1 26.0 9
5 chr1 1 10108 A F 10114 b'AACCCT' 1 62.0 9
and this one (avoids the crash):
In [15]: A.query(use_arrow=True, coords=True).df["chr2", :, :, :, :]
retries: 0
Out[15]:
chrom log10_len start alt sample_id end ref GT GQ DP
0 chr2 1 10108 A C 10114 b'AACCCT' 1 60 39
1 chr2 1 10108 A D 10114 b'AACCCT' 1 99 26
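Until the fix lands, the placeholder workaround can be wrapped in a small helper. This is only a sketch: `pad_selection` is a hypothetical name, not part of TileDB-Py, and it simply appends `slice(None)` (the object a bare `:` becomes) until the selection has one entry per dimension.

```python
def pad_selection(selection, ndim):
    """Pad an indexing selection with slice(None) so it has ndim entries."""
    # A single selector such as "chr2" arrives as a bare object, not a tuple.
    if not isinstance(selection, tuple):
        selection = (selection,)
    if len(selection) > ndim:
        raise IndexError(
            f"selection has {len(selection)} entries but the array has {ndim} dimensions"
        )
    return selection + (slice(None),) * (ndim - len(selection))


padded = pad_selection("chr2", 5)
print(padded)

# Hypothetical usage against an open array A with five dimensions:
#   A.multi_index[pad_selection("chr2", A.schema.domain.ndim)]
```

Passing the padded tuple inside the brackets is equivalent to writing the selectors separated by commas, because Python hands a comma-separated subscript to `__getitem__` as a tuple.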
Alright, thanks!
I also did another test for the key-value type access: 30 it/s is not too great, compared to RocksDB where we reach up to 40,000 it/s. Is there something I can do about it?
Hi @hoeze, are you able to share any more details about how you are using RocksDB? I’m not very familiar with it yet, and we’d like to dig in to the comparison a bit more.
Hi @Hoeze, a couple more notes on key-value queries:
Issuing all key-value queries using `multi_range` will be significantly faster than single-point queries in a loop.
mmap, optimize the tile format and read algorithm for key-value queries using hashing, etc.)
If you provide more details, we could take a closer look and see if there is any low-hanging fruit for optimization here.
Hi, I set up a RocksDB example for you. To run it, you need to download the database attached here.
Benchmark:
print(len(variants))
# 26
print([mafdb.get(v, 0) for v in variants])
# [0, 0, 0.00149653, 3.18451e-05, 3.18492e-05, 0.000223029, 0.014212, 3.18471e-05, 3.18573e-05, 3.18451e-05, 3.18634e-05, 3.18573e-05, 6.37349e-05, 3.18552e-05, 3.18431e-05, 0.000127502, 0.00350877, 3.18573e-05, 3.18471e-05, 3.18431e-05, 3.18492e-05, 0.0316323, 0.00012747, 3.18431e-05, 6.37105e-05, 3.18451e-05]
%timeit [mafdb.get(v, 0) for v in variants]
# 273 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
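For reference, 26 lookups in 273 µs works out to about 10.5 µs per lookup, or roughly 95,000 lookups per second. Outside IPython, the same kind of measurement can be sketched with the standard `timeit` module. The `lookups_per_second` helper and the in-memory `store` below are hypothetical stand-ins, not the RocksDB setup from this benchmark; any callable with a `get`-like signature can be passed in.

```python
import timeit


def lookups_per_second(get, keys, number=1000, repeat=5):
    """Estimate point-lookup throughput for a get(key) callable."""
    def batch():
        return [get(key) for key in keys]

    # Take the fastest of several runs to reduce scheduling noise.
    best = min(timeit.repeat(batch, number=number, repeat=repeat))
    return len(keys) * number / best


# Stand-in key-value store with 26 keys, mirroring len(variants) above.
store = {f"key{i}": i * 0.001 for i in range(26)}
keys = list(store)

rate = lookups_per_second(lambda key: store.get(key, 0), keys)
print(f"{rate:,.0f} lookups/s")
```

Running the same helper against both backends with identical keys keeps the comparison on equal footing.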
1. Not sure how you use RocksDB, but I would suggest you use a single string dimension in TileDB for such queries.
This RocksDB is very simple: its key is a string and the value a float. However, this also means that RocksDB has no idea about the data structure.
- Using the variant schema from my first post (without `sample_id`, `log10_len` and only a single attribute), TileDB should have a great chance to beat RocksDB since ordered requests are always hitting the same fragment.
- Also, keeping the different dimensions allows for range queries ("get all variants on chr1", etc.)
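For the single-string-dimension suggestion quoted above, one possible key layout is sketched below. The `Variant` tuple, the `chrom:start:ref>alt` format, and the zero-padding width are all assumptions made for illustration, not the format used in the attached database. Zero-padding the position keeps lexicographic key order aligned with genomic order, so a "get all variants on chr1" request still maps to one contiguous key range.

```python
import bisect
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    start: int
    ref: str
    alt: str


def encode_key(variant):
    """Build a sortable string key; the position is zero-padded to 10 digits."""
    return f"{variant.chrom}:{variant.start:010d}:{variant.ref}>{variant.alt}"


def decode_key(key):
    """Invert encode_key."""
    chrom, start, alleles = key.split(":")
    ref, alt = alleles.split(">")
    return Variant(chrom, int(start), ref, alt)


variants = [
    Variant("chr2", 10108, "AACCCT", "A"),
    Variant("chr1", 10143, "T", "C"),
    Variant("chr10", 500, "G", "T"),
    Variant("chr1", 10108, "AACCCT", "A"),
]
keys = sorted(encode_key(v) for v in variants)

# All chr1 keys sit between "chr1:" and "chr1;" (';' directly follows ':' in ASCII).
lo = bisect.bisect_left(keys, "chr1:")
hi = bisect.bisect_left(keys, "chr1;")
chr1_keys = keys[lo:hi]
print(chr1_keys)
```

The trade-off is that range queries on the later fields (for example, by `alt` alone) are no longer contiguous, which is the argument for keeping separate dimensions.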
2. You may want to tweak the tile capacity and compression for key-value queries (e.g., shrink the tile size, use no compression for local deployments).
Thanks for the hint. I tried without any compression filters but I'm still reaching only ~50it/s.
3. Issuing all key-value queries using `multi_range` will be significantly faster than single-point queries in a loop.
Yes, but I think this is only possible with a single key, right? I.e. querying `[("chr1", 0, 10108, "A"), ("chr1", 1, 10143, "C"),]` at once will not work, I believe?
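One way to batch such point lookups despite per-dimension selection is to query the union of values on each dimension and then filter the result down to the exact tuples. The sketch below models that in pure Python, assuming that per-dimension selectors combine as a cross product; `cross_product_query`, the `stored` cells, and their values are hypothetical and no TileDB call is made.

```python
from itertools import product

wanted = [("chr1", 0, 10108, "A"), ("chr1", 1, 10143, "C")]

# Union of requested values per dimension.
per_dim = [sorted(set(values)) for values in zip(*wanted)]

# A per-dimension query addresses every combination, not just the wanted tuples.
candidates = set(product(*per_dim))


def cross_product_query(cells, selectors):
    """Return the cells whose every coordinate is listed for its dimension."""
    return {
        coords: value
        for coords, value in cells.items()
        if all(c in allowed for c, allowed in zip(coords, selectors))
    }


# Hypothetical stored cells: coordinates mapped to an attribute value.
stored = {
    ("chr1", 0, 10108, "A"): 0.3,
    ("chr1", 1, 10143, "C"): 0.4,
    ("chr1", 1, 10108, "A"): 0.1,
    ("chr1", 0, 10143, "C"): 0.2,
    ("chr2", 1, 10108, "A"): 0.9,
}

hits = cross_product_query(stored, per_dim)         # superset of what was asked for
exact = {key: hits[key] for key in wanted if key in hits}  # post-filter to the wanted tuples
print(len(candidates), len(hits), exact)
```

The over-selection grows with the number of distinct values per dimension, so this only pays off when the requested tuples share most of their coordinates.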
Having a very fast TileDB solution with a single String dimension would already be a great improvement to us because:
However, I believe that TileDB should also be able to improve on RocksDB speed when it is provided with some well-defined dimensions.
I hope my benchmark is of some use for you :)
Thanks for the additional information @Hoeze, this is very valuable! We do have a lot of ideas on how to boost performance for this use case, as it is much simpler than the range queries we are currently performing and our current algorithms are overkill. We will hopefully push them to the next couple of releases. Thanks again!
The bugs listed have been fixed as of 0.11.0 (switching out `tiledb.from_dataframe` for `tiledb.from_pandas`).
(tiledb-3.9) vivian@mangonada:~/TileDB-Py/tiledb$ python ~/tiledb-bugs/indexing_bugs.py
Deleting array at 'test.tdb'...
Creating array at 'test.tdb'...
OrderedDict([('end', array([10114, 10114, 10114, 10144, 10144, 10114, 10114, 10114],
dtype=int32)), ('ref', array([b'AACCCT', b'AACCCT', b'AACCCT', b'T', b'T', b'AACCCT', b'AACCCT',
b'AACCCT'], dtype=object)), ('GT', array([1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)), ('GQ', array([79, 0, 60, 65, 22, 99, 26, 62], dtype=int32)), ('DP', array([12, 9, 39, 34, 35, 26, 9, 9], dtype=int32)), ('chrom', array([b'chr1', b'chr1', b'chr2', b'chr1', b'chr1', b'chr2', b'chr1',
b'chr1'], dtype=object)), ('log10_len', array([1, 1, 1, 0, 0, 1, 1, 1], dtype=int8)), ('start', array([10108, 10108, 10108, 10143, 10143, 10108, 10108, 10108],
dtype=int32)), ('alt', array([b'A', b'A', b'A', b'C', b'C', b'A', b'A', b'A'], dtype=object)), ('sample_id', array([b'A', b'B', b'C', b'B', b'A', b'D', b'E', b'F'], dtype=object))])
We will be benchmarking the code soon.
Hi @ihnorton, sorry to bother once more, but I think I found a couple of bugs in conjunction with indexing.
Setup:
Now my trials:
- How to do key-based indexing? (`A[[("chr1", 0), ("chr1", 1),]]`)
- How to do `multi_index` indexing with DataFrame as return type?