Congyuwang / RocksDict

Python fast on-disk dictionary / RocksDB & SpeeDB Python binding
https://congyuwang.github.io/RocksDict/rocksdict.html
MIT License

How to get WideColumns data in "Raw Mode" from rocksdb #122

Closed fmvin closed 2 months ago

fmvin commented 2 months ago

The db is created by a C++ app and contains a lot of WideColumns data. Is it possible to access this data using RocksDict? In the code example below, the value (v) always has length 0, but the keys (k) are shown as expected.

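from rocksdict import Rdict, Options

# GLB_PATH is the path to the existing database created by the C++ app
cf_lst = Rdict.list_cf(GLB_PATH)
opts = Options(raw_mode=True)
db = Rdict(path=GLB_PATH, options=opts, column_families={cf_lst[1]: opts})
db_cf1 = db.get_column_family(cf_lst[1])

# keys print as expected, but every value has length 0
for k, v in db_cf1.items():
    print(f'k={k}, v size={len(v)}')

db.close()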

OS: Windows 2019 Server
compiler: msvc 19.39.33523
rocksdb: v9.0
RocksDict: v0.3.23

Congyuwang commented 2 months ago

Looks like WideColumns is not yet supported here, and we will need to add APIs like GetEntity and PutEntity to support it.

Congyuwang commented 2 months ago

Quoting rocksdb wide column doc:

The classic Get, MultiGet, GetMergeOperands, and Iterator::value APIs return the value of the default column when they encounter an entity, while the new APIs GetEntity, MultiGetEntity, and Iterator::columns return any plain key-value in the form of an entity with a single column, namely the anonymous default column.

The iterator returns the value of the default column, which is empty here. Reading the other columns needs the columns() API, which is not yet supported by rocksdict.
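To make the quoted behaviour concrete, here is a toy model in plain Python. It does not call RocksDB or rocksdict at all; the function names `classic_get` and `get_entity_model`, and the use of a dict to stand in for a stored entity, are assumptions made purely for illustration.

```python
# Toy model only: a stored value is either plain bytes (a classic key-value)
# or a dict of column name -> column value (a wide-column entity).
DEFAULT_COLUMN = b""  # the anonymous default column is modelled as an empty name


def classic_get(stored):
    """Model of Get / Iterator::value: only the default column is visible."""
    if isinstance(stored, dict):
        # An entity without a default column yields an empty value.
        return stored.get(DEFAULT_COLUMN, b"")
    return stored


def get_entity_model(stored):
    """Model of GetEntity / Iterator::columns: everything looks like an entity."""
    if isinstance(stored, dict):
        return sorted(stored.items())
    # A plain key-value appears as a single anonymous default column.
    return [(DEFAULT_COLUMN, stored)]


entity = {b"price": b"101.5", b"size": b"300"}  # entity with no default column
plain = b"hello"

print(classic_get(entity), get_entity_model(entity))
print(classic_get(plain), get_entity_model(plain))
```

In this model the classic read of `entity` is empty, which matches the zero-length values seen in the original report.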

Congyuwang commented 2 months ago

Just curious, what do you use WideColumns for?

For the moment, if the objects are not that large, I would suggest using some custom serialization and deserialization for the entities. The APIs related to WideColumns have not yet been exposed in the C interface by rocksdb, so I would need some time to wait for rocksdb to design a proper C interface for them.
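A minimal sketch of what such custom serialization could look like, using only the Python standard library. The length-prefixed layout and the names `pack_columns` and `unpack_columns` are hypothetical; this is not RocksDB's internal wide-column encoding, and the writer on the C++ side would have to produce the same layout for it to work.

```python
import struct


def pack_columns(columns):
    """Encode {name: value} (both bytes) into one length-prefixed bytes blob."""
    parts = [struct.pack("<I", len(columns))]  # number of columns
    for name, value in sorted(columns.items()):
        parts.append(struct.pack("<II", len(name), len(value)))
        parts.append(name)
        parts.append(value)
    return b"".join(parts)


def unpack_columns(blob):
    """Decode a blob produced by pack_columns back into {name: value}."""
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    columns = {}
    for _ in range(count):
        name_len, value_len = struct.unpack_from("<II", blob, offset)
        offset += 8
        name = blob[offset:offset + name_len]
        offset += name_len
        value = blob[offset:offset + value_len]
        offset += value_len
        columns[name] = value
    return columns


row = {b"price": b"101.5", b"size": b"300"}
blob = pack_columns(row)        # store this as an ordinary value in raw mode
restored = unpack_columns(blob)  # decode it after reading the value back
```

The blob can be stored and read with the classic put and get calls, since it is an ordinary value rather than an entity.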

Congyuwang commented 2 months ago

Related: https://github.com/facebook/rocksdb/issues/12635

fmvin commented 2 months ago

Some kind of in-memory tables with an arbitrary number of columns in each row, which are frequently updated. I have found that using WideColumns fits well with my app architecture and allowed me to migrate easily from kx kdb.

For iterating with Python, I'm going to create special copies of several tables, using the MessagePack serializer for the entities in the way you proposed. But it adds some overhead.

Congyuwang commented 2 months ago

I've already drafted an up-stream PR: https://github.com/facebook/rocksdb/pull/12653

Congyuwang commented 2 months ago

Check wide_columns_raw examples with pip install rocksdict==0.3.24b1 (pypi link).

Tell me if it works 🙂.

fmvin commented 2 months ago

No success. With the real db I cannot access wide columns from a column family (CF). Please provide a simple example of how to use the get_entity method with a CF. The db itself works fine; I checked it with the ldb tool.

Congyuwang commented 2 months ago

I'm about to release beta.2, which will make opening a DB created by other languages (C++, Java, Rust) much more straightforward.

fmvin commented 2 months ago

It seems I didn’t clearly explain the problem. In other words, I can’t figure out how to pass CF to the get_entity method.

Congyuwang commented 2 months ago

Ok. Try pip install rocksdict==0.3.24b2, and

from rocksdict import Rdict

# This will automatically load latest options and column families.
# Note also that this is automatically RAW MODE,
# as it knows that the db is not created by RocksDict.
db = Rdict("db_path")

# list column families
cfs = Rdict.list_cf("db_path")
print(cfs)

# use one of the column families
cf1 = db.get_column_family(cfs[1])

# iterate through all wide columns in cf1
for k, v in cf1.entities():
    print(f"{k} -> {v}")

# or query specific entity in cf1
print(cf1.get_entity(b"some_key"))

Tell me if it works.

Congyuwang commented 2 months ago

The design of rocksdict is that we do not pass a cf argument to any of get, put, iter, get_entity, and so on. Instead, use some_cf = db.get_column_family("some_cf_name"), which returns an object with exactly the same methods as Rdict, including get, put, delete, and get_entity. All of these operations work only on the data in some_cf.
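As a sketch of that pattern, the helper below takes an already opened database and a column family name, obtains the handle, and collects the entities through it. The helper names `dump_entities` and `open_and_dump` are hypothetical; only `Rdict`, `get_column_family`, `entities`, and `close` come from the examples in this thread, and `rocksdict` is imported lazily so that the first helper works with any object offering those methods.

```python
def dump_entities(db, cf_name):
    """Collect every wide-column entity stored in one column family."""
    cf = db.get_column_family(cf_name)  # handle scoped to cf_name
    # No cf argument is passed to entities(): the handle already knows its CF.
    return {key: columns for key, columns in cf.entities()}


def open_and_dump(path, cf_name):
    """Open an existing database at `path` and dump one column family."""
    from rocksdict import Rdict  # assumes a release with wide-column support

    db = Rdict(path)
    try:
        return dump_entities(db, cf_name)
    finally:
        db.close()
```

For example, `open_and_dump("db_path", cfs[1])` would return a dict mapping each key to the columns that `entities()` yields for it.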