erthink / libmdbx

One of the fastest embeddable key-value ACID databases without WAL. libmdbx surpasses the legendary LMDB in terms of reliability, features, and performance.
https://erthink.github.io/libmdbx/

Python binding #147

Open · guozanhua opened this issue 3 years ago

erthink commented 3 years ago

PR is welcome!

erthink commented 3 years ago

Please use the new C++ API as a basis/template.

Thermi commented 3 years ago

I started work using pybind11.

erthink commented 3 years ago

I started work using pybind11.

But please take a look at the modern C++ API.

Thermi commented 3 years ago

I'm doing that.

Thermi commented 3 years ago

I need some help with writing the type caster from mdbx::slice to byte[] and other Python types. The idea is to convert to Python types early, because Python provides native methods for slicing arrays and everything else that mdbx::slice provides. At the moment, writing the type caster still goes over my head. I uploaded the code to a separate branch in my repo; any help with writing it is greatly appreciated! The code is at https://github.com/Thermi/libmdbx/tree/python-bindings, contained in py_mdbx.cpp. The command for building is in build_py_bindings.sh. The code is not supposed to be mergeable right now; it's just quick and dirty for developing the bindings. Once I've got that all figured out, I'll implement the logistics for integrating it into CMake and the GNUmakefile.

EDIT: build_py_bindings.sh references the system libraries instead of the ones in the working directory, because I have libmdbx installed as a package and the headers there are the same as the ones I am currently developing with. No worries; I can change that if desired.

erthink commented 3 years ago

As a traditional key-value store, libmdbx handles keys and values as byte sequences, which in some cases are treated as uint32_t or uint64_t instances. This assumes that the user's data is somehow serialized when placed into the database and deserialized when read back.

In most cases, simple byte-by-byte copying of field values from the stored data structures is used for serialization. This is not a problem for C/C++, where data is very often handled this way, and therefore mdbx::slice is enough as a basis. But for Python (and many other languages with a dynamic and/or object type system), an explicit, customizable and/or universal fully-fledged serialization is required.

Python has built-in universal serialization (pickle), and it is definitely worth using. However, this alone is surely not enough, since the data could then only be written/read from the Python environment, and in all other cases it would be extremely difficult. I think we should provide some mechanism that allows different serialization methods to be used, for instance: Python-native, MessagePack, custom, etc. But I'm only familiar with Python at a basic level, so I don't know how best to implement it technically.
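Just to sketch the idea (the class names below are hypothetical, not a proposal for an actual API), such a pluggable layer could look roughly like this:

```python
import pickle

class Serializer:
    """Interface a database/cursor wrapper could accept at construction."""
    def dumps(self, obj) -> bytes:
        raise NotImplementedError
    def loads(self, data: bytes):
        raise NotImplementedError

class PickleSerializer(Serializer):
    """Python-native serialization; the result is only readable from Python."""
    def dumps(self, obj) -> bytes:
        return pickle.dumps(obj)
    def loads(self, data: bytes):
        return pickle.loads(data)

class RawBytesSerializer(Serializer):
    """Pass-through for callers that already handle raw byte strings."""
    def dumps(self, obj) -> bytes:
        return bytes(obj)
    def loads(self, data: bytes):
        return data
```

A MessagePack-based serializer (e.g. via the third-party msgpack package) or a custom one could then be plugged in the same way, keeping the stored data readable from other languages.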

Thermi commented 3 years ago

Alright, I got that problem fixed. I decided on passing byte arrays to Python as early as possible, and on using the native tuple class where mdbx slices are used to return a pair of pointers with sizes.

Thermi commented 3 years ago

I made the design decision to provide a Pythonized API in the main package and to expose the underlying libmdbx API via pybind11, so the internals are accessible to users who wish to use those instead. It also makes it much easier to implement the Pythonized API, because I can maintain large parts of the code in Python instead of C++.
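Roughly, the layering could look like the sketch below; the module and class names are hypothetical and only illustrate the split between the raw bindings and the Pythonized facade:

```python
# Hypothetical layout (names are illustrative): `_bindings` is the
# low-level module exposing the libmdbx calls almost one-to-one,
# generated or hand-declared via pybind11/ctypes.
from . import _bindings

class Environment:
    """Pythonic facade: context management, exceptions, no raw handles."""

    def __init__(self, path: str):
        self._env = _bindings.env_create()
        _bindings.env_open(self._env, path)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        _bindings.env_close(self._env)

# Users who want the internals can still import `_bindings` directly.
```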

Thermi commented 3 years ago

I had to do it using the C API, and that has worked so far. Here's the code: https://github.com/Thermi/libmdbx/tree/python-bindings. Check python/ctypes/

The directory will change at some point; it is laid out this way so I can easily distinguish locally between the different implementation attempts.
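For reference, the core of the ctypes approach looks roughly like this minimal sketch (assuming the shared library can be found as libmdbx.so; only one function declaration is shown as an example):

```python
import ctypes
import ctypes.util

# Load the shared library; the exact name/path may differ per platform.
_lib = ctypes.CDLL(ctypes.util.find_library("mdbx") or "libmdbx.so")

class MDBXVal(ctypes.Structure):
    """Mirrors the C MDBX_val (iovec-compatible: pointer + length)."""
    _fields_ = [("iov_base", ctypes.c_void_p),
                ("iov_len", ctypes.c_size_t)]

    def to_bytes(self) -> bytes:
        # Copy the referenced memory into an owned bytes object as early
        # as possible, so the data remains valid outside the transaction.
        return ctypes.string_at(self.iov_base, self.iov_len)

# One declaration shown as an example: int mdbx_env_create(MDBX_env **penv).
# The real bindings declare every exported function like this.
_lib.mdbx_env_create.argtypes = [ctypes.POINTER(ctypes.c_void_p)]
_lib.mdbx_env_create.restype = ctypes.c_int
```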

erthink commented 3 years ago

Thanks to @Thermi, this feature is 80% done. Please try the python-bindings branch and take a look at the TODO list.

I will continue reviewing the code (so the TODO list will be updated) and will try to find some time to work on these TODO items.

Thermi commented 3 years ago

I have some comments on the items in the TODO list.

Q: is the python/libmdbx/ subdirectory required instead of just python/?

It makes it easy to deal with, and keeps module-specific files separate from things like, for example, your TODO list. :P

Q: are names like MDBXFoo reasonable instead of using namespace(s) or MDBX_Foo?

Yes. It follows the Python convention of using CamelCase for class names and snake_case for function names (PEP 8).

Q: could we use some automation to update, synchronize and/or check the definitions and doxygen comments?

I think that's kind of a premium item, in that the libmdbx API is supposed to be stable, and thus the documentation should be fine once it has been manually checked.

cmake: replace the use of sed with CMake's configure_file().

I did that because that way it's consistent with what's in the GNUmakefile.

integration: provide & fix building from both amalgamated and non-amalgamated sources.

Is it broken right now when built from amalgamated sources?

integration: manually check & fix all building variants with all libmdbx cases (static, solib/dll, windows-without-crt, with/without LTO).

Shouldn't matter much, because only the DLL is relevant for the bindings, and it evidently has to be compatible with Python. That's then a Python ctypes problem, not an mdbx problem.

integration: manually check & fix building by: clang 4.0..latest, gcc 4.8..latest.

Same as for 4. Python problem.

ci: provide scripts for checking builds on Windows and FreeBSD.

The existing Python unittest should be OS-agnostic, so execution should work out of the box. Do we have CI integration with Windows and BSD yet?

integration: manually check & fix for Buildroot (i.e. cross-compilation) with all python-related options.

Shouldn't matter, unless the DLL and Python paths are not adjusted (because then the DLL would be of a different architecture than the Python binaries).

integration: manually check & fix for Solaris/OpenIndiana, OpenBSD, NetBSD, DragonflyBSD, Android, iOS.

A pure Python topic again, because Python itself should be OS-agnostic at that layer.

code: fix the hardcoded platform-dependent error codes (MDBX_ENODATA, MDBX_EINVAL, MDBX_EACCESS, MDBX_ENOMEM, MDBX_EROFS, MDBX_ENOSYS, MDBX_EIO, MDBX_EPERM, MDBX_EINTR, MDBX_ENOFILE, MDBX_EREMOTE, etc.?).

What would be the advantage? It should be easy for devs to write code that catches the different error codes, but if we introduce OS-specific error codes that might not be defined on all platforms, devs will need to check for their existence/declaration using hasattr() and/or try/except blocks in the code dealing with them. That's not really useful. The POSIX error codes won't ever change, so statically defining them right now should be fine. Adding the NT error codes as alternative values for their POSIX correspondents should be feasible.
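To illustrate that concern (the constant here is just an example of an optional POSIX code):

```python
import errno

# ENODATA is optional in POSIX, so it may be missing from the errno
# module on some platforms; every consumer would need a guard like this.
if hasattr(errno, "ENODATA"):
    NO_DATA = errno.ENODATA
else:
    NO_DATA = None  # skip / fall back on platforms without it
```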

ci: add checking within CentOS 6.x .. latest (this is also applicable for mainstream).

That will just require some more distro-specific code dealing with yum and the package names there. CentOS 6 is also already EOL, so better to target CentOS 7 instead.

erthink commented 3 years ago

cmake: replace the use of sed with CMake's configure_file().

I did that because that way it's consistent with what's in the GNUmakefile.

Using sed is a Unix/Makefile way, but it shouldn't be used from CMake, since CMake provides configure_file() for exactly this.

ci: provide scripts for checking builds on Windows and FreeBSD. integration: manually check & fix for Solaris/OpenIndiana, OpenBSD, NetBSD, DragonflyBSD, Android, iOS.

In these and many other cases we should verify the build script/process itself, not the functionality of the library or the bindings. This may be a challenge in some environments, e.g. Buildroot or cross-compilation.

code: fix the hardcoded platform-dependent error codes...

What would be the advantage? ...

There are two major points:

  1. POSIX defines the symbolic names (C preprocessor macros/defines) for errors, but not their values. Moreover, some errors are optional, and the set of such errors differs from system to system. So you shouldn't hardcode any of that, but (at least) import the values from the system's includes (paying attention to cross-compiling cases).
  2. In many cases, various system errors can occur that depend on the platform, and there is no way to enumerate them all. So there is no other solution than to just return (from the library API) the system error codes as-is, i.e. it is unreasonable to provide yet another abstraction layer for error codes. However, there are also many situations where we need to return error codes for which the system already defines specific codes (e.g. EINVAL, ENOMEM, etc.). So it is reasonable to reuse these system codes instead of introducing new ones and forcing users to handle more peculiar error codes.

ci: add checking within CentOS 6.x .. latest (this is also applicable for mainstream).

That will just require some more distro-specific code dealing with yum and the package names there. CentOS 6 is also already EOL, so better to target CentOS 7 instead.

CentOS 6 is actually a nickname (or a free repackaging) of RHEL 6, whose EOL was scheduled for 2025 (and may be prolonged).

Thermi commented 3 years ago

About CentOS 6: it has been EOL since November 30th: https://wiki.centos.org/About/Product

erthink commented 3 years ago

About CentOS 6: it has been EOL since November 30th: https://wiki.centos.org/About/Product

For clarification: I mean we should provide support for RHEL 6, but may use CentOS 6 for this.

erthink commented 3 years ago

Just in case, there is a simple, generally accepted solution for system-aware error codes:

Thermi commented 3 years ago

I just found the errno module. It exports all the error codes. I'll make a PR to use it and change all the currently hardcoded OS-specific errors.
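A minimal sketch of what that replacement could look like (the mapping below is illustrative, not the actual one in the PR):

```python
import errno

# Take the platform's own values from the errno module instead of
# hardcoding Linux numbers.
MDBX_EINVAL = errno.EINVAL
MDBX_ENOMEM = errno.ENOMEM
MDBX_EACCESS = errno.EACCES
MDBX_EPERM = errno.EPERM
MDBX_EINTR = errno.EINTR
MDBX_EIO = errno.EIO
MDBX_EROFS = errno.EROFS
MDBX_ENOSYS = errno.ENOSYS
# Optional POSIX codes may be absent on some platforms.
MDBX_ENODATA = getattr(errno, "ENODATA", None)
MDBX_EREMOTE = getattr(errno, "EREMOTE", None)
```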

erthink commented 3 years ago

I just found the errno module. It exports all the error codes. I'll make a PR to use it and change all the currently hardcoded OS-specific errors.

OK, but note that the above solution avoids copy-pasting some boilerplate "magic".

Thermi commented 3 years ago

The above solution is not maintainable, because ctypes has no C code parser. If it did, I wouldn't have needed to declare all the functions. :P

Thermi commented 3 years ago

See #213 for the declarations using errno instead of explicit values.