adewes / blitzdb

Blitz is a document-oriented database for Python that is backend-agnostic. It comes with a flat-file database for JSON documents and provides MongoDB-like querying capabilities.
http://blitzdb.readthedocs.org
MIT License
330 stars 37 forks source link

proof of concept for BTree index #12

Closed jxieeducation closed 10 years ago

jxieeducation commented 10 years ago

This is an extremely simple proof of concept for https://github.com/adewes/blitzdb/issues/9

This allows the backend to adopt a BTree model using the BTrees pip library (sudo pip install BTrees, did not add to the requirements.txt yet). While attempting to add the BTree, I found a few problems with using a BTree.

  1. querying the whole set: if user tries to store a tuple, it's easy to add both elements to the hash. however, with trees, it's much more complicated.
  2. search: if an object's built in compare function is wrong / raises an error, the BTree can't do anything about it, because it needs to compare the elements. It's easy to make a BTree of the hashes of the elements, however, that pretty much defeats the purpose of using a BTree in the first place.

While writing the simple add / remove functions, I assumed that caching is not needed for BTrees, because it offers little adv. as the tree goes down. I also believe a BTree will also be clunky when elements get large, because it runs in O(n), while the hashmap runs in O(1).

Please let me know what you think?

adewes commented 10 years ago

Hey @jxieeducation ,

first of all thanks for your contribution, this is really great! Here are my thoughts:

1) I had a look at the BTrees documentation and it seems to rely heavily on ZODB and its persistence module, so I'm not entirely sure if it's the right thing to use here. Basically, I think we want a BTree that can be incrementally written and read from disk (to avoid loading it into memory when the database starts up) 2) BTrees allow lookup (as well as insertion and deletion) operations in O(log n) time (http://en.wikipedia.org/wiki/B-tree), as opposed to O(1) for lookup in hash tables. The main advantage of a BTree is that it can be written and read incrementally, thus without loading its entire content into memory first, and that range queries can be performed efficiently. 3) Like with the normal file-based index, we need to insert key/value pairs into the BTree to be able to do rich comparisons between values, search operations and range queries. I think the update function of OOBTree allows for storing key/value pairs in the tree.

Summarizing, I think it would be interesting to see if there's a more lightweight BTree library available for Python or if we can even implement a simple BTree structure ourselves. Feel free however to further investigate the OOBTrees package, if you do I would be interested to see how exactly we would insert key/value pairs and persist the tree on disk. So it would be great if you could write a simple demo that defines a fully functioning BTree-based index class and contains some unit tests that we can run to check its correctness.

Good work!

jxieeducation commented 10 years ago

Thanks, I totally agree that this is not the library that we are looking for. Will be coming up with a simple BTree then.