discoproject / discodb

An efficient, immutable, persistent mapping object
http://discodb.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

Conceptual Question #9

Open mrjjwright opened 9 years ago

mrjjwright commented 9 years ago

Hi,

I am looking for an efficient, persistent, immutable key-value store structure to use in an application I am building for the Mac. Whenever the user changes any key's value in my app, I want to create a new immutable record that links back to the previous one. I don't want to rely too much on language-level persistent data structures (e.g. as found in Clojure or Immutable.js); I want to work solely with data, with a guarantee that everything is persisted efficiently on disk. I like the look of discodb (I would have to create an Objective-C wrapper around it) and am trying to grok it more. I didn't find anything in the documentation about mutating data. Is it OK and efficient to create a new version of the database each time, with, I assume, some naming convention for older versions?

bauman commented 9 years ago

You will pay a considerable data I/O penalty, and possibly a CPU penalty, by doing so.

Data is stored randomly in the blob. Creating a new blob using data from the old blob will require a full copy of the old blob, which will likely induce random seeks across the old blob. At minimum, you'll need to memory map the whole blob each time.
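To make the cost concrete, here is a minimal sketch of the "new version per change" idea. It does not use discodb: a JSON file stands in for the blob, and the function names and the `store.vNNNNNN.json` naming convention are hypothetical. It only illustrates that changing one key still means reading and rewriting every entry of the previous version.

```python
import json
import os
import tempfile


def write_version(directory, version, mapping, parent=None):
    """Write one snapshot to disk. The JSON layout and file name are
    stand-ins for a discodb blob, chosen only for illustration."""
    path = os.path.join(directory, f"store.v{version:06d}.json")
    with open(path, "w") as f:
        json.dump({"parent": parent, "data": mapping}, f)
    return path


def next_version(directory, old_path, version, changes):
    """Build a new snapshot from an old one plus a few changed keys.
    Returns the new path and the number of entries that were copied."""
    with open(old_path) as f:
        old = json.load(f)["data"]  # the whole old snapshot is read
    new = {}
    copied = 0
    for key, value in old.items():  # every entry is copied, changed or not
        new[key] = value
        copied += 1
    new.update(changes)
    parent = os.path.basename(old_path)  # link back to the previous version
    return write_version(directory, version, new, parent=parent), copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        v1 = write_version(d, 1, {f"key{i}": "x" for i in range(1000)})
        v2, copied = next_version(d, v1, 2, {"key7": "y"})
        print(os.path.basename(v2), "copied entries:", copied)
```

The work per change grows with the size of the store rather than with the size of the change, which is the penalty described above.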

Depending on where you place the wrapper, you will be doing a lot of type casting if you build wrappers. The current Python wrapper performs a C string to Python string conversion for every key and every value on read. That conversion is computationally expensive compared to the other work the library does.

A full copy (using the release build) would cast blob>cstr>String>cstr>blob for every key and every value. You may incur additional string copies in your application depending on how you pass strings around. Python passes strings by value by default, so a Python application likely has 2 additional string copies.
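The conversion chain can be modeled in plain Python as a rough sketch. This is not discodb's wrapper code: `copy_with_decoding` is a hypothetical helper, and `bytes` and `str` stand in for the C string and the wrapper-level string. It just counts how many conversions a full copy performs when every key and value is decoded on read and encoded again on write.

```python
def copy_with_decoding(raw_pairs):
    """Copy (key, value) byte pairs through str, the way a wrapper that
    decodes on read and encodes on write would. Returns the copied
    pairs and the number of conversions performed."""
    conversions = 0
    out = []
    for raw_key, raw_value in raw_pairs:
        # bytes -> str on read (one conversion each for key and value)
        key = raw_key.decode("utf-8")
        value = raw_value.decode("utf-8")
        conversions += 2
        # str -> bytes on write (one conversion each for key and value)
        out.append((key.encode("utf-8"), value.encode("utf-8")))
        conversions += 2
    return out, conversions


if __name__ == "__main__":
    pairs = [(f"key{i}".encode(), b"value") for i in range(1000)]
    copied, conversions = copy_with_decoding(pairs)
    print("pairs:", len(copied), "conversions:", conversions)
```

A wrapper that could hand raw bytes from the old blob straight to the new one would skip these conversions, at the cost of a less convenient interface.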

Moral of the story: random key access will be extremely efficient, but full blob copies will be slow. Only you know how often you will be doing either operation.