@mafintosh @maxogden and I had a conversation a few days ago about data models for the beta release. We're interested in satisfying a number of constraints in order to support several important features. A few stand out:

- multi-master replication
- checkout/rollback
- data integrity and efficient synchronization
- forking/merging
I'll go over the rough implications of these for the model so that the proposed changes can be properly discussed and grokked. First, it's worth describing the data model as it stands in dat alpha.
As of 68ae983c7, the data model in level-dat (dat's default backend) consists of three components.
When you run `dat cat`, dat uses the current table to quickly return the current version of the data.

| table | key | value |
|-------|-----|-------|
| current | +c +namespace +key | version |
| data | +d +namespace +key +version | object (encoded in protobuf format) |
| log/change | +s +log_id | [log_id, key, from_version, to_version, namespace] |
Note that "+" is "ÿ" (aka \xff). You might see this if you're checking out the format via something like `superlevel .dat/store.dat createReadStream`.
The four constraints (multi-master, checkout/rollback, data integrity and efficient synchronization, and forking/merging) suggest a particular set of changes. One approach would be as follows:
| table | key | value |
|-------|-----|-------|
| current | +c +namespace +key | log_id |
| data | +d +namespace +key +log_id +parent_hash +branch | object (encoded in protobuf format) |
| log/change | +s +log_id | [log_id, key, [parent_log_ids,...], [parent_hashes,...], hash(new+[parent_hashes]), ns, branch] |
Appending the log_id in place of the version in the data table allows us to roll back quickly, and in an ordered data store (like leveldb) the last key in an object's range will correspond to the most recently seen version. A coherent sequence of versions can then be generated over the ordered history for each object rather than being stored and manipulated directly by users.
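A rough illustration of that property, assuming the proposed layout: because log_ids are appended to the data keys and leveldb keeps keys sorted, a reverse range read with `limit: 1` returns the most recent entry for an object. The helper name and key construction are hypothetical, not dat's API.

```js
// Sketch: fetch the latest entry for (namespace, key) under the proposed layout
// by reading the last key in the object's range. Not dat's actual implementation.
var SEP = '\xff'

function latestEntry (db, namespace, key, cb) {
  var prefix = SEP + 'd' + SEP + namespace + SEP + key + SEP
  var result = null
  db.createReadStream({ gte: prefix, lt: prefix + '\xff', reverse: true, limit: 1 })
    .on('data', function (row) { result = row })
    .on('error', cb)
    .on('end', function () { cb(null, result) }) // result.value is the protobuf-encoded object
}
```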
We extend the key space to include branches, but it isn't clear to me where these should fall in the keys. For instance, they could be appended behind the namespace, but that would limit the number of branches/masters that could be handled efficiently, because each lookup would have to check every branch/master the repository was aware of (see the sketch below).
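To make that concern concrete, here is a sketch (hypothetical helpers, not dat's API) of what a lookup would look like if the branch were placed after the namespace: every read fans out over every branch the repository knows about.

```js
// Sketch: if keys were laid out as +d +namespace +branch +key ..., resolving a
// single key means issuing one read per known branch. Illustration only.
var SEP = '\xff'

function getAcrossBranches (db, namespace, key, knownBranches, cb) {
  var pending = knownBranches.length
  var hits = []
  if (!pending) return cb(null, hits)
  knownBranches.forEach(function (branch) {
    var k = SEP + 'd' + SEP + namespace + SEP + branch + SEP + key
    db.get(k, function (err, value) {
      if (!err) hits.push({ branch: branch, value: value })
      if (--pending === 0) cb(null, hits) // O(number of branches) reads per lookup
    })
  })
}
```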
We record the parent_hash (the preceding hash in the Merkle tree for this object) as well as the source branch (or repo UUID) for each entry in the table. By recording the hashes in the Merkle tree, as well as the local log_ids of the parents of each object, the change log can be used to traverse the series of forks and merges through which each object's history passes.
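A minimal sketch of the per-object hash chain described above, assuming SHA-256 and simple concatenation (the actual hash function, encoding, and ordering are open details, not dat's confirmed scheme):

```js
// Sketch: compute hash(new + [parent_hashes]) for a log entry. Hash function
// and concatenation order are assumptions, not dat's actual scheme.
var crypto = require('crypto')

function entryHash (encodedObject, parentHashes) {
  var h = crypto.createHash('sha256')
  h.update(encodedObject)                            // the new protobuf-encoded object
  parentHashes.forEach(function (p) { h.update(p) }) // preceding hashes in the object's Merkle chain
  return h.digest('hex')
}

// A fork has one parent hash, a merge has several; either way the resulting
// hash lets peers verify an object's history during synchronization.
```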
What am I missing? How does this break? We're not going to be able to solve this without implementing and testing things, but hopefully a little discussion can get us closer to something generally optimal before a lot of time goes into testing various options.
just an update, major work is underway now in both http://github.com/maxogden/dat-core and https://github.com/maxogden/dat/tree/beta. we hope to release in the next couple of weeks
:rocket:
:shipit:
any updates?
beta branch has been merged into master branch. still some work to do before a release, but for now you can `npm install dat@7.0.0-pre` to test it out
we're shooting for a dat beta release in april 2015. note that this doesn't include all the stuff in the overall project roadmap
here's a link to the list of issues for the beta milestone
These are the repos where most of the work will be happening:
We are seeking feedback on our Beta APIs:
- CLI: https://github.com/maxogden/dat/blob/beta/beta-cli-api.md
- JS: https://github.com/maxogden/dat/blob/beta/beta-js-api.md
use this thread to discuss anything that doesn't fit in one of the issues above