@mafintosh @maxogden and I had a conversation a few days ago about data models for the beta release. We're interested in satisfying a number of constraints in order to support several important features. A few stand out:

- multi-master replication
- checkout/rollback
- data integrity and efficient synchronization
- forking/merging
I'll go over the rough implications of these for the model so that the proposed changes can be properly discussed and grokked. First, it's worth describing the data model as it stands in dat alpha.
As of 68ae983c7, the data model in level-dat (dat's default backend) consists of three components.
When you run `dat cat`, dat uses the current table to quickly return the current version of the data.

| table | key | value |
|-------|-----|-------|
| current | +c +namespace +key | version |
| data | +d +namespace +key +version | object (encoded in protobuf format) |
| log/change | +s +log_id | [log_id, key, from_version, to_version, namespace] |
Note that "+" is "ÿ" (aka \xff). You might see this if you're checking out the format via something like `superlevel .dat/store.dat createReadStream`.
The four constraints (multi-master, checkout/rollback, data integrity and efficient synchronization, and forking/merging) suggest a particular set of changes. One approach would be as follows:
| table | key | value |
|-------|-----|-------|
| current | +c +namespace +key | log_id |
| data | +d +namespace +key +log_id +parent_hash +branch | object (encoded in protobuf format) |
| log/change | +s +log_id | [log_id, key, [parent_log_ids,...], [parent_hashes,...], hash(new+[parent_hashes]), ns, branch] |
Appending the log_id in place of the version in the data table allows us to roll back quickly, and in an ordered data store (like leveldb) the last key in an object's range will correspond to the most recently seen version. A coherent sequence of versions can then be generated over the ordered history for each object rather than being stored and manipulated directly by users.
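A rough illustration of that property, assuming the proposed layout: because log_ids are appended to the data keys and leveldb keeps keys sorted, a reverse range read with `limit: 1` returns the most recent entry for an object. The helper name and key construction are hypothetical, not dat's API.

```js
// Sketch: fetch the latest entry for (namespace, key) under the proposed layout
// by reading the last key in the object's range. Not dat's actual implementation.
var SEP = '\xff'

function latestEntry (db, namespace, key, cb) {
  var prefix = SEP + 'd' + SEP + namespace + SEP + key + SEP
  var result = null
  db.createReadStream({ gte: prefix, lt: prefix + '\xff', reverse: true, limit: 1 })
    .on('data', function (row) { result = row })
    .on('error', cb)
    .on('end', function () { cb(null, result) }) // result.value is the protobuf-encoded object
}
```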
We extend the key space to include branches, but it isn't clear to me where these should fall in the keys. For instance, they could be appended behind the namespace, but that would limit the number of branches/masters that could be handled efficiently, because each lookup would have to check every branch/master the repository was aware of (see the sketch below).
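To make that concern concrete, here is a sketch (hypothetical helpers, not dat's API) of what a lookup would look like if the branch were placed after the namespace: every read fans out over every branch the repository knows about.

```js
// Sketch: if keys were laid out as +d +namespace +branch +key ..., resolving a
// single key means issuing one read per known branch. Illustration only.
var SEP = '\xff'

function getAcrossBranches (db, namespace, key, knownBranches, cb) {
  var pending = knownBranches.length
  var hits = []
  if (!pending) return cb(null, hits)
  knownBranches.forEach(function (branch) {
    var k = SEP + 'd' + SEP + namespace + SEP + branch + SEP + key
    db.get(k, function (err, value) {
      if (!err) hits.push({ branch: branch, value: value })
      if (--pending === 0) cb(null, hits) // O(number of branches) reads per lookup
    })
  })
}
```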
We record the parent_hash (the preceding hash in the Merkle tree for this object) as well as the source branch (or repo UUID) for each entry in the table. By recording the hashes in the Merkle tree, as well as the local log_ids of the parents of each object, the change log can be used to traverse the series of forks and merges through which each object's history passes.
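A minimal sketch of the per-object hash chain described above, assuming SHA-256 and simple concatenation (the actual hash function, encoding, and ordering are open details, not dat's confirmed scheme):

```js
// Sketch: compute hash(new + [parent_hashes]) for a log entry. Hash function
// and concatenation order are assumptions, not dat's actual scheme.
var crypto = require('crypto')

function entryHash (encodedObject, parentHashes) {
  var h = crypto.createHash('sha256')
  h.update(encodedObject)                            // the new protobuf-encoded object
  parentHashes.forEach(function (p) { h.update(p) }) // preceding hashes in the object's Merkle chain
  return h.digest('hex')
}

// A fork has one parent hash, a merge has several; either way the resulting
// hash lets peers verify an object's history during synchronization.
```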
What am I missing? How does this break? We're not going to be able to solve this without implementing and testing things, but hopefully a little discussion can get us closer to something generally optimal before a lot of time goes into testing various options.
just an update, major work is underway now in both http://github.com/maxogden/dat-core and https://github.com/maxogden/dat/tree/beta. we hope to release in the next couple of weeks
:rocket:
:shipit:
any updates?
beta branch has been merged into master branch. still some work to do before a release, but for now you can `npm install dat@7.0.0-pre` to test it out
we're shooting for a dat beta release in april 2015. note that this doesn't include all the stuff in the overall project roadmap
here's a link to the list of issues for the beta milestone
These are the repos where most of the work will be happening:
We are seeking feedback on our Beta APIs:
- CLI: https://github.com/maxogden/dat/blob/beta/beta-cli-api.md
- JS: https://github.com/maxogden/dat/blob/beta/beta-js-api.md
use this thread to discuss anything that doesn't fit in one of the issues above