dat-ecosystem / dat

:floppy_disk: peer-to-peer sharing & live synchronization of files via command line
https://dat.foundation
BSD 3-Clause "New" or "Revised" License

Consider atomic rather than row data #10

Closed Spikels closed 8 years ago

Spikels commented 11 years ago

A fundamental issue for any "git for data" is the core data structure used to store the data itself. It appears that you intend to use rows as your fundamental unit of data and are thinking of data as being "tabular". I may have misunderstood your plans.

While this is perfectly reasonable and even normal, I would like to suggest that you consider using "atomic" data as your fundamental unit. For example, with numerical data your unit of data would be a single number that is indexed to uniquely identify it, rather than an entire row of numbers with a corresponding set of indexes.

While this may seem like a minor change it has many implications, both positive (power) and negative (complexity), as the entire system is built on top of this data structure. Git, at least as it appears to the user, is built on files and in particular lines of files (diffs). While this works great for code it is terrible for data. Change a single number and the entire row is flagged as changed, leaving you to search for the change. Add a column and the entire file is changed. Even if all the data is exactly the same but the formatting is changed slightly, it is as if you are starting from scratch. At least for me this is what makes git unusable for data versioning (yes, I know there are configuration settings that help a bit).

If dat is indeed fundamentally row-based it will still suffer from many of the problems that make git so poorly suited for data. If instead dat treated data as atomic units of indivisible information (a number being the cleanest example), data versioning would only consist of tracking changes (i.e. additions, deletions or updates) to the data itself. This could potentially solve all of these issues by breaking the dependence on the format (i.e. representation) of the data.
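
To make that concrete, here is a minimal sketch (illustrative Python, not dat's API or data model): treat a table as a mapping from (row, column) coordinates to values, and a diff is then just the set of added, deleted, and updated cells. Adding a column touches only the new cells, and a pure formatting change touches nothing.

```python
# Minimal sketch of cell-level diffing -- illustrative only, not dat's API.
# A table is a dict mapping (row_id, column) -> value.

def diff_cells(old, new):
    """Return cell-level additions, deletions, and updates."""
    changes = []
    for key in sorted(new.keys() - old.keys()):
        changes.append(("add", key, new[key]))
    for key in sorted(old.keys() - new.keys()):
        changes.append(("delete", key, old[key]))
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            changes.append(("update", key, old[key], new[key]))
    return changes

v1 = {("r1", "A"): 1, ("r1", "B"): 2}
v2 = {("r1", "A"): 1, ("r1", "B"): 3, ("r1", "C"): 4}  # edit one cell, add a column
print(diff_cells(v1, v2))
# [('add', ('r1', 'C'), 4), ('update', ('r1', 'B'), 2, 3)]
```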

This is a quite complex issue with many subtle and not so subtle implications - a full discussion would be many pages. I hope this rather abstract comment will trigger some consideration of how best to both structure and think about the underlying data that dat will store. I would be glad to help work through the many issues involved, and perhaps at the end of the day rows are the best way to go.

This is an important but difficult project. Thanks for undertaking it. You have my full support. I hope I can be of help.

dominictarr commented 11 years ago

I was just talking to @maxogden and suggested exactly the same idea. We were discussing the possibility of replicating a data set to multiple locations, accepting writes in parallel, and then merging those writes back in. Representing data at the cell level would be helpful here, as each node could add a new cell without causing a conflict.

Also, storing data as cells would be much more flexible if columns are sparse.
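
A tiny sketch of that scenario (illustrative only, not dat's API): two nodes write to different cells of the same row in parallel, and at cell granularity the writes merge cleanly because their keys are disjoint.

```python
# Sketch: parallel writes from two nodes, keyed by (row_id, column).
# Disjoint cell keys mean the writes merge without conflict, and
# absent keys double as sparse columns.
node_a = {("r1", "price"): 9.99}   # node A fills in a price
node_b = {("r1", "stock"): 42}     # node B fills in a stock count
assert not (node_a.keys() & node_b.keys())  # no overlapping cells
merged = {**node_a, **node_b}
print(merged)  # {('r1', 'price'): 9.99, ('r1', 'stock'): 42}
```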

augusto-herrmann commented 11 years ago

That's exactly what I thought when @rgrp suggested using Git for data. If you add, transform or delete a column, the entire file would be flagged as changed. Row-based diffs would be almost useless for tracking data provenance. So +1 for using "atomic", or cell-based, units for tracking changes to data.

Mr0grog commented 11 years ago

Not that I entirely disagree with the notion here, but I think there are powerful counter-examples to this, for example: http://kssreeram.org/posts/2013/07/12/beware-of-sync-platforms.html

> This could potentially solve all of these issues by breaking the dependence on the format (i.e. representation) of the data.

I think it is worth keeping in mind that, as often as we'd like to think the format/representation of data and the values themselves are independent things, they aren't always. The point about Git being successful by being fundamentally based around lines is actually a great example of how the right way to treat a particular type of data can be extremely domain-specific.

Again, I'm not saying that making cells the fundamental atom of change is a bad idea. It's just that the right way to track and treat changes to data can be extremely domain-specific. There are myriad cases where independent changes to different parts of a row could be in conflict. Maybe there is room for a data set to define one of several levels on which to track changes, or possibly even a way to package up a custom merge strategy (that's possibly getting too complex, but it doesn't seem like an especially uncommon need).
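
For instance (a hypothetical sketch, not a proposed dat API), a cell-level merge can accept two edits to different cells of the same row and still produce a row that a domain-specific rule rejects:

```python
# Sketch of a pluggable, domain-specific merge check (hypothetical API).
# Ours and theirs each changed a different cell, so a naive cell-level
# merge accepts both -- but the combined row can still be inconsistent.
# (Edits to the *same* cell are out of scope for this sketch.)

def merge_row(base, ours, theirs, is_consistent):
    merged = dict(base)
    for col in merged:
        if ours[col] != base[col]:
            merged[col] = ours[col]
        elif theirs[col] != base[col]:
            merged[col] = theirs[col]
    if not is_consistent(merged):
        raise ValueError(f"domain conflict in merged row: {merged}")
    return merged

base   = {"city": "Portland", "zip": "97201"}
ours   = {"city": "Seattle",  "zip": "97201"}       # changed city only
theirs = {"city": "Portland", "zip": "97201-1234"}  # changed zip only

# Hypothetical domain rule: Seattle zip codes start with "98".
ok = lambda row: row["city"] != "Seattle" or row["zip"].startswith("98")
merge_row(base, ours, theirs, ok)  # raises: city and zip now disagree
```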

rufuspollock commented 11 years ago

@augusto-herrmann I should note that you can get diffs that highlight the sections of a line that have changed. That said, the whole reason my post went on about "line-oriented" data formats was so that line-based diffing and merging worked reasonably well (and there was a lengthy appendix about situations where it would not work well). However, on to the main topic here which, I think, is:

What is the best strategy for partitioning data to effectively do diff and merge?

The suggestion of atomizing at the right level is definitely correct but unfortunately it would seem that level would change from application to application - and your merge protocol will probably need to change (something that @Mr0grog points out).

The focus here seems to be the structure of your changeset, e.g. is it a list of changed rows or of changed cells. But there are a lot of other features to consider - cf http://www.dataprotocols.org/en/latest/revisioning-data.html. E.g. do you need to know parent changesets (are you trying to do distributed revisioning)? How does your merge protocol handle those changesets? Do you need to optimize for size of the revisioned DB, efficiency of processing, or something else?
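
To illustrate the parent-changeset point (a sketch with made-up structure, not a proposal): if each changeset records its cell-level changes plus the ids of its parents, history forms a DAG and a distributed merge can locate a common ancestor.

```python
import hashlib
import json

# Sketch: a content-addressed changeset with parent pointers (illustrative).

def make_changeset(parents, changes):
    body = {"parents": sorted(parents), "changes": changes}
    cs_id = hashlib.sha1(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return cs_id, body

root_id, _ = make_changeset([], [["add", ["r1", "A"], 1]])
left_id, _ = make_changeset([root_id], [["update", ["r1", "A"], 1, 2]])
right_id, _ = make_changeset([root_id], [["add", ["r1", "B"], 5]])
merge_id, merge = make_changeset([left_id, right_id], [])
print(merge["parents"])  # two parents: a merge node in the history DAG
```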

paulfitz commented 11 years ago

@augusto-herrmann Git's default line-based diff+merge algorithms can be replaced with ones that are column-aware (at least for visualization). Github could choose tomorrow to render csv diffs as something like http://paulfitz.github.io/coopyhx/ or individuals could choose to switch their diff/merge drivers locally to something more convenient for data (see http://share.find.coop/doc/tutorial_git.html for a proof of concept). I agree with @Mr0grog and @rgrp that merge is application dependent. The most helpful diffs/merges may for example be at the "model" level, one level of abstraction removed from tables, where relationships between a set of tables are understood - and this is quite doable when we have a schema. Not sure how relevant diffing+merging is to dat though, so far it is feeling more focused on unidirectional flow with transformation?

guiprav commented 11 years ago

@paulfitz I guess unidirectional flow implies centralization (single remote setting). When you have multiple remotes, you have to care about diffing & merging, right?


dominictarr commented 11 years ago

A common misconception about git (one that I had for a long time) is that git only stores the changes. Not true: git stores snapshots of an entire file, and only uses diffs when it's packing files to be sent (pulled/pushed) or stored with less space.

If your dataset scales into something larger than a text file - where a few thousand lines is considered very large - then you are gonna want much finer-grained units.

Maybe cells, maybe rows... what about "cell groups"? A cell group could have multiple cells, and you could combine one or more cell groups to make a row...

max-mapper commented 11 years ago

I'm in the implementation phase of dat now and am starting to prototype the initial revision/versioning system. Here's a diagram representing what atomic (i.e. 'cell'-level) versioning looks like:

[screenshot: diagram of cell-level versioning, 2013-08-26]

On disk a cell would look something like this:

"foo-1-A-1-7dd4b006883c4a491964ba3f6e1885c0" = {"type": "number", "value": "12345"}

the key format is table-row-cell-version-md5ofvalue

this will work for queries like "get the most recent version of each cell in each row". also if we store metadata for transformations, e.g. "this transformation changed cell A in row 5 from version 2 to version 3", then we could have "undo" functionality and show the data before and after a transform in real-time.
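
A rough sketch of that key scheme and query (hypothetical helpers, assuming no key component contains a dash; not dat's actual storage code):

```python
import hashlib

# Sketch of the table-row-cell-version-md5ofvalue key scheme.

def make_key(table, row, cell, version, value):
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return f"{table}-{row}-{cell}-{version}-{digest}"

def latest_versions(keys):
    """Newest version of every cell, keyed by (table, row, cell)."""
    newest = {}
    for key in keys:
        table, row, cell, version, digest = key.split("-")
        coord = (table, row, cell)
        if coord not in newest or int(version) > newest[coord][0]:
            newest[coord] = (int(version), digest)
    return newest

keys = [
    make_key("foo", "1", "A", 1, 12345),
    make_key("foo", "1", "A", 2, 99999),  # a later write to the same cell
]
print(latest_versions(keys))  # {('foo', '1', 'A'): (2, '<md5 of 99999>')}
```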

a table stored this way, serialized as a CSV, would look like this:

A,B,C,A_rev,B_rev,C_rev
12345,12345,12345,1-7dd4b006883c4a491964ba3f6e1885c0,1-7dd4b006883c4a491964ba3f6e1885c0,1-7dd4b006883c4a491964ba3f6e1885c0

versus a row being the atomic unit:

A,B,C,_rev
12345,12345,12345,1-a80c33bd5fc142d773c9e41a5a43cd6a

food for thought. the benefit of the md5sum is mostly an optimization -- it lets you check whether the data being written differs from the data on disk without having to read the full body of the cell and md5 it on every write. if we end up not storing md5s we will still probably need document-level revisions, e.g.

A,B,C,A_rev,B_rev,C_rev
12345,12345,12345,1,1,1
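
The skip-write check described above might look like this sketch (assuming, as in the keys earlier, that a rev has the form version-md5ofvalue):

```python
import hashlib

# Sketch of the skip-write optimization: compare the md5 of an incoming
# value against the digest embedded in the stored rev, so an unchanged
# cell never forces a read of the full cell body or a new version.

def needs_write(stored_rev, new_value):
    _, stored_digest = stored_rev.split("-")
    return hashlib.md5(str(new_value).encode()).hexdigest() != stored_digest

rev = "1-" + hashlib.md5(b"12345").hexdigest()
print(needs_write(rev, 12345))  # False: same value, skip the write
print(needs_write(rev, 12346))  # True: value changed, bump to version 2
```
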
max-mapper commented 11 years ago

more on revisions (my current thinking, subject to change, feedback is encouraged):

at the moment if you type dat cat inside a dat repo you get this:

{"_id":"a","val":"bar","_seq":2}
{"_id":"b","val":"bar","_seq":3}
{"_id":"c","val":"bar","_seq":4}
{"_id":"d","val":"bar","_seq":5}

each object is a row (in this example there is only one column), _seq is used internally for replication and _id is the key. I probably won't even include the _seq in future versions.

I'd like transforms to work something like this:

dat cat | some_transform.py | dat

meaning dat cat will write line-separated JSON objects - the newest version of each row - to stdout. the transform script will do whatever it wants to the data and write it back to the stdin of dat.

the revisions would work something like this: given a row at revision 1:

{"_id":"a","val":"bar","_seq":2, "_rev": "1-67d900b126089a27be593499c79d8743"}

if it gets passed through an uppercase transform, the transform should return this (transforms should never change underscore-prefixed keys):

{"_id":"a","val":"BAR","_seq":2, "_rev": "1-67d900b126089a27be593499c79d8743"}

then when dat receives the object it will handle updating the revision and store this:

{"_id":"a","val":"BAR","_seq":6, "_rev": "2-fd6ad7dd8fc4957375b1748836a83278"}

at the point when dat increments the revision it can choose whether or not to delete the old version, based on repo-level config.
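
under those rules, an uppercase some_transform.py could be as small as this sketch (one JSON object per line in, one per line out; underscore-prefixed keys pass through untouched):

```python
#!/usr/bin/env python
# Sketch of an uppercase transform for `dat cat | some_transform.py | dat`.
# Reads line-delimited JSON rows from stdin, uppercases every string value
# whose key is not underscore-prefixed (_id, _seq, _rev), writes rows back.
import json
import sys

for line in sys.stdin:
    row = json.loads(line)
    for key, value in row.items():
        if not key.startswith("_") and isinstance(value, str):
            row[key] = value.upper()
    sys.stdout.write(json.dumps(row) + "\n")
```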

audy commented 11 years ago

I agree that using line-based diffs on tabular data doesn't work. It would if you melted the table to row_id,column_id,value

For example,

cats,dogs
my_house,5,0
your_house,3,1

becomes

my_house,cats,5
my_house,dogs,0
your_house,cats,3
your_house,dogs,1

This would allow for line-based diffs on big tables as well as sparse tables.
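
A quick sketch of that melt (illustrative code, not part of dat; assumes a header row of column names and a row label in the first field of each data row):

```python
import csv
import io

# Sketch: melt a labeled table into (row_id, column_id, value) triples
# so each cell gets its own line and diffs independently.

def melt(csv_text):
    rows = list(csv.reader(io.StringIO(csv_text)))
    columns = rows[0]  # header names the value columns only
    for row in rows[1:]:
        row_id, values = row[0], row[1:]
        for column_id, value in zip(columns, values):
            yield row_id, column_id, value

table = "cats,dogs\nmy_house,5,0\nyour_house,3,1\n"
for triple in melt(table):
    print(",".join(triple))
# prints the four my_house/your_house lines above, one per cell
```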

max-mapper commented 11 years ago

also I'm not interested in having dat include any baked-in merge strategies; I view that as an application-level problem. I just wanna make sure the data model and APIs are flexible enough to let users define the behavior they'd like to see for their specific use cases

whereas couchdb stores conflicts and diverging branches of documents, I think that is too complicated and prefer a linear revision history (see also https://github.com/mikeal/couchup#couchup)

e.g. if you could define an "update transform" that gets the old version (or even all previous versions) and the new version of a row whenever a row is updated I think it would be a good start.
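
a sketch of what such a hook could look like (hypothetical API and names, not dat's):

```python
# Hypothetical sketch of an "update transform": a callback that receives
# the previous and incoming versions of a row whenever the row is updated.

def audit_update(old_row, new_row):
    changed = {k: (old_row.get(k), v) for k, v in new_row.items()
               if not k.startswith("_") and old_row.get(k) != v}
    print(f"row {new_row['_id']} changed: {changed}")
    return new_row  # a hook could also veto or rewrite the update

store = {}

def put(row, on_update=None):
    old = store.get(row["_id"])
    if old is not None and on_update is not None:
        row = on_update(old, row)
    store[row["_id"]] = row

put({"_id": "a", "val": "bar"})
put({"_id": "a", "val": "BAR"}, on_update=audit_update)
# prints: row a changed: {'val': ('bar', 'BAR')}
```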

okdistribute commented 8 years ago

After a whole year of feedback from the community, we recently published a new version of dat, which focuses on binary data. Would you all mind trying out the new dat with npm install -g dat? You can read about the new dat announcement on the website and how it works in the docs.