dat-ecosystem-archive / datproject-discussions

a repo for discussions and other non-code organizing stuff [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

How does tabular data fit into the current roadmap? #33

Closed · flyingzumwalt closed this issue 8 years ago

flyingzumwalt commented 8 years ago

Some clarifying questions about the current work on Containers and its relationship to simpler use cases around tabular data.

My angle: DVC of Tabular Data

I’m interested in distributed version control of tabular data. The key interactions I’m interested in are:

Note: By listing git equivalents here I’m not necessarily suggesting that you should exactly replicate git’s cli API. It’s just what I’m familiar with. dat already (for good reasons) imitates git in a number of ways, so it makes sense to provide those examples.

Question: How does tabular data fit into the current roadmap around Containers?

In my understanding of the current conversations, when you say “container” you are talking about things that contain/represent an instance of an operating system, or a portion of one, with a specific configuration and sometimes a specific state. Example: a docker container.

I don’t have use cases that require version control of containers. I'm primarily concerned with DVC of tabular data and I think dat is currently the best tool for doing that. It's not clear from the current discussions where tabular data fits into the roadmap. The discussions around Containers and “Compiled Code as Data” are interesting to me, and I see how they apply to reproducibility of research, but I wonder how they relate to the simpler task of managing tabular data.

Specifically, I’m unclear about the relationship between dat-graph, Compiled Code as Data, and hyperfs.

How do these things fit together?

Question: Is the goal of this project drifting?

Confusion: ticket maxogden/dat#407 says the “main goal of dat is to provide a way to version control and sync code (containers) AND data (files)”. Where does tabular data fit into this? In the current version of dat, at least according to the documentation, there are two types of content: tabular data, which you pull in via dat import, and binary files, which you pull in via dat write. This shift to speaking in terms of "code (containers) AND data (files)" sounds different from the goal listed on the dat-data website:

“The high level goal of the dat project is to build a streaming interface between every database and file storage backend in the world. By building tools to build and share data pipelines we aim to bring to data a style of collaboration similar to what git brings to source code.”

Referring to git again: git is not docker, and people do not use git to manage docker containers. To run code from a git repository reproducibly, they sometimes use docker containers, but they do not expect git to handle that for them. It seems like you’re aiming for dat to do both. Is that true? If so, why?

Question: Does this new work relate to allowing a notion of a “working copy”?

Are you changing dat to have a notion of a “working copy”?

With git you have a working copy: the contents of your git repository manifested as a hierarchy of files & directories. If I modify, delete, or add a file within that hierarchy, I can run git diff and git will tell me where my working copy differs from the contents of the repository. I can then stage those changes with git add and commit them with git commit. This is extremely useful.
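For concreteness, the cycle looks like this (data.csv is just an example file name):

```sh
# edit data.csv in the working copy, then:
git diff                      # show where the working copy differs from the repo
git add data.csv              # stage the change
git commit -m "update data"   # record the change in the repository
```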

By contrast, dat does not have a notion of a working copy. Each dat repo lives in a dedicated directory that contains your data.dat and package.json files, but dat is not aware of anything else in the directory. Of course, you can fake a working copy by exporting the contents of your dat repository into the dat directory, but dat itself does not handle this for you.
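The manual round-trip looks something like this (a sketch; exact commands and flags depend on the dat version, and mydata.csv is a made-up file name):

```sh
# dump the repo's tabular contents into the directory by hand
dat export > mydata.csv
# ...edit mydata.csv; dat is unaware the file changed...
# push the edits back in by hand
dat import mydata.csv
```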

Is this work with hyperfs and tracking containers aimed at changing that? Is that the main goal of the work, or is the main goal of the work more expansive than that?

max-mapper commented 8 years ago

How does tabular data fit into the current roadmap around Containers?

It's on the back burner for us at the moment. It's a more complex use case than what we wanna focus on first, which is Dropbox-style file sync functionality.

Does this new work relate to allowing a notion of a “working copy”?

This is up in the air at the moment; we will be experimenting with different tradeoffs over the next couple of months as we keep hacking on the internals.

Overall, we're now trying to go back to basics and make something really simple and solid. We have explored a lot of advanced use cases, and now we can try to build something that will be a good foundation for those things in the future.

The main use case we wanna get right is syncing files based on a replicated Merkle DAG. Everything else (containers, tabular data) we're planning on tackling again later once we have something nice and simple.
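(For readers unfamiliar with the approach, a minimal illustrative sketch of why a Merkle DAG suits sync follows; this is not dat's actual data structure. Nodes are addressed by the hash of their content, so two peers can compare root hashes and transfer only the subtrees that differ.)

```js
// illustrative only: not dat's real format
const crypto = require('crypto')

const hash = (buf) => crypto.createHash('sha256').update(buf).digest('hex')

// leaves hold file bytes; parents link to children by hash
const chunk = Buffer.from('hello world')
const dir = { links: [hash(chunk)] }

// if two peers compute the same root hash, their trees are identical
// and nothing needs to be transferred
console.log(hash(Buffer.from(JSON.stringify(dir))))
```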

flyingzumwalt commented 8 years ago

Wow. I'm glad I asked. This brings up some important questions. I'll try to pose them as "yes/no" and "either/or". I want to clarify my understanding of your intentions without forcing you to go into detail about stuff that's in flux.

In short, my core question is: are you simply focusing your forward momentum on file sync while still embracing tabular data, or are you completely abandoning tabular data needs/interests? I think the work you've done with tabular data is really important. I want to put time into building out its uses in real-world scenarios, and I suspect that many other people have already been doing that. It would be a shame to lose that momentum.

Clarification: Is this switch to dat-graph and the Merkle DAG a rewrite, or is it just switching out part of the implementation? Sometimes you talk about dat-graph and the Merkle DAG as if it's just a refinement of what you're already doing, but other times it sounds like a rewrite. Which is it?

Q: Will you maintain the existing baseline of functionality for tabular data, or will you be abandoning that functionality and rebuilding from scratch with a focus exclusively on Dropbox-style file sync? In other words, after you switch to dat-graph, will I still be able to import tabular data into the Merkle DAG using commands like dat import, export that data using dat export, etc.? Will dat continue to have that functionality?

Q: Do you need someone to step up and champion the tabular data uses? Is this an opportunity to say "hey, if people out there want to use dat in that way, they should step up and start committing code"?

Q: Do you want more committers? Would it mess things up if committers whose priority is tabular data start piling on while your priority is file sync?

Q: Is your goal for dat to eventually hold file, container and tabular data together in one graph? It seems like you think of tabular data, "file" content and "containers" as a continuum of data that could ultimately all coexist in a single Merkle DAG or a network of DAGs that reference each other. That's ambitious, but it makes sense if most of the use cases you've seen call for it. Is that the direction you're heading?

jbenet commented 8 years ago

It seems like you think of tabular data, "file" content and "containers" as a continuum of data that could ultimately all coexist in a single Merkle DAG or a network of DAGs that reference each other

i really really really wish @maxogden thought about it this way. This is the basic thought behind ipfs, and i have been trying to get Dat to interop with IPFS for a year. have you changed your mind on this @maxogden? You can see our latest graph over here:

It's extremely simple: all it defines is a way to do a merkle link; everything else is the json data model (but stored as cbor to be binary packed, and seekable). AND you can actually store it as protobuf if you want to, there's a way to do it (but in that case the hashes won't match, unless we agree on which format to represent the hash in). there is a canonical cbor, so this is crypto friendly. I don't recall if protobuf is canonical or depends on the implementation, but i wouldn't be surprised if it depends on the implementation.
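(To illustrate the shape being described: a node is an ordinary json document, and a merkle link is just a value under a special key that points at another node's hash. The key name follows the draft spec of the time and the hash is made up, so treat this as illustrative.)

```json
{
  "name": "my-dataset",
  "rows": {
    "mlink": "QmExampleHashOfTheRowData"
  }
}
```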

We are about to pull in this change to our data format, so if you would like to work with us on this, please surface any complaints / design desires / etc. now, and i'm sure we can address them before merging the new thing. I really do not understand at all why this one is not good enough for your use cases. do you have concrete design concerns that we can outline and then find agreement or disagreement on? right now it's a black box to me.

okdistribute commented 8 years ago

In the new version, containers are also just files.

It will be nice to have help with the tabular use case once we are there, but yes, import and export are going away.

@jbenet I don't see how ipld is optimized for forks and merges

mafintosh commented 8 years ago

@jbenet this is our current graph format, https://gist.github.com/maxogden/9ebd17dc839f065d12f6#graph-format - it's pretty basic since our graph requirements aren't that advanced: we just need nodes to be able to link to other nodes.
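(Schematically, a node in such a format looks something like the following; field names here are illustrative, not quoted from the gist.)

```json
{
  "key": "hash-of-this-node",
  "links": ["hash-of-parent-1", "hash-of-parent-2"],
  "value": "opaque payload"
}
```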

i don't understand why the js cbor/jsld api is async on decode. that would make integration hard for us as we rely heavily on sync decoding/encoding.
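(To make the concern concrete, the two call shapes are sketched below; decode and decodeSync are hypothetical names, not the actual module's API.)

```js
// hypothetical call shapes, not the real cbor/jsld module
function decodeSync (buf) { return JSON.parse(buf) }   // sync: plain return value
function decode (buf, cb) {                            // async: callback
  process.nextTick(() => cb(null, JSON.parse(buf)))
}

// sync fits naturally into synchronous internals:
var node = decodeSync('{"links":[]}')

// async forces every caller above it to become async as well:
decode('{"links":[]}', function (err, node2) {
  if (err) throw err
  // ...continue in a callback...
})
```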

mafintosh commented 8 years ago

@jbenet side note, where can i read the ipld spec referenced by the repos? then i can give a more concrete answer on what we need - i'm not really sure how it works currently

jbenet commented 8 years ago

(sorry for delay-- flying around)

@mafintosh

i don't understand why the js cbor/jsld api is async on decode. that would make integration hard for us as we rely heavily on sync decoding/encoding

We can make a sync version just fine; not sure why it was async. But this is a trivial change. Filed an issue.

where can i read the ipld spec referenced by the repos? then i can give a more concrete answer on what we need - i'm not really sure how it works currently

yep, you're right, we hadn't written this up. I took a stab here -- https://github.com/ipfs/specs/blob/ipld-spec/merkledag/ipld.md -- specifically https://github.com/ipfs/specs/blob/ipld-spec/merkledag/ipld.md#linking-between-nodes -- suuuper basic! -- the goal is to support the simplest json docs ever, so that users can make very simple datastructs, with their own structure and their properties in the links. can you take a look and let me know what is unclear to you (this is a first-pass writeup)?
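(An illustrative example of "properties in the links": the link object can carry user-defined metadata alongside the hash. Key names and values here are made up.)

```json
{
  "subdir": {
    "mlink": "QmExampleHashOfSubdir",
    "size": 4096,
    "mode": "0755"
  }
}
```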

Notes:

jbenet commented 8 years ago

@jbenet I don't see how ipld is optimized for forks and merges

@karissa it's optimized to let you define + optimize for what you want. See:

(note also that formats can be different, which helps a lot of the science folks who will want ways to define / export graphs in JSON, JSON-LD, RDF, YML, and a variety of other formats.)

jbenet commented 8 years ago

Anyway, sorry. We can continue discussing this elsewhere so as not to hijack @flyingzumwalt's thread.

flyingzumwalt commented 8 years ago

That's ok @jbenet. I think @karissa basically answered my questions, but I'm completely fuzzy about the time frame. "once we get there" and "on the back burner" can mean anything from "we'll be back on it next month" to "Man, we're never going back to slay that gnarly beast." That makes it hard for me to know how & when to engage, or how proactive to be about drumming up committers.

okdistribute commented 8 years ago

We're going to go with a more git/github-style approach with respect to tabular data. That means file viewers can get fancy with the data, but we won't assume any file types.

flyingzumwalt commented 8 years ago

:+1:
