dat-ecosystem-archive / datproject-discussions

a repo for discussions and other non-code organizing stuff [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

Issues with “Why not just use git?” #44

Open joehand opened 8 years ago

joehand commented 8 years ago

From @wking on June 2, 2014 22:30

In the what-is-dat docs (https://github.com/maxogden/dat/blob/master/docs/what-is-dat.md#why-not-just-use-git) you list some reasons not to use Git:

  • Large numbers of commits add significant overhead. A repository with millions of commits might take minutes or hours to complete a git status check.

I expect git status times to scale with working directory size, not number of commits. I don't have benchmarks to back this up though.

  • To quote Linus Torvalds, "Git fundamentally never really looks at less than the whole repo", e.g. if you have a repository of a million documents you can't simply clone and track a subset.

You can if you use submodules. Although that's still publisher-driven subsetting, and not consumer-driven subsetting.

  • git stores the full history of a repository. What if you only want to store the latest version of each row in a database and not a copy of every version? This needs to be optional (for disk space reasons).

You can if you use shallow clones.
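
To make the two workarounds above concrete, here is a minimal sketch with stock git; the repository URLs and paths are placeholders, not anything from the dat docs:

    # Shallow clone: fetch only the most recent history (here, one revision deep).
    git clone --depth 1 https://example.com/big-dataset.git

    # The history can be deepened later if it turns out to be needed.
    cd big-dataset && git fetch --unshallow

    # Subsetting via submodules is publisher-driven: the publisher must already
    # have split the data into separate repositories, which consumers then pick from.
    git submodule add https://example.com/big-dataset-2014.git data/2014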

I agree that a data-centered version control system can probably make optimizations that aren't available to the source-code-focused Git. However, I'd like to see that discussed (e.g. why are dat commits lighter weight? Or does dat not have commits at all?) instead of attributing dubious technical limitations to Git. Of course, maybe your claims are valid, in which case they might just need a bit more supporting material to convince folks like me ;).

Copied from original issue: maxogden/dat#121

joehand commented 8 years ago

From @thadguidry on June 2, 2014 23:56

I don't think the contributors will waste their energy on convincing others, Trevor. Good troll try however.


joehand commented 8 years ago

From @waldoj on June 2, 2014 23:59

@wking, do you know of any actively used, large-scale, Git-backed data repositories that support storage of only selected versions or subsets of repositories?

joehand commented 8 years ago

From @wking on June 3, 2014 0:7

On Mon, Jun 02, 2014 at 04:56:30PM -0700, Thad Guidry wrote:

I don't think the contributors will waste their energy on convincing others, Trevor.

Hey, if it scratches your itch ;). I'm just pointing out things that look inaccurate to me while I try to put my finger on what the itch is. Maybe a pull request removing the technical limitations and pointing to dat's technical notes [1] would be more productive than just pointing out holes. I couldn't find docs explaining how merging and conflict resolution are handled. It looks like (from [1]) the dat data model is just a collection of protocol buffers [2], and I'm not sure how that works with distributed development. Or maybe dat isn't about distributed development, and is more about centralized development with easy, distributed, post-processed consumption?

joehand commented 8 years ago

From @wking on June 3, 2014 0:12

On Mon, Jun 02, 2014 at 04:59:20PM -0700, Waldo Jaquith wrote:

@wking, do you know of any actively used, large-scale, Git-backed data repositories that support storage of only selected versions or subsets of repositories?

Nope. But I don't know of any actively used, large-scale distributed data repositories of any type. Ah, I see from the slides that diffing/merging is out of scope [1].

joehand commented 8 years ago

From @waldoj on June 3, 2014 0:27

Nope. But I don't know of any actively used, large-scale distributed data repositories of any type.

And yet there's a strong demand for such software. (I'm speaking at tomorrow's U.S. Department of Transportation "Datapalooza" event, at which the need for such software looms large in the minds of many attendees.) Why do you think people aren't using Git for this, if it's as straightforward a proposition as you've explained that you believe it to be?

joehand commented 8 years ago

From @wking on June 3, 2014 0:56

On Mon, Jun 02, 2014 at 05:27:57PM -0700, Waldo Jaquith wrote:

And yet there's a strong demand for such software. (I'm speaking at tomorrow's U.S. Department of Transportation "Datapalooza" event, at which the need for such software looms large in the minds of many attendees.) Why do you think people aren't using Git for this, if it's as straightforward a proposition as you've explained that you believe it to be?

I think that diffing/merging CSV files in Git is not easy, and if you only have a single publisher, there's no benefit to using Git.

On the distributed database front, notmuch uses Git for distributing message tags [1], and I've seen a number of other projects do something like that. There's also a Git-backed mailstore in ssoma [2]. I haven't used ssoma at all, and I haven't found an nmbug repository I'd classify as “large”, so I'm not sure how these will scale. But at least with nmbug, Git is only used for synchronization. Fast access is handled by Xapian, using whatever nice index format it likes internally.

I've never used Git for tracking real-time data, and I would be concerned about the per-commit header overhead. However, without commits, collaborative development is going to be tricky (as far as I understand it), and I don't know what I'd change except culling fields I didn't think were critical. Maybe you could have a chain of lightweight messages, with an additional heavyweight commit that provided some sort of aggregate metadata for the preceding messages (allowing you to publish a sliding commit that grew until the metadata changed)? Not super elegant.

In any case, I'm sure you've all thought about this longer than I have. I'm just looking for a pointer that lays out the model for what sort of information dat is pushing around (table entries with implicit schemas?) and how it handles distributed development, e.g. pushing temperature readings from Alice's thermocouple and wind speeds from Bob's anemometer, with fixes from Charlie (Bob's anemometer was miscalibrated in January). Maybe it's just so Dan can see what's changed since Alice, Bob, and Charlie got together and released February's data?
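
As a sketch of why that first point is awkward: Git can be pointed at an external, table-aware differ for CSV files, but only through per-repository configuration, and merging needs a separate driver again. The csvdiff and csvmerge commands below are placeholders for whatever comparison tool you prefer, not real programs this thread depends on:

    # Tell Git that *.csv files use custom diff and merge drivers named "csv".
    echo '*.csv diff=csv merge=csv' >> .gitattributes

    # The diff driver is an external command; Git passes the old and new file
    # versions among its arguments (csvdiff is a placeholder name).
    git config diff.csv.command 'csvdiff'

    # The three-way merge driver is configured separately; %O, %A, and %B expand
    # to the ancestor, current, and other versions (csvmerge is a placeholder).
    git config merge.csv.driver 'csvmerge %O %A %B'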

joehand commented 8 years ago

From @jortizcs on June 3, 2014 15:3

I know the developers of this project are probably too busy to spend time answering the questions posed by @wking, but I think they are quite valid and merit a deeper discussion. It's unclear what dat is particularly good at that git cannot handle -- even if it's some obscure feature in git. What is fundamentally different about dat? Where does it succeed where other version control systems fail, particularly in comparison to git? Partial checkouts are not a new feature (Subversion supports them). @wking actually does a nice job of listing his questions, and I'm sure many others following this project would like to get a good technical summary/explanation of them from the team.

What's not so useful is pointing to the demand from "Datapalooza" attendees or backing from the US Department of Transportation as reasoning behind the design decisions. It's also not useful to accuse @wking of trolling when his questions are perfectly valid, well articulated, and important for some of us following this project to get a deeper understanding of dat.

joehand commented 8 years ago

From @knowtheory on June 3, 2014 15:19

@jortizcs Referencing the Datapalooza was actually putting the onus back on @wking to explain why git has not filled this niche (which, granted, is both a large and in many ways non-technical question).

That may be a sufficient reason for dat to exist by itself: if this particular community is ill served by git (as much as I <3 git, it is essentially a power tool with a manual written in Greek; people have and will continue to lose fingers and limbs), then maybe it's worth having a tool which makes different tradeoffs.

Even if you find the first claim dubious (or at least quibble-able, in my case), @wking's responses to the latter two points are essentially a concession to the reasons why git is ill-suited to storing data sets:

  1. Submodules require up-front design decisions about the architecture of how people will access your data.
  2. Your choices with git are to get a snapshot or to get the full repository (see the sketch below).

None of this speaks to whether dat appropriately addresses those problems, but the desirability of the first point at least is self-evident. (I am unclear under what conditions you would only want partial history, except in conjunction with the first point.)
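
A minimal sketch of that snapshot-or-everything choice with stock git; the URL is a placeholder, and git archive --remote only works against servers that enable upload-archive:

    # Snapshot: export the current tree with no history at all.
    git archive --remote=git@example.com:big-dataset.git HEAD | tar -x

    # Everything: clone the full repository, history and all.
    git clone git@example.com:big-dataset.git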

joehand commented 8 years ago

From @jortizcs on June 3, 2014 15:35

@knowtheory Fair enough. I hope the discussion continues. Although I think dat might be a very useful tool -- I deal with large scientific data sets quite often and I've tried, unsuccessfully, to use git to keep history and do version control -- I'm not entirely convinced that I couldn't "simply" combine certain features of SVN and Git to get what I need. I'm also not completely sure there isn't some obscure feature in git or some other tool that might solve the problems I'm having with large-dataset versioning. Perhaps dat does a really nice job of these things, which is why I'm interested to see where you folks are taking it, and I'm quite interested in playing with the tool myself. I don't think you have to convince the people following this project that there are problems with large data sets and versioning systems that are worth solving -- at least not me. It's just not clear that the problem(s) you set out to solve with your design are the problem(s) we have.

I would like to see what problem(s) dat is particularly good at solving, as @wking stated. Perhaps reading through the source code might help, but maybe this discussion could lead to a more detailed technical document or blog post that thoroughly goes through what dat is designed to do well.

joehand commented 8 years ago

From @wking on June 3, 2014 17:31

On Mon, Jun 02, 2014 at 05:27:57PM -0700, Waldo Jaquith wrote:

And yet there's a strong demand for such software.

I'm still trying to wrap my head around what “such software” is actually doing ;). Digging through past issues, here is my current understanding of dat's architecture:

Here's my test script:

    #!/bin/sh
    #
    # With dat v4.5.2 (since then 'dat serve' has become 'dat listen').
    #
    # The local JSON variables avoid:
    #
    #   stream.js:94
    #     throw er; // Unhandled stream error in pipe.
    #           ^
    #   Error: EPIPE, write
    #
    # when head closes the pipe after reading the first line from cat.

    rm -rf alice bob charlie &&
    mkdir alice &&
    (cd alice &&
      dat init &&
      echo '{"date": "2014-06-02", "temperature": 62.5}' | dat --json &&
      echo '{"date": "2014-06-03", "temperature": 64.1}' | dat --json
    ) &&
    (
      (cd alice && dat serve) &
      sleep 5 &&
      echo "Bob clones" &&
      dat clone http://localhost:6461 bob &&
      echo "Charlie clones" &&
      dat clone http://localhost:6461 charlie &&
      echo "Bob edits the temperature" &&
      (cd bob &&
        JSON=$(dat cat) &&
        echo "${JSON}" | head -n1 | sed -s 's/62.5/60.5/' | dat --json &&
        dat push http://localhost:6461 &&
        echo
      ) &&
      echo "Charlie edits the date" &&
      (cd charlie &&
        JSON=$(dat cat) &&
        echo "${JSON}" | head -n1 | sed -s 's/2014-06-02/2014-06-01/' | dat --json &&
        dat push http://localhost:6461 &&
        echo
      ) &&
      killall node &&
      echo "Alice" && (cd alice && dat cat) &&
      echo "Bob" && (cd bob && dat cat) &&
      echo "Charlie" && (cd charlie && dat cat) &&
      echo 'done'
    )

And here are the results:

    $ ./test.sh
    Initialized dat store at /tmp/z/alice/.dat
    1 rows imported in less than a second
    1 rows imported in less than a second
    Listening on http://localhost:6461
    Bob clones
    Pulling from _changes has completed.
    Loaded dat store at bob/.dat
    Charlie clones
    Pulling from _changes has completed.
    Loaded dat store at charlie/.dat
    Bob edits the temperature
    1 rows imported in less than a second
    1 rows pushed in less than a second
    Charlie edits the date
    1 rows imported in less than a second
    0 rows pushed in less than a second
    Alice
    {"_id":"2014-06-03T17:14:49.728Z-de945cb9","_rev":"2-22b50c8573a45e3b7ef52a475d067ac9","date":"2014-06-02","temperature":"60.5"}
    {"_id":"2014-06-03T17:14:53.396Z-0bb804b5","_rev":"1-a5f28724acfe3e571b875922eff1a0cc","date":"2014-06-03","temperature":"64.1"}
    Bob
    {"_id":"2014-06-03T17:14:49.728Z-de945cb9","_rev":"2-22b50c8573a45e3b7ef52a475d067ac9","date":"2014-06-02","temperature":"60.5"}
    {"_id":"2014-06-03T17:14:53.396Z-0bb804b5","_rev":"1-a5f28724acfe3e571b875922eff1a0cc","date":"2014-06-03","temperature":"64.1"}
    Charlie
    {"_id":"2014-06-03T17:14:49.728Z-de945cb9","_rev":"2-99e4b4704027bb13f3ca92d58f22dfc2","date":"2014-06-01","temperature":"62.5"}
    {"_id":"2014-06-03T17:14:53.396Z-0bb804b5","_rev":"1-a5f28724acfe3e571b875922eff1a0cc","date":"2014-06-03","temperature":"64.1"}
    done

So it looks like, compared to Git, you lose:

And you gain:

I haven't had time to dig into the partial-clone/filtering functionality yet.

joehand commented 8 years ago

From @wking on June 3, 2014 18:13

On Tue, Jun 03, 2014, Ted Han wrote:

Referencing the Datapalooza was actually putting the onus back on @wking to explain why git has not filled this niche (which granted is both a large, and in many ways non-technical question).

Sorry if it's taking me a while to get more specific. I'm still trying to wrap my head around the niche that we're addressing here ;).

maybe it's worth having a tool which makes different tradeoffs.

Definitely. And I'd love a “Why not just use git?” that addressed those tradeoffs. I just don't think the current docs are hitting that nail squarely on the head.

Submodules require up-front design decisions about the architecture of how people will access your data.

Maybe not up-front (filter-branch is a wonderful thing ;), but certainly publisher-side decisions. If you're a government agency that doesn't want to bother sorting out valid user requests from the noise, I can see how offloading this decision-making to the consumer would be a good thing. You're still going to want server-side filtering, though; otherwise you're back in the “I'll just clone this and filter-branch it myself” boat.
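
For readers who haven't seen that trick, a minimal sketch of publisher-side subsetting with filter-branch; the URL, directory, and branch names are placeholders:

    # Rewrite history so only the chosen subdirectory remains, producing a
    # smaller repository that consumers can clone directly.
    git clone https://example.com/big-dataset.git subset &&
    cd subset &&
    git filter-branch --prune-empty --subdirectory-filter data/2014 master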

Your choices with git are get a snapshot or get the full repository.

Git does give you lots of flexibility on the depth of your history (which is independent of the full/partial tree issues discussed above). Just 'clone --depth …' with the number of revisions you'd like.

None of this speaks to whether dat appropriately addresses those problems,

What the actual problems are and how dat addresses them is the information I'm digging for ;).

but the desirability of the first point at least is self-evident

Agreed.

(i am unclear under what conditions you would only want partial history, except in conjunction with the first point).

I think the use case is “I don't care (much) about what happened in the past for this project. Just give me the current state (with maybe a tiny bit of historical context) so I can use this baby! And make it easy for me to continue to stay up to date as the project advances.” For example, if I want to patch Emacs because it doesn't detect new XTerms, but don't want all ~130k commits. I usually clone the whole repository anyway, because doing some archeology in the commit history helps me avoid repeating past mistakes ;).

joehand commented 8 years ago

From @mitar on December 21, 2014 1:41

I would add one more question: if git has limitations that make it unsuitable for data, why not improve git? I think there are many actors in the git space who would be interested in that. For example, GitHub is used more and more for data as well. Better support and features for data in git would be useful for everyone.

I was once told that the end goal of any mission-driven non-profit team should be to make itself obsolete (mission accomplished). What could be better than getting the necessary improvements pushed into git itself, and then we are done. :-)

joehand commented 8 years ago

From @paulharris on June 18, 2015 2:43

daff + git can diff and merge CSV files.

I think this thread has valid questions, because I am deciding between git and dat for storing and merging some data.
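
For anyone else weighing that option: as I read daff's documentation, it can register itself as Git's diff and merge handler for CSV files. Treat the exact commands below as an assumption to verify against daff's README rather than a tested recipe:

    # daff is an npm package; install it globally.
    npm install -g daff

    # Inside a repository, ask daff to wire itself up as the diff/merge driver
    # for *.csv files (assumed helper; check the README for the exact syntax).
    cd my-data-repo && daff git csv

    # After that, ordinary git commands show row- and cell-level changes:
    git diff data/readings.csv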

joehand commented 8 years ago

From @karissa on June 19, 2015 0:19

Git has some fundamental problems with the way it works under the hood. From the whitepaper:

"Git was designed to track changes to relatively small text files, so many datasets, especially those that change often, are cumbersome to use with Git. Two years ago, Facebook created a test Git repository to explore its limits, and found that at 1.3 million files with 15 GB of data, Git takes upwards of 40 minutes to respond to a simple command [7]. To fix this performance problem with Git, there is an extension called the 'Git Large File Storage (LFS)' released in 2015 [8], and Git-Annex [9], written in Haskell. We believe these extensions are not designed well for data science use cases, which often require support for features like real-time data feeds, partial replication, or parallel downloading."

Git also doesn't version by row -- it isn't table-aware.
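
To illustrate what "table-aware" buys you: Git's line-oriented diff treats a re-sorted CSV export as a near-total rewrite, while a row-keyed comparison reports only the rows that actually changed. A rough sketch with ordinary shell tools (bash, for the process substitution); the file names are placeholders:

    # Line-oriented view: re-sorting the file makes almost every line "change".
    git diff --no-index old-export.csv new-export.csv

    # Row-oriented view: compare the two versions as sets of rows instead.
    comm -13 <(sort old-export.csv) <(sort new-export.csv)   # rows added or modified
    comm -23 <(sort old-export.csv) <(sort new-export.csv)   # rows removed or modified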

https://github.com/maxogden/dat/blob/master/docs/whitepaper.md

joehand commented 8 years ago

From @paulharris on June 19, 2015 3:4

This is how I'm reading it:

So, in dat's case, it solves the problem with two approaches:

1) Import into a table format, plus the ability to download changes. This is basically database + replication. But it requires the data to be in an importable tabular format, AND assumes the format can be diffed during import so that minimal changes are detected automatically.

2) The read/write system just keeps versions of the files. Are they treated as opaque binary blobs, or with block-level dedup? Hopefully that works when one character is added to the start of a file (i.e. the dedup can handle blocks becoming unaligned, like rsync's rolling checksums), or there are custom diff/merge drivers; otherwise the .dat folder gets very large over time.

So, for #1, a dataset is like a MariaDB database table with a push and pull interface (or, in CouchDB terms, a _changes replication stream), and it can do some queries and filters, like OPeNDAP.org.

For #2, it seems like a file store with historic snapshots for versioning: no diffing, merging, etc., i.e. for "attachments" that are replaced wholesale.

And #1 is where all the streaming, diffing, and merging happens.

The whitepaper seems to focus entirely on #1.

Did I read it correctly? Cheers, Paul


joehand commented 8 years ago

From @thadguidry on June 19, 2015 3:34

I would simply put the idea this way:

dat comes with a server and a set of client technologies that provide a Distributed Data Store (storage).

(It is unlike Git, BitTorrent, HDFS, etc., which provide distributed file storage, distributed file systems, or source/version control systems.)

dat lets you store and distribute versions of your data, while also allowing exploration, sharing, and updating of that data.

Why complicate the core philosophy with dissimilar backend goals that might be provided by various plugins? Just stick to explaining the core dat philosophy and you're golden.

mitar commented 5 years ago

@Vedikaledange Nice spam and link placement. I would report it if I could.

millette commented 5 years ago

@mitar I reported the user; the comment and account are now gone.

joehand commented 5 years ago

Thanks @millette + @mitar !