Questions about WebDB and Injest

vsivsi commented 6 years ago

1) I'm confused by the discussion of the distinction between the DatFS, Ingest and WebDB "layers" on the WebDB page. In particular, all links to Ingest redirect from https://github.com/beakerbrowser/ingestdb to https://github.com/beakerbrowser/webdb. Are these separate packages/things? Is this documentation out of date and everything has now been rolled into WebDB? 🤔

2) At a glance WebDB / IngestDB looks to be a reasonably well thought-through JS database implementation. But I wonder why you are taking this on (defining/implementing/maintaining yet another DB/Query Language/API on top of IndexedDB / LevelDB) when many other possible solutions exist, are much more mature, and have a ton of mindshare and tooling around them... I understand well that evaluating and debating the merits of one database over another can be a black hole, but the answer can't always be "we'll just avoid all that by creating yet another database project!" I have zero direct involvement with any other Web database projects, and so have no personal investment in your choice; but I have to ask: did you consider just adopting/extending something like PouchDB? Perhaps you did and found other available solutions lacking in some way (that writing a plug-in/extension couldn't solve), but I haven't found any discussion of that anywhere here.

It's clear that having some kind of robust database solution on top of Dat and easily accessible to Beaker Apps will be a huge win, so kudos for what you've done so far.

I'm also interested in the relationship you see between WebDB / IngestDB and the Dat team's own "roll your own" Dat backed JS database project HyperDB. There seems to be some significant overlap in vision with that as well.

Anyway, I'm not asking this intending to criticize, I'm just genuinely interested in why you decided to take on a sub-project in the context of Beaker that has the potential (over time) to be such a heavy lift, while not seeming to be the core value-add of what you are creating here. Beaker and the P2P Web discussion it is igniting are too exciting to have you guys bogged down maintaining yet another browser database!

Thanks for reading!

webdesserts commented 6 years ago

@vsivsi As far as question 1 goes, several things have changed since the initial proposal. WebDB was originally intended to be a built-in browser module that added some opinionated schemas. WebDB was going to be built around IngestDB, which would be a userland module. After a bit of debating, @pfrazee decided to move WebDB entirely into userland (for now anyway). As a part of this move, IngestDB was renamed to WebDB.

I can't answer for @pfrazee on question 2 but I would suppose they went with an IndexedDB wrapper since IndexedDB was already embedded in the browser and would be easier to explain on top of existing APIs. Another thing to keep in mind is that WebDB is mostly just an API wrapper around the normal dat filesystem. The project as a whole is less about providing a new database (especially since IndexedDB already exists) and more about providing a cleaner API for interacting with the dat filesystem.

webdesserts commented 6 years ago

@vsivsi See this thread by @pfrazee for a little bit of context on the IngestDB/WebDB merge.

vsivsi commented 6 years ago

@webdesserts Thanks for your response and the link to the twitter thread above. That definitely helped answer Q1 above. I’m still interested in the decision to build an all new thing rather than adopt and extend a more mature package. I kind of view DBs similarly to Crypto: it’s almost never the best move to invent your own (but for somewhat different reasons).

Anyway, with WebDB becoming less opinionated and moving to Userland, it seems that other solutions can happily coexist with it should different applications benefit from different solutions (and competition among ideas/approaches is generally a good thing so long as things don’t get too fragmented).

pfrazee commented 6 years ago

Yeah good questions @vsivsi.

WebDB is mainly an indexer for JSON files in dats. It wraps Dat and uses Level/IndexedDB, and though it has its own little querying API, it's a relatively small module. The core value is its indexing toolset, which is not available anywhere already.
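To make the "indexer for JSON files" idea concrete, here is a minimal in-memory sketch in TypeScript. It is not WebDB's API: `JsonFile`, `buildIndex`, and `lookup` are hypothetical names, and a plain object stands in for the Level/IndexedDB store.

```typescript
// Hypothetical sketch: a secondary index over JSON files, keyed by one field.
// A plain object stands in for the Level/IndexedDB store; this is not WebDB's API.

interface JsonFile {
  path: string;                          // file path inside the dat archive
  content: { [field: string]: string };  // parsed JSON body (string fields only, for brevity)
}

// Maps an indexed field value to the paths of the files that contain it.
type SecondaryIndex = { [value: string]: string[] };

function buildIndex(files: JsonFile[], field: string): SecondaryIndex {
  const index: SecondaryIndex = Object.create(null);
  files.forEach((file) => {
    const value = file.content[field];
    if (value === undefined) return;     // files without the field are not indexed
    const paths = index[value];
    if (paths) paths.push(file.path);
    else index[value] = [file.path];
  });
  return index;
}

// Returns the matching file paths in sorted order, or an empty list.
function lookup(index: SecondaryIndex, value: string): string[] {
  return (index[value] || []).slice().sort();
}

const sampleFiles: JsonFile[] = [
  { path: "/posts/1.json", content: { author: "alice", title: "Hello" } },
  { path: "/posts/2.json", content: { author: "bob", title: "Hi" } },
  { path: "/posts/3.json", content: { author: "alice", title: "Again" } },
];

const byAuthor = buildIndex(sampleFiles, "author");
const alicePosts = lookup(byAuthor, "alice"); // the two posts written by "alice"
```

The point of the sketch is the split of responsibilities: the files remain the source of truth, and the index is derived data that can be rebuilt from them at any time.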

If WebDB were implemented within the browser, I'd consider building it with SQLite so that we could take advantage of its query language, but that's not the case. WebDB is run inside the website's sandbox, and so it needs a datastore which is already built into browsers. Thus, the choice to use Level/IndexedDB.

In a lot of ways, I think you're approaching this with the same mentality as we have. We chose to run WebDB inside the website's sandbox because it's the lowest-risk, lowest-effort option we have to provide the toolset. Any existing DB would need to be implemented in the browser core as a new Web API, and so would need more consideration and development.

@webdesserts Good explanation. I should make sure to update the proposals with a disclaimer explaining what happened after they were published.

pfrazee commented 6 years ago

Oh, something else to mention.

HyperDB serves a very different purpose than WebDB. HyperDB is very similar to LevelDB: it's a key-value store. However, it's built with the network in mind, and in fact is like a LevelDB with built-in global publishing, replication, and CRDTs.

This is a bit confusing, but HyperDB will actually become a part of Dat's internal stack. Currently, the stack looks like this:

Hypercore. An append-only signed log.
Hyperdrive. A filesystem built on top of Hypercore.

HyperDB will be inserted in the middle, and give Hyperdrive a new ability to handle multiple writers (via the CRDT):

Hypercore. An append-only signed log.
HyperDB. A key-value store with support for multiple writers.
Hyperdrive. A filesystem built on top of HyperDB.

So then, to round it out, we have WebDB on top of all that, providing secondary indexes on top of the hyperdrive FS:

Hypercore. An append-only signed log.
HyperDB. A key-value store with support for multiple writers.
Hyperdrive. A filesystem built on top of HyperDB.
WebDB. A secondary indexer built on top of Hyperdrive.

A bit confusing, but 🤷‍♂️
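The layering described above can be sketched as a toy model in TypeScript. Every class and function name here (`AppendLog`, `KeyValueStore`, `FileSystem`, `indexByField`) is a hypothetical stand-in, not the real Hypercore, HyperDB, Hyperdrive, or WebDB API, and signing, replication, and multi-writer merging are all left out.

```typescript
// Hypothetical toy model of the four layers; none of this is the real Dat API.

interface LogEntry { key: string; value: string; }

// Layer 1: append-only log (stand-in for Hypercore). Entries are never mutated.
class AppendLog {
  private entries: LogEntry[] = [];
  append(entry: LogEntry): number {
    this.entries.push(entry);
    return this.entries.length - 1;      // sequence number of the new entry
  }
  readAll(): LogEntry[] { return this.entries.slice(); }
}

// Layer 2: key-value store (stand-in for HyperDB). Every put is a log entry.
class KeyValueStore {
  private log: AppendLog;
  constructor(log: AppendLog) { this.log = log; }
  put(key: string, value: string): void { this.log.append({ key: key, value: value }); }
  get(key: string): string | undefined {
    // The most recent entry for a key wins.
    const last = this.log.readAll().filter((e) => e.key === key).pop();
    return last ? last.value : undefined;
  }
  keys(): string[] {
    const seen: { [k: string]: boolean } = {};
    this.log.readAll().forEach((e) => { seen[e.key] = true; });
    return Object.keys(seen).sort();
  }
}

// Layer 3: filesystem (stand-in for Hyperdrive). Paths are just keys.
class FileSystem {
  private kv: KeyValueStore;
  constructor(kv: KeyValueStore) { this.kv = kv; }
  writeFile(path: string, body: string): void { this.kv.put(path, body); }
  readFile(path: string): string | undefined { return this.kv.get(path); }
  list(dir: string): string[] {
    return this.kv.keys().filter((p) => p.indexOf(dir) === 0);
  }
}

// Layer 4: secondary index over JSON files (stand-in for WebDB).
function indexByField(drive: FileSystem, dir: string, field: string): { [value: string]: string[] } {
  const index: { [value: string]: string[] } = Object.create(null);
  drive.list(dir).forEach((path) => {
    const body = drive.readFile(path);
    if (body === undefined) return;
    const doc = JSON.parse(body) as { [f: string]: unknown };
    const value = doc[field];
    if (typeof value !== "string") return; // only string fields are indexed here
    const paths = index[value];
    if (paths) paths.push(path);
    else index[value] = [path];
  });
  return index;
}

const feed = new AppendLog();
const drive = new FileSystem(new KeyValueStore(feed));
drive.writeFile("/posts/a.json", JSON.stringify({ author: "alice" }));
drive.writeFile("/posts/b.json", JSON.stringify({ author: "bob" }));
drive.writeFile("/posts/a.json", JSON.stringify({ author: "carol" })); // an overwrite is a new log entry
const authors = indexByField(drive, "/posts/", "author");
```

Note that overwriting `/posts/a.json` leaves three entries in the log, while the filesystem and the index only see the latest version of each file.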

vsivsi commented 6 years ago

@pfrazee Thanks for all of that explanation! I've been following Dat for over a year, and am well aware of the "multiple writers" issue as well as the current lack of a robust "fork and merge" workflow for even the simplest of non-conflicting cases. It's affected my ability to settle on Dat for multiple projects because it just keeps popping up. So the work on HyperDB is a hopeful step.

As for the discussion of WebDB and what it does and where its value-add is, I think we may be using the term "database" a little too imprecisely here. At one level, people consider things like LevelDB and IndexedDB to be "databases" (it's in their names!). And they are databases, just really stripped-down KV stores, as you point out. But for anyone who's ever used a relational SQL DB, or a mature document store like MongoDB or CouchDB, these seem like pretty primitive (foundational) technologies. This is all well-trodden ground.

And so you've developed WebDB to provide more user-land features on top of (Dat) files + Chromium's built-in IndexedDB implementation (subbing in LevelDB for node.js). I get all of that.

I guess my point is that pretty much this precise problem has already been solved, multiple times, by projects that are much more mature than Beaker itself. If you take something like PouchDB, it fills exactly the same niche you describe for WebDB. It is a lightweight library that runs in the sandbox (and on node.js) using IndexedDB / LevelDB as its backing store. It has rich indexing features and a full featured query language. It also has the ability to write simple plug-ins to support alternative backing stores (or a hybrid of such stores), and syncing and conflict resolution between multiple instances of the same database, etc. It also has solutions for defining and enforcing JSON schemas, etc. Sound familiar?

And that's just one example. There are other libs doing the same kinds of things too. I only highlight PouchDB because it is the one I'm most familiar with and my perception is that it is among the most mature and widely used, but I'm not wedded to it.

Anyway, I guess the kernel of my second question above, which I still haven't seen any substantive discussion of, is why WebDB? It's a new, mostly unproven, lightly documented, database API with no deployment track record. Beaker apps have a huge advantage if they only need to run in the Beaker / Electron / Chromium environment. But of course that may not be a great assumption for a lot of projects (e.g. as soon as you add an HTTPS gateway for a Dat hosting a Beaker app, you no longer control the browser). IndexedDB implementations are notoriously finicky about certain stuff. Do you really want to have to support/work around every quirky different browser implementation so that a Beaker App can run correctly over an HTTP gateway? With something like PouchDB, all of that thankless crap has been ironed out and proven...

I don't mean for this to be a rant. I would feel much better if you could point me to a discussion where the pros and cons of adopting/adapting something mature like PouchDB had been weighed against the risks/advantages of creating something all new like WebDB just for Beaker. This is not a criticism of WebDB or all of the great work you've done here. It just really feels like WebDB is a re-invention of a wheel that's been pretty well perfected in other forms, and so given that Beaker is still a small community with a very long and exciting TODO list, developing, promoting and maintaining yet another web database feels like a serious distraction.

Anyway, I think I've made my point. And given that WebDB exists and we aren't debating some future roadmap item, my next step is to dig deeper into it and see how it stacks up so I can make more concrete suggestions moving forward.

Thanks!

vsivsi commented 6 years ago

Just one more bit of info. This page and its diagram may make what I say above a bit more clear.

https://pouchdb.com/adapters.html

So in my conception of this, if your Dat file directory/JSON reader/writer could be easily buttoned up into a PouchDB-compatible "adapter", then you could jettison all responsibility for all of the other parts of WebDB that are necessary to make it a viable "database". Over time those "other parts" will grow to dwarf the core functionality you are after here, which is the Dat-based backing store.

As an aside, it looks like someone already made a quick stab at a Dat plugin for PouchDB, although I think they just took the strategy of dumping the complete PouchDB replication stream into a single monolithic Dat versioned file. That is a much cruder approach than what you have defined, so there's probably not too much to be learned from this code: https://github.com/calvinmetcalf/pouch-dat

pfrazee commented 6 years ago

After a point, it becomes less about a "correct choice" and more about tradeoffs in design priorities. On this project, I felt that the quality of the API would be better if we wrote a bespoke interface. Compared to a Pouch adapter, this wasn't significantly more work; I adapted the query mechanics from Dexie.js.

I gave Pouch a look but didn't feel like the CouchDBisms mapped cleanly to Dat, such as its replication APIs. I settled on using https://www.npmjs.com/package/level-js, mainly to provide interface uniformity between node and the browser.

I appreciate your points and I'd encourage folks to make a Pouch Dat adapter, but I like the results we've gotten with WebDB and I think we made the right call.

vsivsi commented 6 years ago

Fair enough. Thanks for answering my questions! I'm architecting a new scientific data management project with private foundation funding over a 5 year time span. We're still gathering requirements, but I'm evaluating Dat and Beaker to be potential foundational technologies for the project. Anyway, that's my motivation for digging into this and asking annoying questions. Trying to figure out what might work, what has legs, where we can pitch-in to make a difference, etc. If we can identify specific needs, there may be the possibility of going back to the foundation for more money as well. Thanks again!

pfrazee commented 6 years ago

Sure thing! Feel free to open issues for any other questions you have.

aral commented 6 years ago

@pfrazee What’s the best place to read up on the specific CRDT algorithm in hyperdb*? (And thank you so much for the absolutely awesome work you’re doing, by the way) :)

(*) I’ve been doing a lot of research into CRDTs lately and I’m most fond of a causal tree/logoot/LSEQtree-style approach. I’ve been looking into @jimpick’s excellent Dat Shopping List example, and whatever it uses seems to work well, although I’d love to understand its limitations a little better.

pfrazee commented 6 years ago

@aral This is still a WIP, but there's a spec here: https://github.com/datprotocol/DEPs/pull/10. It is built on top of https://www.datprotocol.com/deps/0004-hyperdb/.

Broadly speaking, it uses a "Multi-value register" CRDT; it holds conflicts until resolution. There's also an interesting CRDT structure being built on Dat's hypercore structure called hypermerge.
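For readers unfamiliar with the term, here is a minimal sketch of a generic multi-value register in TypeScript, using per-writer version counters. It illustrates only the "holds conflicts until resolution" behavior; it is an assumption-laden textbook version, not HyperDB's actual algorithm or data layout, and `Clock`, `Versioned`, `writeValue`, and `mergeReplicas` are hypothetical names.

```typescript
// Generic multi-value register sketch; NOT HyperDB's actual algorithm.

type Clock = { [writer: string]: number };   // per-writer update counters

interface Versioned { value: string; clock: Clock; }

// True when clock `a` has seen everything recorded in clock `b`.
function dominates(a: Clock, b: Clock): boolean {
  return Object.keys(b).every((w) => (a[w] || 0) >= (b[w] || 0));
}

// Merge two replicas: drop a value only if some other value is strictly newer.
// Concurrent values (neither clock dominates the other) are all kept.
function mergeReplicas(left: Versioned[], right: Versioned[]): Versioned[] {
  const all = left.concat(right);
  const kept: Versioned[] = [];
  all.forEach((candidate) => {
    const superseded = all.some((other) =>
      dominates(other.clock, candidate.clock) && !dominates(candidate.clock, other.clock));
    const duplicate = kept.some((k) =>
      dominates(k.clock, candidate.clock) && dominates(candidate.clock, k.clock));
    if (!superseded && !duplicate) kept.push(candidate);
  });
  return kept;
}

// A write supersedes every value the writer has seen: it takes the
// element-wise maximum of their clocks and bumps the writer's own counter.
function writeValue(current: Versioned[], writer: string, value: string): Versioned[] {
  const clock: Clock = {};
  current.forEach((v) => {
    Object.keys(v.clock).forEach((w) => {
      clock[w] = Math.max(clock[w] || 0, v.clock[w] || 0);
    });
  });
  clock[writer] = (clock[writer] || 0) + 1;
  return [{ value: value, clock: clock }];
}

const draft = writeValue([], "alice", "draft");
const aliceEdit = writeValue(draft, "alice", "alice's edit"); // concurrent with bobEdit
const bobEdit = writeValue(draft, "bob", "bob's edit");       // concurrent with aliceEdit
const conflicted = mergeReplicas(aliceEdit, bobEdit);         // both edits are held
const resolved = writeValue(conflicted, "alice", "merged");   // a later write resolves the conflict
```

In this sketch a reader of `conflicted` sees both edits and has to decide what to do with them; nothing is silently discarded until a writer who has seen both values writes a new one.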

aral commented 6 years ago

@pfrazee Thank you so much, Paul. Just left a quick comment there with a link to my recent research into the academic papers for CRDT at https://github.com/datprotocol/DEPs/pull/10#issuecomment-395140837

Looking forward to checking out hypermerge and it looks like we will hopefully be basing Indie Site on DAT which I’m very excited about – not least because of the awesome community and the extremely intelligent and caring folks you have here :)

pfrazee commented 6 years ago

@aral cool! LMK if you need help with anything

beakerbrowser / specs

Questions about WebDB and Injest #5