CodeForPhilly / jawn

'Git for Tabular Data'
http://datjawn.com
BSD 3-Clause "New" or "Revised" License

jawn as a wrapper around levelup and hypercore #15

Open flyingzumwalt opened 8 years ago

flyingzumwalt commented 8 years ago

@ogd @mafintosh as I spec out the basic features for dat-tables, it looks like our hypercore feeds will basically contain a log of CREATE, UPDATE, and DELETE operations. I want to avoid writing an API that unnecessarily duplicates the leveldown API or (gadzooks) ending up with yet another SQL-ish API that nobody wanted. Would it be easier to make levelup more central in the operations and implement a sort of 'smart' log that follows 'put' and 'delete' events emitted by levelup, writing the changes to a hypercore feed?
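To make the idea concrete, here's a rough sketch of what that 'smart' log could look like. This is an assumption on my part, not working jawn code: it relies on levelup emitting 'put' and 'del' events after writes, and on the current hypercore constructor (both APIs have shifted across versions).

```js
// Hypothetical sketch: mirror levelup writes into an append-only hypercore
// feed that serves as the operation log.
const levelup = require('levelup')
const leveldown = require('leveldown')
const hypercore = require('hypercore')

const db = levelup(leveldown('./index'))                      // HEAD of the data
const feed = hypercore('./oplog', { valueEncoding: 'json' })  // operation log

// levelup emits 'put' and 'del' after successful writes, so the log can simply
// follow the database. (Distinguishing CREATE from UPDATE would need an
// existence check before the write.)
db.on('put', (key, value) => feed.append({ op: 'put', key: String(key), value: String(value) }))
db.on('del', (key) => feed.append({ op: 'del', key: String(key) }))

db.put('people/1', JSON.stringify({ name: 'Ada' }))
```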

Assumptions

The key features for an MVP of dat-tables are:

I had been thinking of the index as a tail of the feed -- always reading operations from the feed (even locally) and applying them to the index accordingly -- but I think that was the wrong approach.

dat-tables as a wrapper around levelup and hypercore

Note: The levelup database always represents the HEAD of your data.

This lets us focus on making dat-tables support operations like 'checking out' old versions and running diffs against old versions of the data. It also frees us up to refine the API to support target use cases.
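For example, 'checking out' an old version could just mean replaying the feed into a scratch index. A hypothetical sketch (the checkout() helper and the op format are assumptions, not jawn's API):

```js
// Hypothetical sketch: rebuild the state of the data as of feed entry `seq`
// by replaying the operation log into a throwaway in-memory levelup index.
const levelup = require('levelup')
const memdown = require('memdown')

function checkout (feed, seq, callback) {
  const snapshot = levelup(memdown())
  const stream = feed.createReadStream({ start: 0, end: seq })
  stream.on('data', (entry) => {
    // Real code should wait for each write to finish before using the snapshot.
    if (entry.op === 'del') snapshot.del(entry.key)
    else snapshot.put(entry.key, entry.value)
  })
  stream.on('error', callback)
  stream.on('end', () => callback(null, snapshot))
}
```

A diff against HEAD would then be a matter of walking the snapshot's read stream alongside the live index.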

Does this make sense? Am I missing anything?

max-mapper commented 8 years ago

It might be easier to implement leveldown instead of levelup, because then you get levelup for free (you just pass a leveldown instance into levelup when creating it and it will use that).

Also, I think it would be helpful to have a simple use case to start with.
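Something along these lines (a sketch only; the class name is made up, and constructor signatures differ across abstract-leveldown/levelup versions):

```js
// Hypothetical sketch: implement the (much smaller) leveldown interface once
// and let levelup provide the user-facing API on top of it.
const { AbstractLevelDOWN } = require('abstract-leveldown')
const levelup = require('levelup')

class JawnDOWN extends AbstractLevelDOWN {
  constructor () {
    super()
    this._store = new Map() // stand-in for an index backed by a hypercore feed
  }
  _put (key, value, options, callback) {
    this._store.set(String(key), value)
    process.nextTick(callback)
  }
  _get (key, options, callback) {
    const value = this._store.get(String(key))
    if (value === undefined) return process.nextTick(callback, new Error('NotFound'))
    process.nextTick(callback, null, value)
  }
  _del (key, options, callback) {
    this._store.delete(String(key))
    process.nextTick(callback)
  }
}

// levelup wraps the leveldown instance, so get/put/del/batch/streams come for free.
const db = levelup(new JawnDOWN())
```

The appeal is that only _put/_get/_del (plus iterators and batch, eventually) would have to know about hypercore; everything else comes from levelup.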

flyingzumwalt commented 8 years ago

Good point @maxogden. I've been so focused on assembling the team that I neglected use case gathering. I've posted a call, "Seeking Use Cases for a 'Git for Tabular Data'", here: https://github.com/datproject/discussions/issues/37

reubano commented 8 years ago

Forgive my ignorance of LevelDB, but the more I read about hypercore and the dat syncing infrastructure, the more I wonder whether CouchDB could provide a simpler solution. Is there any reason not to explore that option (aside from maintaining dat compatibility)?

flyingzumwalt commented 8 years ago

@reubano the index itself does not get synced across hosts. I've added some explanation to Issue #9 ("Tail" the feed into an index) for clarification, including a bit of discussion about the different options for which db to use.

max-mapper commented 8 years ago

@reubano excellent question, I actually used to heavily use CouchDB. I started by using CouchApps so that the logic of the application could be self-contained in Couch, but quickly outgrew what CouchApps can support (since they are just single-page apps with no way to do work in a persistent background process). I moved to building my applications in Node, using CouchDB as a more traditional database backing the app. When LevelDB came on the scene it simplified a lot of things, because instead of having to run a CouchDB server and a separate Node server, your single Node process is both your application server and your database server with LevelDB.

I've found that using LevelDB lets you be a lot more flexible in choosing tradeoffs for replication, data formats, and conflict management. If CouchDB does exactly what you want then it can be a good option, as long as you are comfortable with things like Erlang stack traces and occasionally writing performance-critical map/reduce functions in Erlang. I was able to get pretty far with it, but ultimately, when writing a command line tool, it was impractical to expect users to manually install CouchDB on their machine in order to use the dat CLI.

reubano commented 8 years ago

@maxogden I agree couchdb would be difficult to use in a CLI. For that I think leveldb (or even sqlite) would work to store the local index, which could then sync to a couchdb server via pouchdb. My question is more about the need for the hypercore feed at all. Since jawn only deals with tabular data, couldn't you represent each row as a couchdb document? Then with pouchdb you'd get the syncing and replication out of the box. It seems like this is the exact use case couch was designed for, no?

The only issue to overcome is figuring out how to re-purpose couchdb to save all revisions (since old revisions aren't available in views and they get deleted during compaction).
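For reference, the PouchDB route being described would look roughly like this. It's a sketch under the assumption of one document per row; the database names and server URL are made up.

```js
// Hypothetical sketch of the CouchDB/PouchDB route: one document per row,
// with replication handled by PouchDB's sync.
const PouchDB = require('pouchdb')

const local = new PouchDB('jawn-local')                   // local index (LevelDB-backed in Node)
const remote = new PouchDB('http://localhost:5984/jawn')  // assumed CouchDB server URL

// Each table row becomes a document keyed by table and row id.
local.put({ _id: 'people/1', name: 'Ada', city: 'Philadelphia' })
  .catch(console.error)

// Continuous two-way replication with the CouchDB server.
local.sync(remote, { live: true, retry: true })
  .on('error', console.error)

// Caveat noted above: Couch's _rev history is not a usable version history,
// since old revisions disappear on compaction, so full history would still
// need to be modeled explicitly (e.g. as separate documents per version).
```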

Granted, I may be unfairly attributing dat's complexity to leveldb. So if there is a way to simplify things for jawn without couchdb, I'd be interested in hearing those options as well.

max-mapper commented 8 years ago

@reubano good points. I would summarize the dat/leveldb approach as more DIY than using CouchDB. CouchDB is like a framework with a built-in solution for what you need to do, but it gets difficult when you need to do something it doesn't support, or do it differently than it does. The thing I like about LevelDB is that it lets you decide what your tradeoffs are, and only implement the features you need. Which one is better depends on the use case, of course.

flyingzumwalt commented 8 years ago

Note: I learned that hyperkv and hyperlog cover a lot of this same terrain. While looking at that, I've been wondering whether you could get a lot of the same benefit by just making hyperdrive watch your leveldb logs (SSTables) and then tailing those feeds back into leveldb instances. It seems like that would leverage the core design of leveldb and SSTables quite nicely (at least according to this 2012 article).

So creating, updating, and deleting rows/tables is just done through the leveldb interface however you see fit, and then "commits" and replication are handled as hypercore feeds. That would be pretty powerful.
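A very rough sketch of that direction, substituting fs.watch plus a plain hypercore feed for hyperdrive (the file patterns, paths, and replication details are all assumptions):

```js
// Hypothetical sketch: whenever LevelDB flushes a new SSTable or log file,
// publish it on a feed that peers could tail back into their own leveldb dirs.
const fs = require('fs')
const path = require('path')
const hypercore = require('hypercore')

const dataDir = './index'                 // leveldb data directory
const feed = hypercore('./sstable-feed')  // binary feed of table files

fs.watch(dataDir, (eventType, filename) => {
  if (!filename || !/\.(ldb|sst|log)$/.test(filename)) return
  fs.readFile(path.join(dataDir, filename), (err, buf) => {
    if (err) return   // file may already have been compacted away
    feed.append(buf)  // a real implementation would also record the filename/offsets
  })
})
```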