| | Datomic | Datahike | OpenCrux |
| --- | --- | --- | --- |
| scalability | 100B datoms | millions of entities | dependent on document size |
| time | uni-temporal | uni-temporal | bi-temporal |
| license | closed-source | EPL 1.0 | MIT |
| storage services | DynamoDB, Cassandra, JDBC SQL stores | LevelDB, Redis, PostgreSQL | RocksDB, LMDB, Kafka, JDBC SQL stores |
From Jeroen in the Slack:
Maybe start with Datomic (the best-known and most mature option) and postpone this decision, unless someone has a clear vision on this? I think these three should be mostly compatible at the query and data-model level. If something becomes difficult with Datomic, reconsider (e.g. when implementing collaboration features)? Or reconsider when people have trouble setting up the free version and don't want to pay for the commercial one. If that is an upfront certainty, go for Datahike or OpenCrux right away?
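As a concrete illustration of the query-level compatibility Jeroen mentions, here is a minimal sketch using DataScript (which is schema-free, so it runs as-is); the `:block/*` attributes are hypothetical stand-ins for whatever schema Athens adopts:

```clojure
(require '[datascript.core :as d])

;; Build a tiny in-memory db with one block.
(def db
  (d/db-with (d/empty-db)
             [{:block/uid "abc123" :block/string "Hello, Athens"}]))

;; This same Datalog form runs unchanged on Datomic and Datahike
;; (only the required namespace differs).
(d/q '[:find ?uid ?string
       :where
       [?b :block/uid ?uid]
       [?b :block/string ?string]]
     db)
;=> #{["abc123" "Hello, Athens"]}

;; Crux is close but not identical: queries are maps, run against a db value:
;; (crux/q (crux/db node)
;;         '{:find [uid string]
;;           :where [[b :block/uid uid]
;;                   [b :block/string string]]})
```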
I can only speak for Crux on these points...

Pros:

- valid-time
- timestamp (see: https://jaredforsyth.com/posts/hybrid-logical-clocks/ & CockroachDB)

Cons:

Hope that helps!
Edit: this might be of interest: https://findka.com/blog/migrating-to-biff/ (Firebase-like stack on top of Crux)
From Christopher Small, author of datsync
My hope is that DatSync will be able to support Datahike on the backend, and I have no objections to supporting Crux if there aren't technical blockers. Is it [Athens] mainly focused on small deployments for a sort of DIY self-hosted Roam? If so, and you'd mostly be expecting data from small numbers of users, you can probably get things working with any of these tools. If, however, you are hoping to have large centralized (but OSS) hosting available, then you'd need to think about scalability, and I think your best option there would be Datomic. Datahike has pretty decent query performance, but writes have the potential to be a bottleneck, so you should look into where that pain point hits. For lots of (relatively) small deployments, though, Datahike would be perfect. If I knew more about Crux I might be able to say more about its advantages, but if you are looking to use DataScript on the client, Datahike is a fork of it and so likely to be a better impedance match.
We've not looked at datsync in any detail but we have spent some time thinking about crux->datascript replication already: https://github.com/crux-labs/crux-datascript/blob/master/src/crux_datascript/core.clj
Just to also chime in and add a few things that have not been said yet:
Yes, Datahike is still very much compatible with DataScript, and moreover we are aiming to port our query engine with durability back over to ClojureScript in our next release (after 0.3.0, which is pending), so Datahike will be able to substitute for DataScript and optionally provide client-side durability at the same time. We have implemented all our abstractions as replikativ libraries in a platform-neutral way from the start; the main thing missing is ClojureScript asynchronous IO support in Datahike's query engine code. This is a very doable task, it was just easier and more attractive to get the JVM version working well first. Replicating Datahike will be possible with P2P web technology, as demonstrated in https://lambdaforge.io/2019/12/08/replicate-datahike-wherever-you-go.html. We are convinced that we need to find better business models than the current data-silo approach.
We also provide a Datomic-compatible core API that is used by our commercial clients, so if you decide to stick to the common subset, you will be able to swap Datahike in at any point. If you hit missing features or incompatibilities, please open an issue. We are currently working on our write throughput, and I am confident that we can scale to Datomic-size deployments in principle; it has just been a matter of priorities.
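To make that "common subset" concrete, here is a hedged sketch of a small session against Datahike's Datomic-style API. The config and attribute names are illustrative only; swapping in `datomic.api` would additionally require transacting a schema first, since this sketch sidesteps schema with Datahike's `:schema-flexibility :read` option:

```clojure
(require '[datahike.api :as d])   ; or: (require '[datomic.api :as d])

(def cfg {:store {:backend :mem :id "athens-dev"}
          :schema-flexibility :read})

(d/create-database cfg)
(def conn (d/connect cfg))

;; Write one block, then query it back with the shared Datalog surface.
(d/transact conn [{:block/uid "abc123" :block/string "Hello, Athens"}])

(d/q '[:find ?s :where [_ :block/string ?s]] (d/db conn))
;=> #{["Hello, Athens"]}
```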
We, the members of LambdaForge, are also big fans of the Zettelkasten method (we used it even before we were aware of Roam) and use https://org-roam.readthedocs.io/en/latest/ at the moment. We would be super happy to see a reliable open-source implementation like Athens succeed, so keep going :100: !
I think ideally the backends should be exchangeable, so even if you decide on one, keep in mind where you buy into its specific semantics.
Although I don't consider myself an expert in databases, I guess one of the (future) advantages of Datahike would be that it could potentially enable "local first" as described here: https://www.inkandswitch.com/local-first.html. For me this would be great to have in a tool like Athens because you could easily edit offline on multiple machines, while having confidence that your edits could later on combine seamlessly.
Thanks so much for sharing that link. Several engineers (including myself) are quite interested in local first applications. We've discussed databases like OrbitDB, Gun, and Scuttlebutt. Datahike is very interesting for this reason.
@tangjeff0 no problem. I guess Datahike isn't quite there yet, but maybe @whilo can share something about whether Datahike would allow a local-first workflow in the future?
Yes. Ever since our early work on http://replikativ.io/, which predated most of these other local-first approaches but did not attract a large community back then (and did not have a nice programming model such as Datalog), we have wanted to be local-first. We aim to port Datahike back to ClojureScript in our next iteration. Do you think Open Collective would work to fund this work? Any help would be appreciated, as we are currently still hammering out Datomic compatibility and some scalability issues in the JVM version.
Will re-open after v1 is complete.
TL;DR: Do you plan to support block-level access control and notifications/subscriptions? If so, how do you plan to do this? Maybe the DB choice is a deal-breaker.
Hi there. I've been discussing with @tangjeff0 a little bit on Twitter about your plans and how they could be linked with ours.
I also had some experience working with heavily nested and linked content with my previous project www.collectiveone.org and I have a couple of comments regarding the DB and how to handle the multi-player case:
access control at the block level: Ideally you want access control at the block level, but you need some sort of "default" inheritance logic to be able to switch access control for a whole area at once. Notion does this at the page level, with rules like "permissions of this page are defined by this other workspace...". Besides inheritance, I would like composability, so that you can express rules like "those with access to A AND/OR access to B can access this". Also, access control must be super fast, as it is computed almost every time a block is read (see the inheritance sketch below).
subscriptions and notifications: Ideally you also need some sort of inheritance logic here, so that if I subscribe to changes on a block, I get notified every time any of its child blocks changes. Each user can have different notification settings for each object, and one block can be in many places at the same time, so once there is an event on one block, it is very hard to determine who needs to receive that email/push notification.
I did this in Postgres the last time I tried it and relied heavily on recursion, navigating the DB in many directions before determining what to do or whom to message. It was too slow. I am not an expert in big-data systems, so I really wonder how these problems should actually be handled.
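Here is a tiny sketch of that inheritance walk, with an entirely made-up in-memory data model (nothing below is from Athens or any of the databases under discussion): a block either carries its own ACL or inherits from its nearest ancestor that has one.

```clojure
(def blocks
  {:b1 {:parent nil :acl #{:alice :bob}}  ; page-level permissions
   :b2 {:parent :b1}                      ; inherits from :b1
   :b3 {:parent :b2 :acl #{:alice}}})     ; overrides for a subtree

(defn effective-acl
  "Walk up the parent chain until an explicit ACL is found."
  [blocks id]
  (when-let [{:keys [acl parent]} (get blocks id)]
    (or acl (effective-acl blocks parent))))

(effective-acl blocks :b2) ;=> #{:alice :bob}
(effective-acl blocks :b3) ;=> #{:alice}
```

Notification fan-out is essentially the inverse walk: from a changed block, ascend to collect every ancestor with a subscription. Caching or memoizing these traversals is what keeps reads fast, which is exactly the pain point described above.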
Another factor I'd like to point out is the conflict resolution story, whether it be distributed or centralized.
Another option could be to use a Git repository as a backend. This would require creating a REST API on top of the Git repo to parse the documents, but it would result in greater compatibility with existing tools. One would easily be able to have the files locally, and even use other markdown editors or more advanced editors like Obsidian. And there is also a mobile app already ready (GitJournal; I'm the author).
This would result in a very different architecture, though. I'm willing to help if you want to go down this route. I would love more tools to be compatible with each other.
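For a rough sense of the idea, here is a minimal, hypothetical sketch of such an API as a Ring handler; the repo path and URL scheme are made up, and parsing, auth, and writes are elided:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Serve a markdown page straight out of a local clone of the repo.
;; Real code would sanitize `page` (path traversal!) and authenticate.
(defn page-handler [{:keys [uri]}]
  (let [page (str/replace-first uri "/pages/" "")
        file (io/file "/srv/athens-repo" (str page ".md"))]
    (if (.exists file)
      {:status 200
       :headers {"Content-Type" "text/markdown"}
       :body (slurp file)}
      {:status 404 :body "not found"})))
```

Writes would shell out to git (add/commit/push), which is where the conflict-resolution question raised above comes back in.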
You could also consider https://github.com/terminusdb/terminusdb-server
I'd just like to add that for me, Athens being open-source is a significant advantage over Roam, and if Athens ends up requiring a closed-source backend to be most useful, that advantage would be diminished.
Also, it would be nice to abstract the backend-facing code so that people could potentially run Athens on other backends, as long as those support some defined protocol.
A protocol is ideal but the hardest to pull off. Crux, Datomic, DataScript, and Datahike will inevitably have some differences from each other.
Agreed that a closed-source backend diminishes the value. Inevitably parts of our infrastructure will be closed, but if there is a fully open-source, full-stack solution for users to self-host, super great.
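For illustration, such a protocol might start out like the following hypothetical sketch (the name and methods are invented here, not from the Athens codebase):

```clojure
;; A minimal, hypothetical backend abstraction. Each store would get a small
;; adapter (e.g. via reify) over datahike.api or crux.api, papering over
;; differences such as Crux's map-form queries or Asami's missing Pull API.
(defprotocol AthensBackend
  (transact! [this tx-data] "Write a batch of block-level facts.")
  (query [this datalog-form] "Run a Datalog query against the current db value.")
  (pull-block [this uid] "Fetch a single block and its children by uid."))
```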
Also just learned about https://github.com/fluree/ from Matei. Clojure, Web3, open-source.
https://www.youtube.com/watch?v=uSum3uynHy4&feature=youtu.be
Hi there! I'm glad to see some movement here :slightly_smiling_face:
We have been working on an interface specification for our Athens-like app so that the backend is abstracted. We have also been working hard on a NodeJS + DGraph backend API that is open-sourced under an AGPL-like license.
I'd bet the interface supports (or will support) all the needs of Athens. Who knows! :muscle: It includes backlinks and search features, granular access control (and thus multi-player), and fast data creation and fetching.
Reusing our backend, or just the interface, will also provide interoperability among our apps. Users will be able to embed and edit blocks from Athens in Intercreativity, for example. They can also "fork" them, as we want to support Git-like flows with content.
Oh, and eventually Athens could connect to other data storage solutions. We have prototypes for OrbitDB, Ethereum, Kusama, and IndexedDB (local).
This is a recent demo of our latest milestone (a simple case where users mix private with public content). We are about to release a new version where users can explore a feed of blog posts.
If you want to run it, this repo should run ok on Ubuntu or Mac. It is our latest development version.
Oh, and this is our discord in case you want to reach us. :wave:
The video @tangjeff0 mentioned above covers both the broad vision and technical details of Fluree better than I could, but here's my quick take:
Fluree is an in-memory, semantic graph database backed by a permissioned blockchain, built with Clojure and open-source.
It can be containerized (with Kubernetes support) and optionally decentralized (e.g. using StorJ via Tardigrade), run as a standalone JVM service, or embedded inside the browser as a web worker. Read more here about the query server (fluree-db) and the ledger server (fluree-ledger).
Since Fluree extends RDF (the official W3C standard for data interchange), it immediately becomes interoperable with the linked open datasets on the semantic web. One interesting use case would be to directly query DBPedia or Wikidata from within Athens and combine them with your own data at runtime, without an API. Additionally, an RDF foundation means you can build ontologies with any of the modeling languages layered on top of it (RDFS, OWL, etc., which are official W3C recommendations), which opens up capabilities for inferencing and automated reasoning.
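To make the RDF point concrete: an Athens block that references a Wikidata entity could, in principle, reduce to a handful of triples. This is purely illustrative; the predicate names are invented and this is not Fluree's transaction format (wd:Q42 is Wikidata's identifier for Douglas Adams):

```clojure
;; Hypothetical subject-predicate-object triples for one block.
[["athens:block/abc123" "rdf:type"          "athens:Block"]
 ["athens:block/abc123" "athens:content"    "Douglas Adams wrote The Hitchhiker's Guide"]
 ["athens:block/abc123" "athens:references" "wd:Q42"]]
```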
From my view, Fluree could be a powerhouse tool to strongly differentiate Athens from Roam and every other "tool for networked thought." Between RDF standards and a permissioned blockchain (which allows for block/cell-level access control), you could seamlessly and securely deploy Athens at an individual, team, or enterprise level using the same scalable infrastructure.
Would love to get the Fluree team's thoughts here...
@quoll I would like to advocate for the adoption of https://github.com/threatgrid/asami but feel it would be better left up to the expert :) Athens is currently ClojureScript/re-frame/DataScript/Posh (I'm working on sunsetting Posh right now, actually).
What are your thoughts on whether Asami would be a good fit as a graph-DB for us?
Selfishly, I will admit I would LOVE the excuse to combine our open-source powers to leverage the benefits of bi-directional knowledge linking, use Asami in the wild, and possibly have the opportunity to work with you in a technical capacity to help implement it if we do end up going this way... and I don't think I'd be the only one!
Love to help. I hope to have Asami 2.0-alpha out by the end of the week. This will have storage when on the JVM. JavaScript is coming, but in the meantime it will have save/load functions. Unfortunately, Asami doesn’t have all the APIs of the other stores, e.g. the Pull API.
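Based on Asami's documented usage, its API is close to the Datomic shape; a small sketch (the entity data is made up):

```clojure
(require '[asami.core :as d])

;; In-memory database, addressed by URI.
(def db-uri "asami:mem://athens")
(d/create-database db-uri)
(def conn (d/connect db-uri))

;; transact returns a deref-able result, as in Datomic.
@(d/transact conn {:tx-data [{:block/uid "abc123"
                              :block/string "Hello, Athens"}]})

(d/q '[:find ?s :where [?b :block/string ?s]] (d/db conn))
```

Note that, per the comment above, there is no Pull API, so nested reads would have to be expressed as Datalog queries.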
I've looked a bit into Datahike. From what I learned it looks like:
As someone new to Clojure, this makes me less nervous about depending on a backend that has a Datomic-like API, and optimistic about Datahike, because it would still allow freedom in the choice of backing storage.
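For instance, per Datahike's documented configuration format, switching stores is a config change rather than a code change (the paths here are made up):

```clojure
(require '[datahike.api :as d])

;; Same API, different durable store, selected purely by configuration.
(def mem-cfg  {:store {:backend :mem  :id "scratch"}})
(def file-cfg {:store {:backend :file :path "/tmp/athens-db"}})

(d/create-database file-cfg)
(def conn (d/connect file-cfg))
;; Further backends (JDBC, Redis, ...) ship as separate konserve-based plugins.
```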
@mateicanavra laid out Fluree for us very well in his comment above. I will elaborate a little on some of the points made and bring up one additional one, which is among the most powerful parts of Fluree. The RDF foundation is intended to enable data interoperability across the semantic web and provides a very flexible data model for the applications built on top of it. Our immutable ledger brings both decentralization and horizontal scaling in the transaction tier, if that is needed, as well as the benefits of querying historical data states from earlier in the blockchain. We have segregated the query and transaction tiers, such that the query engine and an LRU cache of data can be loaded in-memory on the client device using a service worker. I would imagine that for a personal Athens graph that may be the entire thing, which enables millisecond query responses. The db (query peer) is also linearly scalable, but I'm not sure that really applies to the use case here.

The biggest advantage Fluree brings is SmartFunctions. Because each transaction is signed with the user's private key, the data can be permissioned at the individual RDF element level. You could write the SmartFunctions in such a way that no one else would have access to the data, and a user could share as desired.
Noting this work-in-progress Datahike backend for the benefit of those following this issue: https://github.com/athensresearch/athens-backend
Also, I recently pulled together a comparison matrix for various Clojure-Datalog stores: https://clojurelog.github.io/