athensresearch / athens

Athens is no longer maintained. Athens was an open-source, collaborative knowledge graph, backed by YC W21
https://athensresearch.github.io/athens

Backend architecture: Datomic, datahike, OpenCrux, datalevin, Fluree #9

Open tangjeff0 opened 4 years ago

tangjeff0 commented 4 years ago
|                  | Datomic                        | datahike                   | OpenCrux                        |
| ---------------- | ------------------------------ | -------------------------- | ------------------------------- |
| scalability      | 100B datoms                    | millions of entities       | dependent on document size      |
| time             | uni-temporal                   | uni-temporal               | bi-temporal                     |
| license          | closed-source                  | EPL 1.0                    | MIT                             |
| storage services | DynamoDB, Cassandra, JDBC SQLs | LevelDB, Redis, PostgreSQL | RocksDB, LMDB, Kafka, JDBC SQLs |
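
All three stores speak a closely related, Datomic-style Datalog, which is why the table compares them at the storage and licensing level rather than the query level. As a minimal sketch of that shared query model, here is the same shape of query run against DataScript's in-memory API (the `:block/*` attribute names are hypothetical, chosen to mirror Athens' block model):

```clojure
;; Sketch: a Datomic-style Datalog query, shown with DataScript for
;; illustration. Datomic, datahike, and Crux accept (minor variations of)
;; the same :find/:where syntax.
(require '[datascript.core :as d])

(def conn
  (d/create-conn {:block/children {:db/cardinality :db.cardinality/many
                                   :db/valueType   :db.type/ref}}))

(d/transact! conn [{:db/id -1 :block/string "Parent block"}
                   {:db/id -2 :block/string "Child block"}
                   {:db/id -1 :block/children -2}])

;; Find the strings of all children of the "Parent block" entity:
(d/q '[:find ?child-string
       :where
       [?b :block/string "Parent block"]
       [?b :block/children ?c]
       [?c :block/string ?child-string]]
     @conn)
;; => #{["Child block"]}
```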
tangjeff0 commented 4 years ago

From Jeroen in the Slack:

Maybe start with Datomic (the best known and most mature option) and postpone this decision? Unless someone has a clear vision on this. I think these three should be mostly compatible at query and data model level. If something becomes difficult with Datomic, reconsider (e.g. when implementing collaboration features)? Or when people have trouble setting up the free version, and don’t want to pay for the commercial version, reconsider. If this is an upfront certainty, go for datahike or OpenCrux right away? (edited)

refset commented 4 years ago

I can only speak for Crux on these points...

Pros:

  1. Regular, fully-featured releases w/ transparent roadmap (e.g. upcoming JSON and SQL support might help non-Clojure Athens users to build tools/integrations): https://github.com/juxt/crux/projects/1
  2. Low memory requirements makes it particularly suitable for self-hosting (this is mostly because the query engine is lazy)
  3. Setting up a collaborative Crux-backed environment could be as simple as having a group of users share access to a managed Kafka service, see https://juxt.pro/blog/posts/crux-confluent-cloud.html (vs. always having to maintain a bunch of centralised DB infrastructure somewhere)
  4. Dev team that is excited and keen to see Athens succeed
  5. There's a tantalising possibility that bitemporality could be an invaluable capability in a collaborative context. We're already thinking about the feasibility of using Hybrid Logical Clocks in place of a simple valid-time timestamp (see: https://jaredforsyth.com/posts/hybrid-logical-clocks/ & CockroachDB)
  6. An EQL syntax is available for "pull" https://github.com/juxt/crux/issues/849

Cons:

  1. We're still in Beta - so there may be a few API changes, but nothing too fundamental
  2. Crux is schemaless, but not magic, so you still need to have some idea of what your schema looks like :)

Hope that helps!

Edit: this might be of interest: https://findka.com/blog/migrating-to-biff/ (Firebase-like stack on top of Crux)
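
To make point 5 (bitemporality) concrete, here is a rough sketch of an "as-of" valid-time query, assuming Crux's beta-era `crux.api` namespace; the `:block/*` attributes are hypothetical and the exact node-configuration map may differ by release:

```clojure
;; Sketch (Crux beta-era API, details may vary): write two versions of a
;; document at explicit valid times, then query the database as it was
;; valid at a past moment.
(require '[crux.api :as crux])

(def node (crux/start-node {})) ; in-memory node, no persistent storage

(crux/submit-tx node
  [[:crux.tx/put
    {:crux.db/id :block-1 :block/string "First draft"}
    #inst "2020-01-01"]
   [:crux.tx/put
    {:crux.db/id :block-1 :block/string "Edited later"}
    #inst "2020-06-01"]])

(crux/sync node) ; wait for the transactions to be indexed

;; Query the graph as it was valid in February 2020:
(crux/q (crux/db node #inst "2020-02-01")
        '[:find ?s
          :where [:block-1 :block/string ?s]])
;; expected to return the earlier version, "First draft"
```

In a collaborative setting, the same mechanism would let a client ask "what did this block look like when my peer made their edit?" without any application-level history tracking.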

tangjeff0 commented 4 years ago

From Christopher Small, author of datsync

My hope is that DatSync will be able to support Datahike on the backend, and I have no objections to supporting Crux if there aren't technical blockers. Is it [Athens] mainly focused on small deployments for a sort of DIY self-hosted Roam? If so, and you'd mostly be expecting data from small numbers of users, you can probably get things working with any of these tools. If however, you are hoping to have large centralized (but OSS) hosting available, then you'd need to think about scalability, and I think your best option there would be Datomic. Datahike has pretty decent query performance, but writes have the potential to be a bottleneck, so you can look into where that pain point hits. For lots of (relatively) small deployments though, datahike would be perfect. If I knew more about crux I might be able to say more about its advantages, but if you are looking to use DataScript on the client, Datahike is a fork, and so likely to be a better impedance match.

refset commented 4 years ago

We've not looked at datsync in any detail but we have spent some time thinking about crux->datascript replication already: https://github.com/crux-labs/crux-datascript/blob/master/src/crux_datascript/core.clj

whilo commented 4 years ago

Just to also chime in and add a few things that have not been said yet:

Yes, Datahike is still very much compatible with DataScript, and moreover we are aiming to port our query engine with durability back over to ClojureScript in our next release (after 0.3.0, which is pending). Datahike will then be able to substitute for DataScript and optionally provide client-side durability at the same time.

We have implemented all our abstractions as replikativ libraries in a platform-neutral way from the start; the main thing missing is ClojureScript asynchronous IO support in Datahike's query engine code. This is a very doable task, it was just easier and more attractive to get the JVM version working well first. Replicating Datahike will be possible with P2P web technology, as demonstrated in https://lambdaforge.io/2019/12/08/replicate-datahike-wherever-you-go.html. We are convinced that we need to find better business models than the current data-silo approach.

We also provide a Datomic-compatible core API that is used by our commercial clients, so if you decide to stick to the common subset, you will be able to swap Datahike in at any point. If you hit missing features or incompatibilities, please open an issue. We are currently working on our write throughput, and I am confident that we can scale to Datomic-size deployments in principle; it is just a matter of priorities.
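
As a rough illustration of that common subset, here is a sketch assuming Datahike's ~0.3 map-style configuration (the `:block/*` attributes are hypothetical, and exact config keys such as `:schema-flexibility` vary between versions — check the current docs):

```clojure
;; Sketch (Datahike ~0.3 API, config shape may differ by version): code
;; written against this Datomic-style subset could later point at Datomic
;; itself with mostly mechanical changes.
(require '[datahike.api :as d])

(def cfg {:store {:backend :mem :id "athens-sketch"}
          ;; allow transacting without declaring a schema first
          :schema-flexibility :read})

(d/create-database cfg)
(def conn (d/connect cfg))

(d/transact conn [{:block/uid "abc" :block/string "Hello Datahike"}])

(d/q '[:find ?s
       :where [?e :block/uid "abc"]
              [?e :block/string ?s]]
     @conn)
;; expected: #{["Hello Datahike"]}
```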

We, the members of LambdaForge, are also big fans of the Zettelkasten method (even before we were aware of Roam) and use https://org-roam.readthedocs.io/en/latest/ at the moment. We would be super happy to see a reliable open source implementation like Athens to succeed, so keep going :100: !

I think ideally the backends should be exchangeable, so even if you decide on one, keep in mind where you are buying into its specific semantics.

jelmerderonde commented 4 years ago

Although I don't consider myself an expert in databases, I guess one of the (future) advantages of Datahike would be that it could potentially enable "local first" as described here: https://www.inkandswitch.com/local-first.html. For me this would be great to have in a tool like Athens because you could easily edit offline on multiple machines, while having confidence that your edits could later on combine seamlessly.

tangjeff0 commented 4 years ago

Thanks so much for sharing that link. Several engineers (including myself) are quite interested in local first applications. We've discussed databases like OrbitDB, Gun, and Scuttlebutt. Datahike is very interesting for this reason.

jelmerderonde commented 4 years ago

@tangjeff0 no problem. I guess Datahike isn't quite there yet, but maybe @whilo can share something about whether Datahike would allow a local-first workflow in the future?

whilo commented 4 years ago

Yes. Since our early work on http://replikativ.io/, which predated most of these other local-first approaches (but did not attract a large community back then, and did not have a nice programming model such as Datalog), we have wanted to be local-first. We aim to port Datahike back to ClojureScript in our next iteration. Do you think Open Collective would work to fund this work? Any help would be appreciated, as we are currently still hammering out Datomic compatibility and some scalability issues in the JVM version.

tangjeff0 commented 4 years ago

Will re-open after v1 is complete.

pepoospina commented 4 years ago

TL;DR;

Do you plan to support block-level access control and notifications/subscriptions? If so, how do you plan to do this? Maybe the DB is a deal-breaker.


Hi there. I've been discussing with @tangjeff0 a little bit on Twitter about your plans and how they could be linked with ours.

I also had some experience working with heavily nested and linked content with my previous project www.collectiveone.org and I have a couple of comments regarding the DB and how to handle the multi-player case:

I did this in Postgres the last time I tried and relied a lot on algorithmic recursion, so I navigated the DB in many directions before determining what to do, or who to send a message to. This was too slow. I am not an expert in big data systems, so I really wonder how these problems should be actually handled.

tangjeff0 commented 4 years ago

Another factor I'd like to point out is the conflict resolution story, whether it be distributed or centralized.

vHanda commented 4 years ago

Another option could be to use a Git repository as a backend. This would require creating a REST API on top of the Git repo to parse the documents, but it would result in greater compatibility with existing tools. One would easily be able to have the files locally, and even use other markdown editors or more advanced editors like Obsidian. And there is also a mobile app already ready (GitJournal - I'm the author).

This would result in a very different architecture though. I'm willing to help, if you want to go down this route. I would love more tools to be compatible with each other.

almereyda commented 4 years ago

You could also consider https://github.com/terminusdb/terminusdb-server

agentydragon commented 3 years ago

I'd just like to add that for me, Athens being open-source is a significant advantage over Roam, and if Athens ends up requiring a closed-source backend to be most useful, that advantage would be diminished.

Also it would be nice to abstract the backend-talking code to allow people to potentially run Athens on other backends, as long as they support some defined protocol.

tangjeff0 commented 3 years ago

A protocol is always most ideal but hardest to pull off. Crux, Datomic, DataScript, and Datahike will inevitably have some differences from each other.

Agree that a closed-source backend diminishes value. Inevitably parts of our infrastructure will be closed, but if there is a fully open-source full-stack solution for users to self-host, super great.
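
For what that backend-seam might look like, here is a sketch (all names hypothetical) of a small Clojure protocol Athens could code against, with a DataScript implementation behind it; Datahike, Crux, or Datomic records would satisfy the same protocol with their own client calls:

```clojure
;; Sketch: a hypothetical storage protocol for Athens. Nothing here is an
;; agreed interface; it only illustrates the "defined protocol" idea.
(require '[datascript.core :as d])

(defprotocol KnowledgeStore
  (transact-blocks! [store tx-data] "Write a batch of block entities.")
  (query [store datalog-query] "Run a Datalog query against the store.")
  (pull-block [store uid] "Fetch one block and its children by uid."))

(defrecord DataScriptStore [conn]
  KnowledgeStore
  (transact-blocks! [_ tx-data] (d/transact! conn tx-data))
  (query [_ q] (d/q q @conn))
  (pull-block [_ uid]
    ;; assumes :block/uid is declared :db.unique/identity in the schema,
    ;; so a lookup ref [:block/uid uid] resolves to the entity
    (d/pull @conn '[* {:block/children [*]}] [:block/uid uid])))

;; Usage sketch:
(def store
  (->DataScriptStore
    (d/create-conn {:block/uid      {:db/unique :db.unique/identity}
                    :block/children {:db/cardinality :db.cardinality/many
                                     :db/valueType   :db.type/ref}})))
(transact-blocks! store [{:block/uid "abc" :block/string "hello"}])
```

The hard part, as noted above, is that Pull syntax, temporal queries, and transaction semantics differ between stores, so the protocol would have to target their common subset.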

Also just learned about https://github.com/fluree/ from Matei. Clojure, Web3, open-source.

https://www.youtube.com/watch?v=uSum3uynHy4&feature=youtu.be

pepoospina commented 3 years ago

Hi there! I'm glad to see some movement here :slightly_smiling_face:

We have been working on an interface specification for our Athens-like app so that the backend is abstracted. We have also been working hard on a NodeJS + DGraph backend API that is open-sourced under an AGPL-like license.

I'd bet the interface supports (or will support) all the needs of Athens. Who knows! :muscle:. It includes backlinks and search features, granular access control (and thus multi-player), and fast data creation and fetching.

Reusing our backend, or just the interface, will also provide interoperability among our apps. Users will be able to embed and edit blocks from Athens in Intercreativity, for example. They can also "fork" them, as we want to support Git-like flows with content.

Oh, and eventually Athens could connect to other data storage solutions. We have prototypes for OrbitDB, Ethereum, Kusama, and IndexedDB (local).

This is a recent demo of our latest milestone (a simple case where users mix private with public content). We are about to release a new version where users can explore a feed of blog posts.

If you want to run it, this repo should run ok on Ubuntu or Mac. It is our latest development version.

Oh, and this is our discord in case you want to reach us. :wave:

mateicanavra commented 3 years ago

The video @tangjeff0 mentioned above covers both the broad vision and technical details of Fluree better than I could, but here's my quick take:

Fluree is an in-memory, semantic graph database backed by a permissioned blockchain, built with Clojure and open-source.

It can be containerized (with Kubernetes support) and optionally decentralized (e.g. using StorJ via Tardigrade), run as a standalone JVM service, or embedded inside the browser as a web worker. Read more here about the query server (fluree-db) and the ledger server (fluree-ledger)

Since Fluree extends RDF (the official W3C standard for data interchange), it immediately becomes interoperable with the linked open datasets on the semantic web. One interesting use case would be to directly query DBpedia or Wikidata from within Athens and combine it with your own data at runtime, without an API. Additionally, an RDF foundation means you can build ontologies with any of the modeling languages that build on top of it (RDFS, OWL, etc., which are official W3C recommendations), which opens up capabilities for inferencing and automated reasoning.

From my view, Fluree could be a powerhouse tool to strongly differentiate Athens from Roam and every other "tool for networked thought." Between RDF standards and a permissioned blockchain (which allows for block/cell-level access control), you could seamlessly and securely deploy Athens at an individual, team, or enterprise level using the same scalable infrastructure.

Would love to get the Fluree team's thoughts here...

lambduhh commented 3 years ago

@quoll I would like to advocate for the adoption of https://github.com/threatgrid/asami but feel like it would be better left up to the expert :) Athens is currently ClojureScript/re-frame/DataScript/posh (I'm working on sunsetting posh right now, actually).

What are your thoughts on whether Asami would be a good fit as a graph-DB for us?

Selfishly, I will admit I would LOVE the excuse to combine our open-source powers to leverage the benefits of bi-directional knowledge linking, use Asami in the wild, and possibly have the opportunity to work with you in a technical aspect to help implement it if we do end up going this way... and I don't think I'd be the only one!

quoll commented 3 years ago

Love to help. I hope to have Asami 2.0-alpha out by the end of the week. This will have storage when on the JVM. JavaScript is coming, but in the meantime it will have save/load functions. Unfortunately, Asami doesn’t have all the APIs of the other stores, e.g. the Pull API.
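
For those unfamiliar with Asami, here is a rough sketch of its connection-string style API as I understand it (the `:block/*` attributes are hypothetical, and since 2.0 is still pre-alpha the transact/query signatures should be checked against Asami's own docs):

```clojure
;; Sketch (assuming Asami's in-memory connection API): entities transact
;; and query much like the other Datalog stores discussed in this thread,
;; though, as noted above, without a Pull API.
(require '[asami.core :as a])

(def conn (a/connect "asami:mem://athens-sketch"))

;; transact returns a deref-able completion value
@(a/transact conn
   {:tx-data [{:block/uid "abc" :block/string "Hello Asami"}]})

(a/q '[:find ?s
       :where [?e :block/string ?s]]
     (a/db conn))
```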

agentydragon commented 3 years ago

I've looked a bit into Datahike. From what I learned it looks like:

As someone new to Clojure, this makes me less nervous about depending on a backend that has a Datomic-like API, and optimistic about Datahike, because it would still allow freedom in the backing storage system.

Tr3yb0 commented 3 years ago

@mateicanavra laid out Fluree for us very well in his comment above. I will elaborate a little on some of the points made and bring up one additional one, which is one of the most powerful parts of Fluree.

The foundation of RDF is intended to enable data interoperability across the semantic web and provides a very flexible data model for the applications built on top of it. Our immutable ledger brings both decentralization and horizontal scaling in the transaction tier, if that is needed, as well as the benefits of querying historical data states from earlier in the blockchain. We have segregated the query and transaction tiers, such that the query engine and an LRU cache of data can be loaded in-memory on the client device using a service worker. I would imagine that for a personal Athens graph, that may be the entire thing, which enables millisecond query responses. The db (query peer) is also linearly scalable, but I'm not sure that really applies to the use case here.

The biggest advantage Fluree brings is SmartFunctions. Because each transaction is encrypted with the user's private key, the data can be permissioned at the individual RDF element level. You could write the SmartFunctions in such a way that no one else would have access to them, and a user could share as desired.

refset commented 3 years ago

Noting this work-in-progress Datahike backend for the benefit of those following this issue: https://github.com/athensresearch/athens-backend

Also, I recently pulled together a comparison matrix for various Clojure-Datalog stores: https://clojurelog.github.io/