cloudstateio / cloudstate

Distributed State Management for Serverless
https://cloudstate.io
Apache License 2.0

Rename CRDT to something nicer #144

Open jroper opened 5 years ago

jroper commented 5 years ago

No one knows what CRDTs are, and they just introduce a layer of unfamiliarity. We should rename them, e.g., call them replicated entities. This would mean updating all of the APIs to use this terminology, as well as all the docs.

viktorklang commented 5 years ago

Let’s make things more confusing—EvEntities: Eventual Entities 😂

On Fri, 15 Nov 2019 at 16:51, James Roper notifications@github.com wrote:

No one knows what CRDTs are, and they just introduce a layer of unfamiliarity. We should rename them, e.g., call them replicated entities. This would mean updating all of the APIs to use this terminology, as well as all the docs.

-- Cheers, √

jsravn commented 5 years ago

I'd recognize CRDT and not "replicated entities". How would renaming it help understanding?

jroper commented 5 years ago

I'd guess you're not our target audience then. I'm in a room of 100 developers at a serverless computing talk right now, and my money is on this: if I were to stand up and ask "who has heard of CRDTs", only 10 people would put up their hands, and 9 of those would think I'm referring to Kubernetes CRDs.

marcellanz commented 5 years ago

They have no conflicts... so call them "Happy Entities". But seriously, does a developer need to know they're using CRDTs? Perhaps we could teach what they are good for, with examples. Everyone digging into the subject will find out we're talking about CRDTs. They should be easy enough to use that interest follows from their usefulness.

jsravn commented 5 years ago

@marcellanz I think CRDTs are a power feature, so most people wouldn't use them.

My point above is "replicated entities" is even less understandable than "CRDTs". At least you can google CRDT and get the wikipedia page, a medium article explaining them, and so on. Obfuscating what they are won't help anyone. If we're going to offer a higher level abstraction that is another thing and it would make sense to give it a new name.

jroper commented 5 years ago

I think CRDTs are a power feature, so most people wouldn't use them.

That's the state of the union today, without Cloudstate, but it's the opposite of what we're aiming to achieve with Cloudstate. If CRDTs are a power feature, they have no place in Cloudstate. Cloudstate is all about taking distributed state management patterns, things that previously were power features that only advanced users could take advantage of, and making them available to the masses. If we can't do that for CRDTs, we shouldn't offer support for them. And I would go further than that: I would say that if we can't do that for CRDTs, then Cloudstate is fundamentally flawed in its goals, and its goals are impossible. But I don't believe that to be the case.

So, do we need to mention CRDTs in order for users to be able to use them? I think a) if we do we're already failing at doing what Cloudstate is meant to achieve, and b) no. The users aren't implementing CRDTs themselves, it is actually a slightly higher abstraction, they never write a merge function, which is the most fundamental concept to a CRDT. They just use the data types offered. They take an ORMap, and use it, they don't need to know any of the mechanics of how it works, about the tagging, node vectors etc. They do need to understand the constraints on using it, that you can only remove something if you have observed it, that if one node updates an entry and another node concurrently removes it, the update will win, etc, but those constraints can be (and are in our documentation currently) enumerated. There's no need to explain why those constraints exist, the why is because it's a CRDT and the merge function is implemented in a certain way. We can certainly offer advanced documentation for that, but understanding that is not a prerequisite to using it.
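The two constraints mentioned above (you can only remove what you have observed, and a concurrent update beats a remove) can be illustrated with a much simpler relative of the ORMap, an observed-remove set. This is a minimal sketch under stated assumptions, not Cloudstate's implementation: the class name `ORSet`, its methods, and the string tag format are all hypothetical, and tombstones are kept forever here, which a robust implementation would need to avoid.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified observed-remove set of strings (illustration only). */
final class ORSet {
    private final String nodeId;
    private long nextTag = 0;
    // element -> unique tags of the adds that are still live
    private final Map<String, Set<String>> live = new HashMap<>();
    // tags whose adds have been removed (tombstones, never pruned in this sketch)
    private final Set<String> removed = new HashSet<>();

    ORSet(String nodeId) {
        this.nodeId = nodeId;
    }

    /** Every add gets a fresh tag, so a re-add is distinguishable from earlier adds. */
    void add(String element) {
        String tag = nodeId + ":" + (nextTag++);
        live.computeIfAbsent(element, k -> new HashSet<>()).add(tag);
    }

    /** Removes only the tags this replica has observed; unseen elements are a no-op. */
    void remove(String element) {
        Set<String> observed = live.remove(element);
        if (observed != null) {
            removed.addAll(observed);
        }
    }

    boolean contains(String element) {
        return live.containsKey(element);
    }

    /** Merge: union of tags and tombstones, then drop every tombstoned tag. */
    void merge(ORSet other) {
        removed.addAll(other.removed);
        other.live.forEach((element, tags) ->
            live.computeIfAbsent(element, k -> new HashSet<>()).addAll(tags));
        live.values().forEach(tags -> tags.removeAll(removed));
        live.values().removeIf(Set::isEmpty);
    }
}
```

If replica `a` re-adds `"x"` while replica `b` concurrently removes it, the remove only tombstones the tag `b` had seen, so after merging in either order both replicas still contain `"x"`. The user never writes this merge logic; they only need to know the resulting rule.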

Plenty of distributed databases implement the features they provide using CRDTs without mentioning this fact to users. We can do the same. Here's a replicated map, here's the constraints on using it, go make something awesome. Yes, we can mention that it's CRDTs underneath, and users who are interested can go and read up on that, but I don't think that understanding that is a prerequisite to using them.

I also think saying "this map is a CRDT" doesn't help users, you look up CRDTs and you see the abstract concepts about how merging is done, that says nothing about the particular implementation of the map CRDT that cloudstate offers, and therefore says nothing about how to use it. So, I don't think the fact that you can google CRDT helps.

jsravn commented 5 years ago

I agree it's a worthwhile effort to try and hide implementation details as long as they don't affect the user. With distributed state that is a very difficult thing to achieve in general. CRDTs may be the closest we can get to it, as they have fairly few edge cases (like tombstone generation in ORMaps, which a robust implementation should take care of). So I think it can make sense to build our own abstractions on them and name them in accordance with the simpler interface. As long as users are happy with the restrictions of CRDTs of course.

I have a lot of experience specifically with a distributed database (Cassandra) where misuse of its distributed data structures (like Lists or Maps) without understanding how they work leads to major problems (with tombstone generation). It doesn't use CRDTs under the hood though, so it's not a 1-to-1 comparison. But it does illustrate that we need to be careful not to hide the details that matter, and that, depending on how the implementation works, users can't always be ignorant of what's going on.

marcellanz commented 5 years ago

This issue and its discussion resonate with the first challenge Cloudstate states it will take on besides its reference implementation: the standards effort. Do you guys have a roadmap for the standards effort? It seems to me that this issue raises questions bottom-up from the implementation, which builds up or drives the standards effort.

This is especially interesting as other projects are aiming at similar goals and motivations. WDYT?

justinhj commented 4 years ago

Just my 2c on this: users who have trouble with the term CRDT will also have trouble with the names of the CRDTs themselves, such as GCounter, OrSet and so on. But users are generally quite happy with Redis data structure names such as sets, sorted sets, lists and hashes, which sound familiar to programmers.

So maybe if CRDT is given a friendly name, the same will need to be done for the individual CRDTs that users choose.

I think I agree with @jsravn that it would be nice to come up with a friendly name and build simple APIs around the most useful CRDTs.

Something like Shared Data Toolkit with an Increment Counter, Increment-Decrement Counter and so on.

Although I agree with @marcellanz that the user being able to google for CRDT and GCounter is also a compelling argument for just dealing with the real names.

jroper commented 4 years ago

Problem with something like Increment-Decrement Counter is what do you call it in the code? IncrementDecrementCounter is getting to be a very long name.

justinhj commented 4 years ago

Yes that's true :) and if you shorten it it ends up as cryptic as the real names

jboner commented 4 years ago

I usually talk about them as Replicated Datastructures (sometimes Eventually Consistent Datastructures), and always qualify with that when mentioning CRDTs. Akka calls them Distributed Data BTW.

I think settling on a name other than CRDTs, one that is more descriptive, is a good move. But then we should make sure to explain them in the docs as CRDTs (allowing people to Google it and dig deeper, and also getting indexed on CRDT by Google).

For CRDTs to be the smashing success they deserve to be, I think that we need to create a higher level of more targeted, use-case-specific CRDTs (building on and composing more low-level ones) that are given high-level descriptive names. For example, everyone understands what a ShoppingCart is, what to use it for, and how it could intuitively function in a distributed system.
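As a rough sketch of what such a use-case-specific type could look like, the following wraps per-node add and remove tallies (the idea behind an increment-decrement counter) behind a `ShoppingCart` API. Everything here is an assumption made for illustration: the class, its method names, and the choice to clamp quantities at zero are hypothetical and not part of Cloudstate.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cart built on per-node grow-only tallies (illustration only). */
final class ShoppingCart {
    private final String nodeId;
    // productId -> nodeId -> units added (or removed) by that node; tallies only grow
    private final Map<String, Map<String, Long>> added = new HashMap<>();
    private final Map<String, Map<String, Long>> removed = new HashMap<>();

    ShoppingCart(String nodeId) {
        this.nodeId = nodeId;
    }

    void addItem(String productId, long quantity) {
        bump(added, productId, quantity);
    }

    void removeItem(String productId, long quantity) {
        bump(removed, productId, quantity);
    }

    /** Units added minus units removed across all nodes, never below zero. */
    long quantity(String productId) {
        return Math.max(0, total(added, productId) - total(removed, productId));
    }

    /** Merge takes the per-node maximum, so it is commutative and idempotent. */
    void merge(ShoppingCart other) {
        mergeInto(added, other.added);
        mergeInto(removed, other.removed);
    }

    private void bump(Map<String, Map<String, Long>> tallies, String productId, long quantity) {
        if (quantity <= 0) {
            throw new IllegalArgumentException("quantity must be positive");
        }
        tallies.computeIfAbsent(productId, k -> new HashMap<>()).merge(nodeId, quantity, Long::sum);
    }

    private static long total(Map<String, Map<String, Long>> tallies, String productId) {
        Map<String, Long> perNode = tallies.getOrDefault(productId, Collections.emptyMap());
        long sum = 0;
        for (long units : perNode.values()) {
            sum += units;
        }
        return sum;
    }

    private static void mergeInto(Map<String, Map<String, Long>> mine,
                                  Map<String, Map<String, Long>> theirs) {
        theirs.forEach((productId, perNode) -> perNode.forEach((node, units) ->
            mine.computeIfAbsent(productId, k -> new HashMap<>()).merge(node, units, Long::max)));
    }
}
```

The caller only sees `addItem`, `removeItem`, `quantity` and `merge`; the counter mechanics stay an implementation detail, which is the point of the higher-level name.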

He-Pin commented 4 years ago

CRDTs is a very bad name for an introduction. Can you imagine what it is at first sight, without googling it?

michaelpnash commented 4 years ago

Summarizing: I think we've settled on "Replicated Entity" for the datastructure formerly known as CRDT

marcellanz commented 4 years ago

Do we update documentation and code for that? How should that be communicated in the use of such types? Would we use the term CRDT anymore at all?

ralphlaude commented 4 years ago

Nice, CRDT is renamed to Replicated Entity. I think the next steps here should be to use the new name in the code (language support, proxy and samples) and to adapt the docs. Is that right?

jboner commented 4 years ago

I agree that if we do a name change then it should be everywhere, including code. It is even more confusing if naming is not consistent. IMO, the only potential reference to the word "CRDT" should be in one single place, in the docs, where we explain what these Replicated Entities are and point to external sites for further reading for the people that want to dig deeper.

pvlugter commented 4 years ago

Yes, rename CRDT Entity to Replicated Entity everywhere. Definitely refer to CRDTs in the docs for further reading.

And then there are also the data structures in the language support user APIs, referred to as CRDTs or Crdts in both docs and code. We could use something more long form and aligned with replicated entity, such as replicated data types or replicated data structures instead of the CRDT initialism. And maybe use just ReplicatedData in code, so we have GCounter implements ReplicatedData or similar. Or we could keep using CRDT for the data structures themselves.
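To make the `GCounter implements ReplicatedData` idea concrete, here is a minimal sketch of how it might look. The interface shape, the generic parameter, and the method names are assumptions for illustration; they are not the actual Cloudstate or Akka API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical interface: anything replicated must say how two states merge. */
interface ReplicatedData<T extends ReplicatedData<T>> {
    /** Returns a new value combining this state with another replica's state. */
    T merge(T other);
}

/** Grow-only counter: each node only ever increases its own slot. */
final class GCounter implements ReplicatedData<GCounter> {
    private final Map<String, Long> counts;

    GCounter() {
        this(new HashMap<>());
    }

    private GCounter(Map<String, Long> counts) {
        this.counts = counts;
    }

    /** Returns a new counter with the given node's slot increased. */
    GCounter increment(String nodeId, long by) {
        if (by < 0) {
            throw new IllegalArgumentException("a GCounter can only grow");
        }
        Map<String, Long> next = new HashMap<>(counts);
        next.merge(nodeId, by, Long::sum);
        return new GCounter(next);
    }

    /** The counter value is the sum of all per-node slots. */
    long value() {
        long total = 0;
        for (long count : counts.values()) {
            total += count;
        }
        return total;
    }

    /** Per-node maximum: commutative, associative and idempotent. */
    @Override
    public GCounter merge(GCounter other) {
        Map<String, Long> next = new HashMap<>(counts);
        other.counts.forEach((node, count) -> next.merge(node, count, Long::max));
        return new GCounter(next);
    }
}
```

Merging in either order yields the same value, and merging the same state twice changes nothing; those properties are what let replicas converge without coordination.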

marcellanz commented 4 years ago

Will there be possibly ever any other replicated entities not based on CRDTs?

pvlugter commented 4 years ago

Will there be possibly ever any other replicated entities not based on CRDTs?

We may look at something like Akka's replicated event sourcing — which treats the event streams as CRDT-like. I think it's fine to have the current replicated entities, and figure out where something like replicated event sourcing fits in later on.

Okay, so let's rename CRDT Entity to Replicated Entity and CRDT to Replicated Data and generally refer to replicated data (types|structures) with links to CRDTs for reference.

There's some in-progress work on old CRDT entities, changes to the protocol (#479) and TCK additions. Could be good to have these merged first, and then I can go through the renaming.

jroper commented 4 years ago

While I agree that we should rename them to Replicated Entities, in the code, everywhere, I do think in the docs the term CRDT should feature prominently. Users need to understand that replicated entities aren't arbitrary data structures, they are constrained, they have a merge function, this merge function defines the semantics of how concurrent updates are handled - ie, they are CRDTs. When talking about the CRDT semantics, I think we shouldn't shy away from saying what they are. Replicated Entities are Cloudstate's implementation of CRDTs.

So for example, in the API docs for whatever we call an ORMap (assuming we don't just call it an ORMap, not sure what the plan is), users shouldn't have to go digging to learn that this is an ORMap CRDT. It should be just there, with docs reading something like this:

/**
 * A replicated map that supports addition and removal of keys, with values being replicated entities themselves.
 * 
 * This is an implementation of the Observed-Remove Map (OR-Map) CRDT. <description of ORMap semantics here>
 *
 * ....
 */

The point is that we don't try and hide from the user what they are doing. They will be more successful if they understand what a CRDT is, so while we don't want to push that in their face, we do want to give them a straightforward path to discovery so that they can understand exactly what they are doing when they are ready to. I think appropriate reference to CRDTs in the docs when describing the CRDT semantics is the key to doing that.

sleipnir commented 4 years ago

The point is that we don't try and hide from the user what they are doing. They will be more successful if they understand what a CRDT is, so while we don't want to push that in their face, we do want to give them a straightforward path to discovery so that they can understand exactly what they are doing when they are ready to. I think appropriate reference to CRDTs in the docs when describing the CRDT semantics is the key to doing that.

I agree with this proposal, and it was the conclusion that we reached in previous discussions. The term CRDT must not be hidden from users entirely, and the docs must clearly relate it to Cloudstate's replicated entities.

pvlugter commented 4 years ago

Yes, we definitely don't want to hide what they are. I think ReplicatedData is a good replacement in code and user APIs where we've been using Crdt, and is also the interface used in Akka. I see this as more about avoiding using the initialism everywhere, so that it's easier to understand, not about hiding the concepts.

And we can still have pointers to the more technical terms, such as specifically implementing state-based CRDTs, which are also called convergent replicated data types and abbreviated as CvRDTs, that they're delta state CRDTs, and so on.

jboner commented 4 years ago

I agree, James. That's a great point. The key is to strike the right balance, which I'm sure we can.