AntidoteDB / antidote

A planet scale, highly available, transactional database built on CRDT technology
https://www.antidotedb.eu
Apache License 2.0

Type Check #84

Closed aletomsic closed 8 years ago

aletomsic commented 9 years ago

The way I see it, there are two possible directions to take from here.

The first one, which I think is the best in terms of performance and elegance, would be to perform the type check as soon as a transaction starts. In other words, the type check would be the first task a tx coordinator performs: the tx coord would call materializer:type_check(Operations), and the materialiser would check, for each op, whether the parameters sent are valid. The current implementation presents two problems for performing a complete check: it does not embed the type in the key, so at this stage we cannot check that an operation's type matches the one stored for that key (if it exists); and it does not include the type in update operations.
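
As a rough sketch of what that could look like (the {Key, Type, Op} operation shape and the table of supported operations below are just assumptions for illustration, not the current code):

```erlang
%% Sketch only: every operation is assumed to arrive as a {Key, Type, Op} tuple,
%% where Op is an atom (e.g. increment) or a {Name, Args} tuple (e.g. {add, Elem}).
-module(materializer_sketch).
-export([type_check/1]).

type_check([]) ->
    ok;
type_check([{_Key, Type, Op} | Rest]) ->
    OpName = case Op of
                 {Name, _Args}           -> Name;
                 Name when is_atom(Name) -> Name
             end,
    case lists:member(OpName, supported_ops(Type)) of
        true  -> type_check(Rest);
        false -> {error, {unsupported_operation, Type, Op}}
    end.

%% Illustrative table only, not exhaustive.
supported_ops(riak_dt_gcounter) -> [increment];
supported_ops(riak_dt_orset)    -> [add, add_all, remove, remove_all];
supported_ops(_UnknownType)     -> [].
```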

The second one (and the easiest given where the implementation is now) would be to perform the type check when generating the downstream operation. This requires reading the key first in order to know its type, and then checking whether the operation is supported and the parameters sent by the update are valid. Checking at this point would be expensive, as many operations could already have generated their downstream. The whole process of starting the coordinator and sending correct updates to partitions would then be wasted, because the transaction is aborted due to a single wrongly typed or badly parametrised operation.

I could start with the second approach (which should be pretty easy) and then discuss implementing the first one. I await your comments.

cmeiklejohn commented 9 years ago

I believe the first approach is preferable, and that's what we should work on.

cmeiklejohn commented 9 years ago

Actually, thinking further through this. Why don't we do like we do with the map and treat the product of type and name as the key itself? So, you could have a Chris G-Counter as well as a Chris OR-Set. I think this greatly reduces the complexity. SwiftCloud's map also took this approach, IIRC.
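
As a tiny sketch of what that would mean for the key (an illustration only, not a concrete key format):

```erlang
%% The stored key is the pair of user-visible name and CRDT type, so the same
%% name with two different types simply names two different objects.
make_key(Name, Type) ->
    {Name, Type}.

%% e.g. make_key(<<"chris">>, riak_dt_gcounter) and
%%      make_key(<<"chris">>, riak_dt_orset) never collide.
```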

bieniusa commented 9 years ago

This is the approach that we discussed a couple of months ago.

Annette


marc-shapiro commented 9 years ago

I'm not sure this solves the problem. What stops me, once I have the Chris/Counter key, from misusing it as a Set?

                    Marc


cmeiklejohn commented 9 years ago

How would that happen? If you issued an operation to the server for the Chris Set, it's a different key, right?

marc-shapiro commented 9 years ago

Maybe I don't understand the problem.

What is the problem we are trying to solve?

                    Marc


aletomsic commented 9 years ago

Let me see if I got it right. Your (Chris) idea would be that you can have a "ChrisMap" key and a "ChrisCounter" key. Then:

  • on writes, an update should now include the type (in the key) and therefore we can perform the type check even before downstream generation.
  • on reads, there is no problem as the operations that need to be applied by the materialiser have already been "validated" for that key/type.

Another question: what should we do in the presence of a badly parametrised operation? Should we abort the transaction? Or just discard that operation and process the rest of the tx (I think this would show strange consequences)?

I await your comments.

Best,

Alejandro


marc-shapiro commented 9 years ago

Please, can we define the problem, before we discuss possible solutions?

Is the following a correct description of the environment?

  • An object has a type. This type has methods (interface + implementation). The goal of type checking is to ensure that a given object is accessed only via the correct interface/implementation pair.

  • The object has an ID (== key), which does not change over time; all replicas of the same object have the same ID. The type does not change over time, i.e. the ID <-> type mapping is fixed initially and never changes (or at least, not while the object is in use). I will assume that types are unique and independent, and that we are not trying to solve the subtyping problem.

  • The application uses an object via its ID. The application does not have direct access to the object's internals, only via the interface. The application lives in its own address space. We don't constrain the programming language, therefore its type-checking could be either static or dynamic.

  • The object lives in a persistent database. To access the object, it must first be materialised into an address space (a materialiser). Materialiser space is logically separate from the application. The materialiser is programmed in Erlang, which has only dynamic type checking.

If I understand correctly, the problem we are trying to solve is the following: when bringing the object into the materialiser, to bind it to the code for its methods; to create a communication channel between the app space and the materialiser space, with the correct serialisation/deserialisation stubs in both spaces; and possibly, to satisfy the type-checker in the app space.

Then it seems to me that the right time to type-check is when bringing the object from storage to materialiser, i.e. on the first read. This requires: either (1) that the type information can be extracted from the object (e.g. stored as a field of the object), or (2) the type information can be extracted from the object ID, i.e. the object ID has two parts, a type part and an instance part, and the two are not mixed together opaquely.

Chris, you propose to "treat the product of type and name as the key itself". To me, this sounds like you are hashing them together opaquely into the key, i.e. the type part cannot be readily extracted. If so, it does not satisfy the requirements of (2) above.

                                    Marc



aletomsic commented 9 years ago

I think you got it mostly right, but there is an important difference in the definition of the problem. You stated: "When bringing the object into the materialiser, to bind it to the code for its methods."

That is sort of solved in the current implementation: when there is an incorrect operation for the object type, we just omit it in the materialisation process.

The current problem is preventing the wrong operations from being stored in the log in the first place.

marc-shapiro commented 9 years ago

I still don't get it. If the object is bound to the correct code, how could the correct code log wrong operations?

                Marc
bieniusa commented 9 years ago

Is the following a correct description of the environment: An object has a type. This type has methods (interface + implementation). The goal of type checking is to ensure that a given object is accessed only via the correct interface/implementation pair.

The datastore stores objects of different datatypes. Each data type defines an interface through which the objects can be accessed / read and possibly modified. In our case, the interface is given in terms of messages that an object accepts and of messages that are sent as replies. (The underlying implementation should be hidden from the user / programmer.)

In case an object receives a message that is not part of its interface, the corresponding operation should not be applied to the object, as this would result in a type error. Instead, the corresponding transaction should be aborted and the sender should be notified of an error.

The object has an ID (== key), which does not change over time; all replicas of the same object have the same ID. Type does not change over time, i.e. the ID <-> type mapping is fixed initially and never changes (or at least, not while the object is in use). I will assume that types are unique and independent, and that we are not trying to solve the subtyping problem.

Yes.

The application uses an object via its ID. The application does not have direct access to the object's internals, only via the interface. The application lives in its own address space. We don't constrain the programming language, therefore its type-checking could be either static or dynamic.

Yes.

The object lives in a persistent database. To access the object, it must first be materialised into an address space (a materialiser). Materialiser space is logically separate from the application. The materialiser is programmed in Erlang, which has only dynamic type checking.

Hmmm. The materialiser is only a function which applies the messages to the object. The address space is the key space of the datastore. The problem does not stem from Erlang’s type checking but from the fact that the datastore and the application communicate via (untyped) channels.

If I understand correctly, the problem we are trying to solve is the following: When bringing the object into the materialiser, to bind it to the code for its methods. To create a communication channel between the app space and the materialiser space, with the correct serialisation/deserialisation stubs in both spaces, and possibly, to satisfy the type-checker in the app space.

What do you mean with „bringing the object into the materialiser“? The materialiser is stateless, its associated cache should (according to Alejandro) always have some object instantiation.

Then it seems to me that the right time to type-check is when bringing the object from storage to materialiser, i.e. on the first read. This requires: either (1) that the type information can be extracted from the object (e.g. stored as a field of the object), or (2) the type information can be extracted from the object ID, i.e. the object ID has two parts, a type part and an instance part, and the two are not mixed together opaquely.

What we have to do for the type-check, is to check if the method associated with the read/update message is applicable to an object. So, there must be a mapping from object to its type, as you describe, either by writing the information as part of the object, or as part of its key.

Chris, you propose to "treat the product of type and name as the key itself". To me, this sounds like you are hashing them together opaquely into the key, i.e. the type part cannot be readily extracted. If so, it does not satisfy the requirements of (2) above.

The idea here is that the user / caller only knows the ID/name for some object. To locate the object in the database / cache, we hash this object ID together with the expected type, thus "treating type and name as key". Either there is an object to be found under this hashed key, in which case we know that it has the correct type; or there is no object, in which case we know that the caller got the type (or the name) wrong, and this should result in an error. Additionally, this requires a typed object-initialisation method.
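
A minimal sketch of that lookup (purely illustrative, modelling the store as an Erlang map):

```erlang
%% Hash the user-visible name together with the expected type; a lookup under
%% the wrong type (or a wrong name) simply finds nothing.
locate(Store, Name, Type) ->
    HashedKey = erlang:phash2({Name, Type}),
    case maps:find(HashedKey, Store) of
        {ok, Object} -> {ok, Object};                 %% found => the type must match
        error        -> {error, wrong_type_or_name}   %% caller got type or name wrong
    end.
```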

Annette


tcrain commented 9 years ago

I sent the below message earlier, but I guess I didn't send it correctly, so am trying again:

On Thu, Nov 27, 2014 at 2:27 PM, Alejandro Zlatko Tomsic < notifications@github.com> wrote:

Let me see if I got it right. Your (Chris) idea would be that you can have a "ChrisMap" key and a "ChrisCounter" key. Then:

  • on writes, an update should now include the type (in the key) and therefore we can perform the type check even before downstream generation.
  • on reads, there is no problem as the operations that need to be applied by the materialiser have already been "validated" for that key/type.

This seems nice.

It is also nice for when two objects are created concurrently at different DCs, then you are guaranteed that concurrent creations of the same key are of the same type.

Maybe you already have different semantics for object creations though? Are you allowed to create objects at different DCs with the same key concurrently? Otherwise you could also embed the DC id in the key from the DC where the object was created?

Another question: what should we do in the presence of a bad parametrised operation? should we abort the transaction? or just discard that operation and process the rest of the tx (I think this would show strange consequences).

Abort seems to be the right thing to do.

And another question: how do you deal with reads that are performed after writes in the same transaction, i.e. TX_start, read(x), update(x), read(x), ..., TX_end? I guess the new value should be generated by the materializer and sent back to the client?

-Tyler


marc-shapiro commented 9 years ago

On 27 Nov 2014, at 17:42, bieniusa notifications@github.com wrote:

The materialiser is only a function which applies the messages to the object. The address space is the key space of the datastore. The problem does not stem from Erlang’s type checking but from the fact that the datastore and the application communicate via (untyped) channels.

You seem to be saying that we never need to worry about the concrete implementation of the datatype. That's true for the client, but not for the materialiser: it needs to respond to the interface messages concretely, so it needs an implementation.

If I understand correctly, the problem we are trying to solve is the following: When bringing the object into the materialiser, to bind it to the code for its methods. To create a communication channel between the app space and the materialiser space, with the correct serialisation/deserialisation stubs in both spaces, and possibly, to satisfy the type-checker in the app space.

What do you mean with „bringing the object into the materialiser“? The materialiser is stateless, its associated cache should (according to Alejandro) always have some object instantiation.

That's the whole issue, isn't it: ensuring that the materialised cache executes the correct implementation.

Then it seems to me that the right time to type-check is when bringing the object from storage to materialiser, i.e. on the first read. This requires: either (1) that the type information can be extracted from the object (e.g. stored as a field of the object), or (2) the type information can be extracted from the object ID, i.e. the object ID has two parts, a type part and an instance part, and the two are not mixed together opaquely.

What we have to do for the type-check, is to check if the method associated with the read/update message is applicable to an object. So, there must be a mapping from object to its type, as you describe, either by writing the information as part of the object, or as part of its key.

Chris, you propose to "treat the product of type and name as the key itself". To me, this sounds like you are hashing them together opaquely into the key, i.e. the type part cannot be readily extracted. If so, it does not satisfy the requirements of (2) above.

The idea here is that the user / caller only knows the ID/name for some object. To locate now the object in the data base / cache, we hash this object ID together with the expected type, thus "treating type and name as key“. Either there is an object under this hashed key to be found, then we know that it has the correct type.

I don't think this works correctly. There can be many implementations (concrete data types) that respond to the same interface. The application only knows the interface type, not the implementation type.

(In addition, I am very uncomfortable with identification by hashing with no way to check for hash collisions.)

                    Marc
cmeiklejohn commented 9 years ago

Am I correct in thinking the following is a valid issue in the current implementation?

Since keys are created when the first operation is performed on them, if we do not concatenate the key with the type, we run into the situation where, for concurrent creations of the same key with different types, the resulting type is determined by whichever operation arrives at the data store first, correct?

bieniusa commented 9 years ago

On 28.11.2014, at 10:47, marc-shapiro notifications@github.com wrote:

On 27 Nov 2014, at 17:42, bieniusa notifications@github.com wrote:

The materialiser is only a function which applies the messages to the object. The address space is the key space of the datastore. The problem does not stem from Erlang’s type checking but from the fact that the datastore and the application communicate via (untyped) channels.

You seem to be saying that we never need to worry about the concrete implementation of the datatype. That's true for the client, but not for the materialiser: it needs to respond to the interface messages concretely, so it needs an implementation.

I am not saying that we don’t need to worry about the concrete implementation. We have implementations for the different datatypes, and they should hopefully be correct according to the CRDT semantics. What we need to worry about is the mapping between messages and the CRDT update/read functions.

If I understand correctly, the problem we are trying to solve is the following: When bringing the object into the materialiser, to bind it to the code for its methods. To create a communication channel between the app space and the materialiser space, with the correct serialisation/deserialisation stubs in both spaces, and possibly, to satisfy the type-checker in the app space.

What do you mean with „bringing the object into the materialiser“? The materialiser is stateless, its associated cache should (according to Alejandro) always have some object instantiation.

That's the whole issue, isn't it: ensuring that the materialsed cache executes the correct implementation.

And testing that there is a corresponding implementation, and that the types of the parameters match.

Then it seems to me that the right time to type-check is when bringing the object from storage to materialiser, i.e. on the first read. This requires: either (1) that the type information can be extracted from the object (e.g. stored as a field of the object), or (2) the type information can be extracted from the object ID, i.e. the object ID has two parts, a type part and an instance part, and the two are not mixed together opaquely.

What we have to do for the type-check, is to check if the method associated with the read/update message is applicable to an object. So, there must be a mapping from object to its type, as you describe, either by writing the information as part of the object, or as part of its key.

Chris, you propose to "treat the product of type and name as the key itself". To me, this sounds like you are hashing them together opaquely into the key, i.e. the type part cannot be readily extracted. If so, it does not satisfy the requirements of (2) above.

The idea here is that the user / caller only knows the ID/name for some object. To locate now the object in the data base / cache, we hash this object ID together with the expected type, thus "treating type and name as key“. Either there is an object under this hashed key to be found, then we know that it has the correct type.

I don't think this works correctly. There can be many implementations (concrete data types) that respond to the same interface. The application only knows the interface type, not the implementation type.

No, every datatype needs its own interface. As you said, we shouldn’t try to solve polymorphism, subtyping, etc. here. The interface specifically names the concrete object type, e.g. gcounter_incr, orset_remove, etc. The application must be specific about the type.

(In addition, I am very uncomfortable with identification by hashing with no way to check for hash collisions.)

I agree with this, but it shouldn’t be a big deal to also add the type information to the object payload.

Annette

bieniusa commented 9 years ago


I am afraid so, yes.

Annette

tcrain commented 9 years ago

Some thoughts in case they are interesting:

I would say that everything should be sanitized and checked before it is stored to the log; otherwise you start getting into all sorts of security dangers (think SQL injection, etc.).

Since there is no centralized key manager, keys should be generated from components that can be made unique.

One way to identify an object would then be by matching its name, the application it is associated with, and its type (of course, a hash of this would only tell approximately where to find the object and should not be used for identification). If you don't have this information, then you cannot update the object.

When an update is performed, the server will check that the client has the rights to access the keys of this application, and that the operation is valid for the provided type, before logging.

This allows concurrent users of the same application to create the same object at the same time at different DCs (which may be nice if you want the programming model to allow something like this). Otherwise, if you don't want this, the id of the DC where the object was created (i.e. where the first update happened) can also be added as part of its identification (and also has to be checked/sanitized).
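
Purely as an illustration (none of these names correspond to an existing Antidote structure), such an identifier could look like:

```erlang
%% Illustrative composite identifier; all field names are assumptions.
-record(object_id, {name      :: binary(),   %% application-chosen object name
                    app       :: atom(),     %% application the object belongs to
                    type      :: module(),   %% CRDT module, e.g. riak_dt_gcounter
                    origin_dc :: term()}).   %% optional: DC of the first update
```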

-Tyler


marc-shapiro commented 9 years ago

I don't think this works correctly. There can be many implementations (concrete data types) that respond to the same interface. The application only knows the interface type, not the implementation type.

No, every datatypes need its own interface. As you said, we shouldn’t try to solve polymorphism, sub-typing, etc. here. The interface specifically contains the concrete object type. E.g. gcounter_incr, orset_remove etc. The application must be specific about the type.

There can be many implementations of gcounter and therefore of gcounter_incr. We need to map the implementation of the data, brought in from disk, to the right implementation of the code. There is no way the client could predict what the correct concrete type is.

        Marc
bieniusa commented 9 years ago

On 28.11.2014, at 20:45, marc-shapiro notifications@github.com wrote:

I don't think this works correctly. There can be many implementations (concrete data types) that respond to the same interface. The application only knows the interface type, not the implementation type.

No, every datatypes need its own interface. As you said, we shouldn’t try to solve polymorphism, sub-typing, etc. here. The interface specifically contains the concrete object type. E.g. gcounter_incr, orset_remove etc. The application must be specific about the type.

There can be many implementations of gcounter and therefore of gcounter_incr. We need to map the implementation of the data, brought in from disk, with the right implementation of the code. There is no way the client could predict what is the correct concrete type.

Is this really a problem? The client does not need to predict anything about the implementation; it just issues an operation. Referring to an interface allows the implementation details to be kept hidden.

Within a DC or even across DCs, I think we can safely assume that we only have one implementation per datatype, no?

Annette
marc-shapiro commented 9 years ago

On 1 Dec 2014, at 10:50, bieniusa notifications@github.com wrote:

There can be many implementations of gcounter and therefore of gcounter_incr. We need to map the implementation of the data, brought in from disk, with the right implementation of the code. There is no way the client could predict what is the correct concrete type.

Is this really a problem? The client does not need to predict anything about the implementation, it just issues an operation.

The end client doesn't care, of course, but the key does come from the client.

The materializer does care, a lot, and has to be able either to extract the information from the key received from the client, or to store the information in some metadata associated with the key.

[...].

Within a DC or even across DCs, I think we can safely assume that we only have one implementation per datatype, no?

We could make this simplifying assumption, but it's a very dangerous one, because it means we can't evolve implementations over time, nor support two implementations of the same abstraction. Reminds me of Excel not being able to open two files that happen to have the same name...

                                            Marc
marc-shapiro commented 9 years ago

The way type-check works in Riak 2.0 is (if I understand correctly) that a given bucket can contain only objects of a certain type. I.e. the type of an object can be discovered by querying its bucket. This is a simple implementation of storing type information in metadata associated with the key.

Why not follow their example?

                    Marc

cmeiklejohn commented 9 years ago

This is correct. Each bucket can only store one type -- in the case of the map, the map needs to store type information as well since it can embed different types.

I'm fine with this approach; it's straightforward and works.

However, it should be clear that the type information in the bucket is only used by the system itself to know how to operate over the data on disk -- we never return the CRDT data structures back to the user -- just the result of the query method as called by the system.

bieniusa commented 9 years ago

Well, this is conceptually exactly what we were proposing.

Annette


bieniusa commented 9 years ago

I think this is conceptually the same as what we were proposing - on a per-key basis, as we don’t have buckets in Antidote.

Annette


aletomsic commented 9 years ago

I am retaking this thread to propose a solution to the problem. The solution depends on the issue regarding the interface.

Proposed solution for the new interface (see Deepthi’s thread “Antidote application/materialiser interface”):

If I understand correctly, the new interface will:

1) create buckets containing objects of the same type (how will this be implemented?);
2) bind an object to a certain type on its creation;
3) not take the object type as a parameter to read and update operations, as it will be embedded in the object's bucket.

Solution: Whenever an object is handled inappropriately, the transaction will abort. Inappropriate handling includes (please contribute to completing this list if I am missing anything) the cases listed below, together with how I plan to implement each check.

Remark: binding an already bound key simply generates a new key in a different bucket.

Implementation:

  • Binding a key to a type that does not exist: issue a type:new() operation at the coordinator. In case of success, continue; otherwise, abort the txn.

  • Using an undefined method for a given type (e.g. trying to increment a set), or using a defined method with incorrectly typed parameters: for these cases, issue a type:operation(params) at the coordinator; abort if it fails.

  • Issuing an operation on an unbound key: this is the only tricky check, as it requires verifying that the object exists, which means contacting the partition that stores the object. My suggestion is to perform this check optimistically, at downstream generation. It could be too late, but this way we would not incur significant overhead. For operations that do not require downstream generation, this check would need to perform a read of the object. If the object does not exist (has not been bound), the partition will send an abort message to the coordinator (this is done at the prepare phase, when partitions vote on committing or aborting a txn), which will then abort the transaction.
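
For the first of these checks, a rough sketch (the bind tuple shape is made up, and CRDT modules are assumed to follow the riak_dt convention of exporting new/0):

```erlang
%% Sketch only: binding a key to a non-existent type fails because calling
%% new/0 on an unknown module raises undef; the coordinator then aborts the txn.
precheck_bind({bind, _Key, Type}) ->
    try Type:new() of
        _EmptyObject -> ok
    catch
        error:undef -> {error, {unknown_type, Type}}
    end.
```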


aletomsic commented 9 years ago

Annette, Marc,

The previous message explains how I am going to do it. Starting now. Should not take much time.

aletomsic commented 9 years ago

pull request #142

What I've done:

In the previous email I pointed out the problems that I am going to tackle:

1) binding a key to a type that does not exist;
2) using an undefined method for a given type, e.g. trying to increment a set;
3) using a defined method with incorrectly typed parameters;
4) issuing an operation on an unbound key.

1) and 4) cannot be solved right now, as that functionality is not yet included; Deepthi is working on the new interface, which will provide it. Once the new interface is in place, these must be addressed.

2) and 3) are addressed by this solution. I've created a function in the materialiser module which verifies, for each operation being issued, that it is correctly typed, by performing that operation on an empty CRDT. This function is now called in antidote.erl, before starting a coordinator FSM.
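
Roughly, the check works like the following sketch (names and tuple shapes are illustrative; the CRDT modules are assumed to follow the riak_dt convention of new/0 and update/3):

```erlang
%% Sketch: validate an operation by applying it to an empty CRDT of the declared
%% type; anything that raises or returns an error is rejected before a
%% transaction coordinator is ever started.
check_operations([]) ->
    ok;
check_operations([{_Key, Type, Operation} | Rest]) ->
    try Type:update(Operation, check_actor, Type:new()) of
        {ok, _NewState}  -> check_operations(Rest);
        {error, Reason}  -> {error, {invalid_operation, Type, Operation, Reason}}
    catch
        _:_ -> {error, {invalid_operation, Type, Operation}}
    end.
```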

cmeiklejohn commented 9 years ago

Open question:

Why does this functionality exist in the materializer? This means that the operation can exist in the log, but not be applied. Why are we not preventing these operations from being inserted into the log?


aletomsic commented 9 years ago

On Monday, 14 September 2015, Christopher S. Meiklejohn notifications@github.com wrote:

Open question:

Why does this functionality exist in the materializer?

It exists in the materializer library, not in the vnode. It is done there because the check depends on the object type and on its operations and their parameters.

This means that the operation can exist in the log, but not be applied.

No, it can never reach the log, as the check is performed before sending an operation to a transaction coordinator.

Why are we not preventing these operations from being inserted into the log?


cmeiklejohn commented 9 years ago

Ah, alright. Do we need to worry about concurrent transactions over the same key with different data types, given they won't merge correctly?

aletomsic commented 9 years ago

This case should be solved when we start embedding the type within an object’s key. Is that part of the interface change Deepthi is working on?


cmeiklejohn commented 9 years ago

:+1:, embedding the type in the key is a great solution.