bedeho opened 2 years ago
Thanks a lot for this writeup @bedeho, really helpful, now everything has clicked for me.
One quick concern: does a serialized protobuf message actually identify itself somehow? That is, assuming we have two separate messages that coincidentally share the same list of properties, with exactly the same types, is it possible to tell which kind of message it is?
Given
```protobuf
message CreatePost {
  uint64 thread_id = 1;
  string body = 2;
}

message DeleteThread {
  uint64 thread_id = 1;
  string rationale = 2;
}
```
won't `CreatePost(1, 'test')` and `DeleteThread(1, 'test')` serialize to the same bytes?
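To make the concern concrete, here is a small hand-rolled fragment of the protobuf wire format (varints plus length-delimited fields only; not a real protobuf library, just a sketch of the encoding rules) showing that two messages with the same field layout serialize to identical bytes:

```typescript
// Encode a non-negative integer as a protobuf varint (7 bits per byte,
// high bit set on all bytes except the last).
function encodeVarint(n: number): number[] {
  const out: number[] = [];
  do {
    let byte = n & 0x7f;
    n >>>= 7;
    if (n !== 0) byte |= 0x80;
    out.push(byte);
  } while (n !== 0);
  return out;
}

// Field tag = (field_number << 3) | wire_type; wire type 0 = varint.
function varintField(fieldNumber: number, value: number): number[] {
  return [...encodeVarint((fieldNumber << 3) | 0), ...encodeVarint(value)];
}

// Wire type 2 = length-delimited (strings, bytes, nested messages).
function stringField(fieldNumber: number, value: string): number[] {
  const bytes = Array.from(value, c => c.charCodeAt(0));
  return [...encodeVarint((fieldNumber << 3) | 2), ...encodeVarint(bytes.length), ...bytes];
}

// CreatePost { uint64 thread_id = 1; string body = 2; } with (1, "test")
const createPost = [...varintField(1, 1), ...stringField(2, "test")];
// DeleteThread { uint64 thread_id = 1; string rationale = 2; } with (1, "test")
const deleteThread = [...varintField(1, 1), ...stringField(2, "test")];

console.log(createPost.join(",") === deleteThread.join(",")); // true
```

Only field numbers, wire types, and values appear on the wire; the message name does not, so the two payloads are byte-for-byte identical.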
I thought of this from an efficiency point of view as well: it would be better to be able to look up what the message was supposed to be in constant time, using some sort of message ID value and some sort of lookup table, rather than this trial-and-error approach, but here my protobuf knowledge ends. This starts to become really important once people add their own messages to support lots of things a given query node does not recognize, so that it can efficiently skip messages. I assumed that they were indeed self-identifying, as per your question, but I don't actually know.
@mnaamani @Lezek123
A great question raised, @kdembler.
@bedeho determining the type of a message by some sort of message ID in constant time seems a feasible solution. We can standardize the message format, e.g. every message should have a mandatory first `type_id` field, unique across the message set:
```protobuf
message CreatePost {
  uint32 type_id = 1;
  uint64 thread_id = 2;
  string body = 3;
}
```
For each serialized message, we can read the first few bytes, then decode them to get the message ID (I did something similar in the creator payouts PR). It is much better than the trial-and-error approach, especially if the Events message set is huge.
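The constant-time peek could look something like this minimal sketch (my own illustration of the wire format, assuming the proposed convention that every message starts with a `uint32 type_id = 1` field; `peekTypeId` is a hypothetical helper, not protobuf API):

```typescript
// Decode a protobuf varint starting at `offset`; returns [value, nextOffset].
function decodeVarint(buf: number[], offset: number): [number, number] {
  let value = 0;
  let shift = 0;
  while (true) {
    const byte = buf[offset++];
    value |= (byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) break;
    shift += 7;
  }
  return [value, offset];
}

// Read the type_id without deserializing the whole message. Returns null
// if the payload does not start with field 1 encoded as a varint.
function peekTypeId(buf: number[]): number | null {
  const [tag, next] = decodeVarint(buf, 0);
  if (tag !== ((1 << 3) | 0)) return null; // expect tag of `uint32 type_id = 1`
  const [typeId] = decodeVarint(buf, next);
  return typeId;
}

// A serialized message starting with type_id = 7:
// 0x08 0x07 = field 1 (varint) value 7, followed by the rest of the fields.
console.log(peekTypeId([0x08, 0x07, 0x10, 0x01, 0x1a, 0x02, 0x68, 0x69])); // 7
```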
I would really want to avoid this sort of thing, as we would be overriding native serialization features of protocol buffers. We should explore what the advised approach is here; we are not the first party to consider this. I would say it would be better to take the naive approach rather than this sort of overriding, as the computational cost is actually unlikely to be of any significance: this sort of deserialization is very fast, and will fail fast too, I assume.
@bedeho
if we even keep it: council blog.
Are there plans to remove this feature? Or am I misunderstanding?
@zeeshanakram3 @kdembler
Ok, so I got confirmation from the guys that the serialization format indeed does not identify messages directly, and in fact @Lezek123 is already dealing with this when it comes to different messages relating to working group status updates (status vs upcoming opening).
So just to state it clearly: when a message is serialized, only the field values are reflected. Hence if someone attempts to deserialize using a message type whose sequence of fields can somehow be recovered from this serialization (the trivial case being an identical message), the deserialization will succeed, and mappings will break.
The complication here is that we precisely want an open set of third parties to introduce their own messages & protocols, so some sort of coordination will be required. I believe @zeeshanakram3's solution is unavoidable: all messages must have a message identifier field, and there has to be a global repository of occupied message identifiers, so that different use cases can avoid stepping on each other's toes. This could, for example, be part of the constitution of the DAO, so that it is public and subject to some coordination, but fundamentally it is still permissionless.
I will update the example above to reflect this.
Thank you to both of you for highlighting 👍
Are there plans to remove this feature? Or am I misunderstanding?
At least as a runtime module, but yes, even just as a metaprotocol, I don't see us adding proposals for this, nor adding support in Pioneer.
@bedeho it seems like we can use an out-of-the-box feature provided by Protobuf to determine the type of the message from the serialized payload. Protobuf provides the `oneof` primitive to select at most one field/message from a set of given fields/messages.

The `oneof` primitive uses (almost) the same approach we are exploring here (i.e., storing type ID information along with the serialized message). While decoding, protobuf uses the field number info from the message schema definition to deserialize raw bytes into the correct type. Moreover, the `oneof` primitive provides compile-time guarantees that no conflicting field numbers exist amongst the given message set, while achieving that with custom message IDs requires some sort of manual intervention.

Let's compare the protobuf schemas for the two approaches side by side (limited only to two types for better readability).
**Custom Message IDs:**

```protobuf
message CreatePost {
  required uint32 message_id = 1;
  required uint64 thread_id = 2;
  required string body = 3;
}
```

**`oneof` approach:**

```protobuf
message CreatePost {
  required uint64 thread_id = 1;
  required string body = 2;
}
```
To ensure that, after adding a new message into the oneof set, the new code keeps parsing the old serialized payloads/events correctly, the field numbers of already existing messages must not change.

If some message is removed from the oneof set, then the field number used by that message must be rendered unusable to avoid unpredictable parsing results. Data corruption and/or bugs can occur if the field number is reused by some new identical message after being freed, since deserialization wouldn't fail when it should.
This can be achieved using the `reserved` fields feature. Given the `MemberRemarked` message, field number 3 can be reserved as:
```protobuf
message MemberRemarked {
  reserved 3;
  oneof member_remarked {
    CreatePost create_post = 1;
    DeleteThread delete_thread = 2;
  }
}
```
Are there plans to remove this feature? Or am I misunderstanding?
At least as a runtime module, but yes, even just as a metaprotocol, I don't see us adding proposals for this, nor adding support in Pioneer.
So the council blog feature would cease to exist? (I will just ask this here, since it will clarify whether it's worth making a separate issue or not.)
This is really not the place for this topic; this issue is meant as an example issue that people use as a reference.
The oneof primitive uses the (almost) same approach we are exploring here (i.e., storing type Id information along with the serialized message). While decoding, protobuf uses the field number info from the message schema definition to deserialize raw bytes into the correct type.
@zeeshanakram3
How does this deal with the case where different people have different sets of messages they are sending over the same extrinsic? It seems that with this approach you could still mistakenly process a message with different semantics that you did not know about, because someone else is running some other protocol that you don't know about.
It seems that with this approach you could still mistakenly process a message with different semantics, that you did not know about
@bedeho Yes, since without proper communication/standardization there is no way of knowing how users/devs have defined their message sets.

The remedy is: as long as people don't use conflicting field numbers for different messages having identical binary representations, the mappings won't fail.
Suppose users A and B have their own independent message sets, and are sending messages over the same extrinsic. Also, the field numbers used by the messages of user A are in range m..n, and by the messages of user B are in range x..y. Since these ranges are disjoint, user A would simply ignore the messages serialized by B, and vice versa.
Again, the benefits of this approach are that, first, it is not an anti-pattern technique, and second, the responsibility of assigning the correct value to the message id field during serialization is delegated from the user to the code.
there has to be a global repositories of occupied messages, so that different use cases can avoid stepping on each others toes.
As you said, essentially this would be required.
The remedy is: as long as people don't use conflicting field numbers for different messages having identical binary representations, the mappings won't fail.
Can you give an example?
**User A messages:**

```protobuf
message CreatePost {
  required uint64 thread_id = 1;
  required string body = 2;
}
```

**User B messages:**

```protobuf
message CreateComment {
  required uint64 video_id = 1;
  required string body = 2;
}
```
@bedeho Assuming the sample messages above: although `CreatePost` and `CreateComment` would have identical binary representations, the field numbers used by the messages of user A (`[11, 12]`) and user B (`[21, 22]`) are not conflicting, so the mappings won't fail. If user A receives user B's messages, it can ignore them because it does not recognize the field numbers, and vice versa.
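As a sketch of this skipping behavior (my own illustration, not code from the thread; the field numbers 11/12 and 21 are the hypothetical example values from the discussion above, not real schema constants), user A's decoder can read the outer tag, extract the field number, and ignore numbers outside its own set:

```typescript
// Decode a protobuf varint starting at `offset`; returns [value, nextOffset].
function decodeVarint(buf: number[], offset: number): [number, number] {
  let value = 0;
  let shift = 0;
  while (true) {
    const byte = buf[offset++];
    value |= (byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) break;
    shift += 7;
  }
  return [value, offset];
}

// Returns the recognized message name, or null to signal "ignore".
function classifyForUserA(buf: number[]): string | null {
  const [tag] = decodeVarint(buf, 0);
  switch (tag >>> 3) { // the upper bits of the tag are the oneof field number
    case 11: return "CreatePost";
    case 12: return "EditPost";
    default: return null; // unknown field number: someone else's protocol
  }
}

// Tag for field 11, wire type 2: (11 << 3) | 2 = 90 (one varint byte).
console.log(classifyForUserA([90, 0x00])); // "CreatePost"
// Tag for field 21, wire type 2: (21 << 3) | 2 = 170 → varint [0xaa, 0x01].
console.log(classifyForUserA([0xaa, 0x01, 0x00])); // null (ignored)
```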
@zeeshanakram3
So in both approaches (embedded type field vs. oneof with unique field IDs), there would need to be a globally maintained list of message IDs, and new devs would have to make sure not to use old IDs for their new messages, and also share these new IDs with the community?
If that is the case, then what is the benefit of the latter over the former?
I can think of the following, but not sure if those are the primary ones:

1. Simpler deserialization code, avoiding trial-and-error chains like

```
if (x = CreateComment::deserialize(buffer) && x.type_id == CREATE_COMMENT_ID)
  ...
else if (...)
  ...
```

as _perhaps_ the standard proto library can deserialize into some enumerated representation mirroring `MemberRemarked`? **Again, I am just speculating.**
2. The effective ID of each message is part of the schema itself, rather than living in some separate place that everyone must keep up to date with, and then import into their client side code when they want to work with such messages.
Anything else?
Would it be worth trying to write some test code that just confirms that you are right about this specific scenario? Just to see that there would be no confusion between messages of A and B?
So in both approaches (embedded type field vs. oneof with unique field IDs), there would need to be a globally maintained list of message IDs, and new devs would have to make sure not to use old IDs for their new messages, and also share these new IDs with the community?
Yes
If that is the case, then what is the benefit of the latter over the former?
Technically/performance-wise, probably not any; in fact, both approaches would have identical binary representations (the same number of bytes when serialized). It's just that the latter approach is easier to work with and less bug-prone: when creating a message you don't need to set metadata information (which in the case of the former approach would be the `message_id` field), e.g.
**Embedded type field:**

```typescript
const msg: CreateComment = {
  message_id: CREATE_COMMENT_ID,
  video_id: 1,
  body: "some comment",
}
CreateComment.encode(msg)
```

**Oneof set:**

```typescript
const msg: CreateComment = {
  video_id: 1,
  body: "some comment",
}
// here when serializing, the protobuf code automatically prepends the
// field number of `CreateComment` in the MemberRemarked oneof set
MemberRemarked.encode(msg)
```
as perhaps the standard proto library can deserialize into some enumerated representation mirroring MemberRemarked?
Does the following from the protobuf docs answer this question: "You can check which value in a oneof is set (if any) using a special `case()` or `WhichOneof()` method, depending on your chosen language."
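To illustrate what those generated accessors give you, here is a self-contained sketch (not real protobuf API; the discriminated-union shape and `whichOneof` helper are my own stand-ins): after decoding a `MemberRemarked`, exactly one oneof member (or none) is set, and the dispatch code can branch on which one.

```typescript
// Model of a decoded MemberRemarked: exactly one oneof member is set,
// or none at all (e.g. an unrecognized payload).
type MemberRemarked =
  | { kind: "createPost"; threadId: number; body: string }
  | { kind: "deleteThread"; threadId: number; rationale: string }
  | { kind: "none" };

// Mirrors the role of protobuf's WhichOneof(): reports which member is set.
function whichOneof(msg: MemberRemarked): string | null {
  return msg.kind === "none" ? null : msg.kind;
}

console.log(whichOneof({ kind: "createPost", threadId: 1, body: "hi" })); // "createPost"
console.log(whichOneof({ kind: "none" })); // null
```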
Would it be worth trying to write some test code that just confirms that you are right about this specific scenario? Just to see that there would be no confusion between messages of A and B?
I will write the code and share the findings.
Excellent, and not having to set the type values manually when constructing messages is actually quite huge.

If your test works out, let us go for this for now. You will be the first to try this; if it works, we will update the example, or probably create a new one, that the rest of the team can follow, such as @kdembler, who is going to do something similar with playlists very soon, I understand.
Tested the given scenario for the messages of users A & B described here. For the testing code, I changed user A's & B's `MemberRemarked` message name to `MemberRemarkedA` & `MemberRemarkedB` respectively.
For a detailed description of different cases please have a look at this code gist.
https://gist.github.com/zeeshanakram3/1fcd75ab5da9b01a710bd41ac2eb3eea
During deserialization, both users successfully ignore messages whose type values are unknown to them. Here "ignore" means that the `decode` function returns an empty object.
Nice!
I think one big piece that we haven't addressed yet is error handling. I was thinking about this simple approach:
I think this should be quite simple both on QN and client side. Any thoughts?
Edit: Actually, thinking about this, success states are as important as error states. So, for example, the ability to get the ID of a newly created comment, etc.
Yeah, I think something like this can work fine. When will the poll on step 4 be done? Can you do it based on some event that tells you the processor passed the block where the tx was included?
Yes, it will be after the block is processed, we already wait for that
It actually won't be polling, just a single request, since if the metaprotocol event is not there already, something went bad. We can still retry for some time though.
Background
We will start to use the metaprotocol approach, as described here
https://github.com/Joystream/joystream/issues/1990
in various places of our current chain where the runtime is presently doing the validation. Primary candidates are currently
We assume that this has been implemented in the runtime:
https://github.com/Joystream/joystream/issues/2919
so we have extrinsics akin to
```rust
WorkingGroup::worker_remark(origin, worker_id, message: Vec<u8>)
WorkingGroup::lead_remark(origin, message: Vec<u8>)
Membership::member_remark(origin, member_id, message: Vec<u8>)
```
which have corresponding events
```rust
WorkingGroup::WorkerRemarked(worker_id, message: Vec<u8>)
WorkingGroup::LeadRemarked(message: Vec<u8>)
Membership::MemberRemarked(member_id, message: Vec<u8>)
```
The main idea here is to have message buses that are already authenticated by the chain, thereby allowing any metaprotocol processing logic to avoid signature verification and custom nonce management.
Purpose
The purpose of this example is to make it very tangible what exactly it would look like for someone to take this approach.
Example
Domain
Let us start with a very basic kind of forum
Messages
Based on the domain above, we can suggest a rather plausible set of message formats which will be broadcast using the extrinsics mentioned above, but note:

The message id fields are populated with a unique, fixed, publicly known value for each message during serialization. When deserializing a buffer, even if the process succeeds for a given message type, the id must still be checked to confirm a match.
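A minimal sketch of this check-after-decode rule (my own illustration; `CREATE_POST_ID`, the decoded shape, and `acceptCreatePost` are hypothetical names, not part of the actual schema or generated code):

```typescript
const CREATE_POST_ID = 1; // hypothetical value from the public registry

interface CreatePost {
  typeId: number;
  threadId: number;
  body: string;
}

// A mapping would call this after protobuf deserialization succeeds:
// a successful decode alone is not enough, the embedded id must match.
function acceptCreatePost(decoded: CreatePost | null): CreatePost | null {
  if (decoded === null) return null;                  // buffer did not parse
  if (decoded.typeId !== CREATE_POST_ID) return null; // id mismatch: ignore
  return decoded;
}

// A payload that happens to parse as CreatePost but carries another id
// is rejected rather than processed:
console.log(acceptCreatePost({ typeId: 9, threadId: 1, body: "test" })); // null
```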
Query Node Schema
Ideally, a fully featured forum would have no impact on the schemas as we currently have them in Olympia, with the exception of likely simplifications as a result of being able to drop the
https://github.com/Joystream/joystream/blob/olympia/query-node/schemas/forum.graphql
but here I will include a succinct version corresponding to the simplified model we have here.
There are also a bunch of event entities which should also be maintained, but I am skipping them for now.
Query Node Mappings
The query node mappings for each of the events above will proceed to attempt to decode the `message` payload with one of the supported messages:

- `WorkingGroup::WorkerRemarked` looks for `DeleteThread`
- `Membership::MemberRemarked` looks for `CreateThread`, `EditThread`, `CreatePost`, `EditPost`
Be aware that it is entirely possible that the `message` cannot be decoded into any supported message, in which case it should just be ignored. It is also critical to ensure that the right working group is being processed.

Another important point is that to make certain kinds of actions feasible, like deleting an entire thread in one message (which previously was not possible on-chain), one must exploit the native filtering capabilities of the underlying database in order to make it computationally efficient. In some cases, depending on the use case, this may warrant introducing redundant information that makes this sort of processing faster.

A final concern to be aware of is that significant validation will be required in these mappings, compared to normal mappings. In normal mappings, you can trust that the validation was done by the runtime, so all entities being referenced are known to exist and to be accessible to the caller; however, this is no longer the case. This means you cannot blindly trust that, for example, the author issuing `EditThread` is referring to a thread that actually exists, or that this caller actually is the author.

DevUX
Notice that when a metatransaction fails, because the processor rejects it, there is no event returned from the full node or the query node. There is also no clear success event. The lack of such an event, and corresponding error code or new IDs created, can be a source of complexity for applications. One can try to introduce metaprotocol events as entities, but they are not sent to the caller via any sort of subscription interface.
There is a way to address this: introducing subscriptions into Hydra that mappings can write to, but this does not currently exist. Read more about the DevUX pains of the metaprotocol approach, and ideas for what we can do about them longer term, here:
https://github.com/Joystream/joystream/issues/2018