cloudevents / spec

CloudEvents Specification
https://cloudevents.io
Apache License 2.0
5.02k stars 580 forks source link

make eventId an optional field #326

Closed jric closed 5 years ago

jric commented 5 years ago

In our organization eventId is an optional field. Therefore, we do not use the CloudEvent spec as defined. I would prefer to fix that impedance. I hope either I can convince cloudevent spec to change or someone here can convince me that we're doing it wrong.

Reasons to make it optional:

  1. DRY (don't repeat yourself) principle: There is usually some combination of fields in any event which can be used to identify a "unique" event. In our org, we are not interested in any event which could be confused with another but for the id. We always want to know: Where did the event originate, at what time, and what type of event was it. Combined, that information always uniquely defines an event, so we don't need another id. We just need the data and to know which fields hold that data. Then, if we want, we can always combine (e.g. hash) that data later to create a unique composite key.
  2. KISS (keep it stupid simple): One of our principles is to keep things as simple as possible, especially at the interface. No eventId means fewer fields for consumers to manage, and fewer fields for producers to create or worry about generating correctly.
  3. Single source of truth: If there is both an eventId and another unique key, what if those two don't agree? On the plus side, the id can serve as a "checksum" of sorts. However, generating the eventId has to be done correctly. It must represent the producer's "idea" of what constitutes a unique event, and there is no documentation of that logic, because it's hidden in the algorithm that chooses when to mint a new id, rather than being explicit in the data. For instance, it would probably be a bug if each communication retry generated a new "eventId", or if two events were generated with the same id, but different payloads, e.g. in a multi-threaded race condition.

Reasons to keep it required:

I'm not sure what to say here. I think there are a class of tools that will be able to operate much more simply once there is a single field that represents the event. However, I think those tools could simply require the id at that point, rather than requiring every event to have the same eventId throughout its lifecycle. In case of an event pipeline where an event forks into multiple events, this id cannot stay static anyway.

jric commented 5 years ago

If this proposal is approved, I will submit a PR.

duglin commented 5 years ago

@jric are you able to join one of our weekly calls? They're on Thursdays at 12pm ET. I'm asking because we typically focus on PRs first but if there's an additional topic (such as an issue like this one) then we can add it to the agenda but its always better to have the proponent of the idea on the call to present their thinking.

You're also free to wait for feedback on the issue itself, but I know folks are pretty busy and so it might be a while before that happens.

cneijenhuis commented 5 years ago

Here is my try at convincing you :)

Generally, CloudEvents is concerned with interoperability. I do not question that within your organization, you may not need an explicit eventId but can uniquely identify and deduplicate an event from multiple fields. However, not having an eventId hinders interoperability, because a generic middleware can not know which these fields are.

In other words: By not wanting to put the eventId, you are removing the possibility of a generic CloudEvents consumer to deduplicate your events (while your proprietary consumers can deduplicate your events).

jric commented 5 years ago

That was a very good try, @cneijenhuis ! Honestly, I appreciate that. I vigorously agree that the main purpose of CloudEvents (or any standard) is interoperability. I disagree that knowing which fields uniquely identify an event needs to be proprietary. I would actually argue that hiding this information, or providing a "unique" id based on unknown and possibly incorrect logic, is what hinders full interoperability. This unique fields information can be passed as part of the event headers, inferred by the type of the extension given to the event, or as we do internally, it can be specified in a separate "event contract" defined for each specific type of event. Once the information is available, it provides richer context than an opaque id to those downstream systems. I appreciate what you're saying about deduplicating events. I agree that having a single field which can be guaranteed to be unique is the simplest way to implement deduplication. That is a strong argument. My point is that not all systems will need to implement event deduplication. Relative to the simplicity that is traded-off in the mental model of the implicit "guarantee" and implicit logic behind the uniqueness guarantee, I think it is ok to let any system that wants to do deduplication handle it whichever way best fits the situation, e.g. no deduplication, deduplication based on a "requestId" instead of an "eventId", deduplication based on a digest of the complete event or some subset of "unique" fields, deduplication that requires the "eventId" to be set, or even deduplication conditionally if eventId or other fields are set. I think we maximize the benefits of standardization by accommodating the widest range of use cases. Like I said before, I don't think it hurts anyone if someone wants to build ToolX that requires eventId and producerY doesn't have have eventId and for some reason they want to use ToolX, then they can decide to either use a different tool or start providing eventId.

jric commented 5 years ago

Sorry, I forgot to answer the other question. @duglin, yes I will join a Thursday call. Let me shoot for next week, to give a little more time for this text-based discussion to play out.

duglin commented 5 years ago

@jric ok - I'll try to remember to add this to next week's agenda - don't hesitate to poke me if it looks like I forgot.

To the issue itself...

I think it is ok to let any system that wants to do deduplication handle it whichever way best fits the situation, e.g. no deduplication, deduplication based on a "requestId" instead of an "eventId", deduplication based on a digest of the complete event or some subset of "unique" fields, deduplication that requires the "eventId" to be set, or even deduplication conditionally if eventId or other fields are set.

This aspect is the part that pushes me towards keeping it as required. In a distributed system where the producers and consumers are not tightly coupled, the producer will not know how each consumer will use a field like EventID. Some may use it for dedup while others may simply use it as some kind of unique tracking id. However, w/o a consistent field that all CloudEvents are required to have, each consumer will be forced to vary what field(s) it uses for this purpose based on what data is available to it on each incoming event. And I see a lot of value in providing this level of consistency for our consumers to make their life (and consumption of the events) a little easier.

jric commented 5 years ago

Thanks @duglin . I'll put this on my calendar.

I see what you're getting at in terms of consistency. If all events were produced with a unique id and if no-one cared about adding this step to the event production, and if those ids were produced correctly and if no-one ever cared about squeezing their events down to consume the least possible bandwidth, then I agree this would be a no-brainer. Unfortunately, that's not the case, by our own counter-example. We are thinking a bit like Apple or Tesla -- KISS the interface before the internals.

By the principle of separation of concerns, IMO no generic consumer should be doing event deduplication. If event deduplication is required, as with an unreliable transport, there should be a dedicated component for it. That component should be plugged in as a filter between the producer and the consumer.

You also probably wouldn't want people doing double-duty with the eventId as a traceId. As the event is transformed, forked, etc., it should get a new id, and that will make tracing more complicated than you would be able to achieve with a more purpose-built field.

cneijenhuis commented 5 years ago

We are thinking a bit like Apple or Tesla

I have great respect for what Apple and Tesla are doing. However, I don't think we should blindly copy what they do. This is a tradeoff decision - again, if your events are company-internal only, and don't need to work with software from other providers, the decision may be different than if you are concerned with interoperability.

KISS the interface before the internals

Comparing the interface of To deduplicate, use the eventID to To deduplicate, get the event contract in another format or To deduplicate, use the event headers in another format, I'd argue that eventID is actually the KISS interface 😉

By the principle of separation of concerns, IMO no generic consumer should be doing event deduplication. If event deduplication is required, as with an unreliable transport, there should be a dedicated component for it. That component should be plugged in as a filter between the producer and the consumer.

This paragraph is a bit confusing to me. For me, the "pluggable filter component" is an example of a "generic consumer". E.g. a middleware is not the final consumer in the chain, but it still consumes the events.

A more concrete example is a transport layer, such as Apache Kafka, that is idempotent towards sender retries (If a producer sends the same message twice, Kafka will not put the duplicate into the stream). The eventID makes this relatively easy to implement for any transport layer. The transport layer is then the component that is plugged between producer and (final) consumer.

jric commented 5 years ago

Comparing the interface of To deduplicate, use the eventID to To deduplicate, get the event contract in another format or To deduplicate, use the event headers in another format, I'd argue that eventID is actually the KISS interface 😉

@cneijenhuis, not sure if you're just joking, but in case you're making a serious argument, then I believe it's a straw-man. The simplest thing is: None of the above. And if you need any of the above, you would add in the data (eventId) and components (an eventId-deduplicator) to do exactly what you need.

An analogy to our current dilemma: From the HTTP standard, with which I assume you are likely familiar. The "status" field is a required header, because it is (or should always be) used by any reasonably well-architected communication agent (a web client.)

Another field which is dedicated to the integrity of the transport is only applicable in some situations, and is therefore defined as an "extension" to the base standard (the set of required fields.) An example of such a field is the Content-MD5 header.

This paragraph is a bit confusing to me

It looks like you interpreted it correctly, but if there are still open questions here, let me know. Otherwise, I look forward to chatting tomorrow.

duglin commented 5 years ago

@jric will you be able to join the call tomorrow? If we have time I thought this might be a good topic for discussion - or do you still want more comment-based chats?

duglin commented 5 years ago

on today's call we agreed to keeping it required for now. We can reopen if new info (e.g. adoption issues) arise.

alanconway commented 5 years ago

Making eventID optional does not IMO hinder deduplication or interoperability. If eventId is absent you are enabling "null deduplication" - effectively saying is "this event is distinct from all other events from the same source, it is not a duplicate and can never be duplicated". This is useful for applications that send cheap, idempotent events (best-effort or reliable) It's also useful to "switch off" cloudevent duplication overheads in applications that are doing some other kind of end-to-end dedup based on the event payload (e.g. encapsulated AMQP or MQTT message IDs) Requiring unique eventId when it's not needed is a burden on producers - UUIDs are expensive to generate and even a simple counter requires some form of persistence if the producer can be restarted. It makes it harder to write stateless producers that transform the output of some other process without holding any state of their own.

duglin commented 5 years ago

Reopening to ask: is it too expensive to ask a producer to generate a unique ID?

alanconway commented 5 years ago

id exists to support deduplication. If you don't need deduplication then you don't need an id - so the application should decide if events need an id, not the spec.

For comparison: message IDs are not required by MQTT at QoS 0, or AMQP at QoS 0 or 1. These are the most popular QoS choices (AMQP impls often don't even support QoS 2.) They are useful for idempotent messages, when best-effort is good enough, or when the app has it's own end-to-end retry/dedup.

Requiring the app to provide message-id in those cases complicates the app's implementation and affects performance on a per-message basis. Generating a unique source once per producer is fine, but imposing unnecessary per-message uniqueness (persistence or UUID generation) is bad.

duglin commented 5 years ago

I always thought that id was introduced just so we have some consistent way to uniquely identify an event and each producer/consumer didn't need to invent their own. And all consumers could count on it always being there. How that ID would then be used, dedup vs something else, was just an impl choice to me. But that's just me...

alanconway commented 5 years ago

On Thu, Apr 18, 2019 at 11:28 AM Doug Davis notifications@github.com wrote:

I always thought that id was introduced just so we have some consistent way to uniquely identify an event and each producer/consumer didn't need to invent their own. And all consumers could count on it always being there. How that ID would then be used, dedup vs something else, was just an impl choice to me. But that's just me...

That's a fine reason to have an ID. But if you don't care about uniquely identifying messages (i.e. when I get an event I act on it, I don't care if it's somehow "the same" as some other event) then you don't need an ID. Requiring every message to have an ID, and assigning formal uniqueness constraints and dedup semantics (producers MUST assign unique IDs, consumers MAY ignore duplicate IDs) puts a burden on producers that don't need unique identification. Of course apps that do need it are free to use it however they wish.

Alternatively you could just declare id to be a slot for arbitrary application use as you suggest (AMQP message-id is like that) but in that case we can't have formal uniqueness requirements in the spec and it can't be used in a generic way by libraries and intermediaries to perform deduplication in a transport-neutral way.

You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/cloudevents/spec/issues/326#issuecomment-484561733, or mute the thread https://github.com/notifications/unsubscribe-auth/AB3LUXTOIOBOVUVO324ZKDLPRCHSLANCNFSM4FZMJWYQ .

duglin commented 5 years ago

Yep, but since producers can't know for sure if consumers need this or not, but we do ack that some might, aren't we kind of forced to require it? W/o it being a MUST, consumers can't count on it and then it becomes almost useless since if consumers need whatever functionality they derive from it, then need to find a way to perform those semantics in the absence of it. Which means, thinking about the other threads we have going on, some might look at source+type+\ to try to generate some uniquely identifying value. And we're leaving that up to each consumer to make up their own rules. Seems risky and non-interoperable.

I guess I'd like to understand why generating an ID is so hard? Even one that's loosely timestamp based seems pretty trivial

alanconway commented 5 years ago

On Thu, Apr 18, 2019 at 4:41 PM Doug Davis notifications@github.com wrote:

Yep, but since producers can't know for sure if consumers need this or not, but we do ack that some might, aren't we kind of forced to require it?

Producers can know based on what they produce. Heart-beat events never need to be de-duplicated. "Latest measurement" events never need to be de-duplicated. QoS 0 isn't always the right choice but it's important enough not to exclude it.

W/o it being a MUST, consumers can't count on it and then it becomes almost useless since if consumers need whatever functionality they derive from it, then need to find a way to perform those semantics in the absence of it.

The semantics of missing/empty ID are clear: "A message with no ID is not a duplicate." No need to compare anything with anything else: Just process it.

Which means, thinking about the other threads we have going on, some might look at source+type+ to try to generate some uniquely identifying value. And we're leaving that up to each consumer to make up their own rules. Seems risky and non-interoperable.

Many/most will use event IDs as you propose for maximum interop, clues to intermediaries etc. no problem there. But we are not mandating anything about the form of the ID, we're leaving the user with the puzzle of how to define these things. Since NOT defining any ID is offered as an option in most messaging systems, why rule that option out for CE?

I guess I'd like to understand why generating an ID is so hard? Even one that's loosely timestamp based seems pretty trivial

Imagine I'm implementing CE eventing sources for AMQP and MQTT message. Both these protocols sometimes include an event ID and sometimes don't, depending on the QoS level and the whims of the sender.

I would like to simply forward the incoming message-ID as the event-ID, but if I'm not allowed to send an event with no ID, what do I do with a message with no ID?

If the sender chose not to set a message ID it is because the message does not need to be identified. So all the performance and code overhead I add to generate on (timestamps, UUIDs, persistence etc.) all the risks I introduce by "editing" messages without understanding the application, actually serves no purpose. I don't mind writing code, I hate writing useless code.

You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/cloudevents/spec/issues/326#issuecomment-484682556, or mute the thread https://github.com/notifications/unsubscribe-auth/AB3LUXQ26RQ6W3BQS6SC2EDPRDMG5ANCNFSM4FZMJWYQ .

duglin commented 5 years ago

@alanconway thanks for the reply. Can you do me a favor and open a PR with the proposed changes you want to make to the spec? I suspect at a minimum it's s/REQUIRED/OPTIONAL/ and adding "If present," to the other bullets. This will help focus/force the discussion.

alanconway commented 5 years ago

On Tue, Apr 23, 2019 at 11:32 AM Doug Davis notifications@github.com wrote:

@alanconway https://github.com/alanconway thanks for the reply. Can you do me a favor and open a PR with the proposed changes you want to make to the spec? I suspect at a minimum it's s/REQUIRED/OPTIONAL/ and adding "If present," to the other bullets. This will help focus/force the discussion.

Here it is: https://github.com/cloudevents/spec/pull/422

duglin commented 5 years ago

Since the group decided to not merge #422 I believe we can close this issue now. Let me know if I'm mistaken.