cloudevents / spec

CloudEvents Specification
https://cloudevents.io
Apache License 2.0
5.06k stars 583 forks source link

Sequence extension - type string and lexicographical order #1151

Closed black-snow closed 1 year ago

black-snow commented 1 year ago

I was stumbling over some issue when I evaluated if sequence was the right field for revisioning my payloads.

The sequence extension requires for sequence:

MUST be a non-empty lexicographically-orderable string

But then:

RECOMMENDED as monotonically increasing and contiguous

contiguous seems kinda fuzzy when looking at strings. And worse, the only given sequencetype is an (for some reason) unsigned 32-bit int, which

MUST start with a value of 1 and increase by 1 for each subsequent value

But if we simply turn our numbers into strings like "1", "2", ... and 10 the lexicographical order would (arguably) be 1, 10, 2, which is not what we want. We'd have to zero left-pad everything like 01, 02, 10, ... but then we'd have to zero-pad to the full legth (always 10 characters in total).

So is sequence required to be orderable but we'll just ignore the lexicographical order when sequencetype == Integer because it makes no sense?


What if sequence was of type string|number and just guaranteed a total order?
Why do we waste half of the space with a signed int? (Who'd use negative seq values, especially if the seq must start at 1?)

duglin commented 1 year ago

Last year we changed that extension to just be a string and we got rid of the sequenceType stuff exactly because of the reasons you stated - it was just confusing. See: https://github.com/cloudevents/spec/blob/main/cloudevents/extensions/sequence.md for the latest - and yes it means you'll need to pad with zeros. However, notice that you could also use non-integers for the chars to reduce the number of bytes needed.

black-snow commented 1 year ago

@duglin thanks for the quick reply. Did you consider string|number?

duglin commented 1 year ago

I think that would then require some kind of "sequenceType" field again and (if my memory is correct) we decided to just keep it simple and only require strcmp to keep it easy and fast.

duglin commented 1 year ago

See: https://github.com/cloudevents/spec/pull/1031 and https://github.com/cloudevents/spec/issues/923

black-snow commented 1 year ago

Why is that? If it's a number it's a number (well, you shouldn't use floats, I guess) and we know how to sort them - if it's a string it must be ordered lexicographically.

Now you'd have to zero left-pad with unknown length as the length is no longer defined ...

/edit: will go through your links

duglin commented 1 year ago

w/o a "type" you have to first scan the entire string to see if there are any non-int chars in there, or let the str2int logic error and then revert to strcmp.

I suspect people would be willing to reopen the discussion if someone wanted to PR a change that kept it simple :-)

black-snow commented 1 year ago

@duglin Are you talking about different serializations? Haha, in my mental model it's all just JSON and I was wondering why you wouldn't have a type.

/edit: Thinking loudly: If some protocols (HTTP) don't support the type system, should this affect the core spec? But well, there's no union types defined so no String|Integer anyway.
Leading zeros aren't allowed btw, according to RFC 7159. And the "canonical string-encoding" must be supported. So unless we indicate the intended type we can't allow both string and int - I see. Unless we say "try to parse it as int first - if no bueno take it as a string". But that's crap and wasted energy when you know it's a string anyway.

I guess string was chosen in the first place because of TBUUIDs etc.
Having another field just to say how to interpret another is clunky.
Having different versions of the spec just for this case would be insane.

I guess the easiest way for me now would be to roll my own format like always prefixing it with a type indicator. Or, well, just when I want denote an int like i:12345. Or go with n leading zeros anyway with the excuse that it's somehow a doubly encoded int.

I just realized that the spec says that the sequence field itself (well, its value) must be "a non-empty lexicographically-orderable string" - not the whole sequence. "[...] a simple string-compare type of operation" could also be "cut off the first n characters and compare the rest".

/edit2: Another way would be to put the (zero-padded) number of digits first. 2147483647 is the max so if I don't care about negative numbers a two-character prefix would suffice. 1 would become 011, 100 become 03100 and 2147483647 become 102147483647. Or one character in hex.

black-snow commented 1 year ago

Guess, I'll close this for now.

duglin commented 1 year ago

I think in the end you landed at the right spot... it's just a string so issues like "leading zeros aren't allowed per the rfc" don't really matter if you always treat it as a string. Which is why I said you can use non-int chars - which could save you bytes on the wire if that's ever a concern.