Make UUri specification more formal

sophokles73 commented 4 months ago

This addresses #115 and other issues.

The UUri specification has been lacking many details that are relevant for correctly implementing the data model in client libraries.

A lot of the specification's textual content has been transformed into formal requirements based on invariants and predicates expressed using the Object Constraint Language.

Last but not least, the 'id' and 'ip' properties have been removed from UAuthority's data model because they did not really serve a specific pupose beyond the semantics of the 'name' property.

sophokles73 commented 4 months ago

Here's my first shot at a UUri spec formalization.

One thing that still bothers me: The 8-bit length field of the Micro Format allows for the authority to use up to 255 bytes. The name property, on the other hand, contains (unicode) characters which may be encoded into up to three bytes (using UTF8) per character.

This shouldn't necessarily be an issue because most characters used for the logical name will consist of ASCII characters only and those are encoded into a single byte (using UTF8).

However, IMHO we should properly handle cases in which the authority would be encoded into more than 255 bytes. Given that the Long and Short Format do not put any constraints on the authority length, I wonder if we really need to constrain it in the Micro Format.

If we need/want to set a limit, then IMHO we should define this constraint on the Object Model itself. We should then probably only allow ASCII characters in the authority name so that we know upfront, how many bytes it would be encoded into ...

sophokles73 commented 4 months ago

@stevenhartley @PLeVasseur @tamarafischer @AnotherDaniel @evshary

any feedback would be welcome

stevenhartley commented 4 months ago

Here's my first shot at a UUri spec formalization.

One thing that still bothers me: The 8-bit length field of the Micro Format allows for the authority to use up to 255 bytes. The name property, on the other hand, contains (unicode) characters which may be encoded into up to three bytes (using UTF8) per character.

This shouldn't necessarily be an issue because most characters used for the logical name will consist of ASCII characters only and those are encoded into a single byte (using UTF8).

However, IMHO we should properly handle cases in which the authority would be encoded into more than 255 bytes. Given that the Long and Short Format do not put any constraints on the authority length, I wonder if we really need to constrain it in the Micro Format.

If we need/want to set a limit, then IMHO we should define this constraint on the Object Model itself. We should then probably only allow ASCII characters in the authority name so that we know upfront, how many bytes it would be encoded into ...

@sophokles73 we need to follow RFC3968 which means it cannot be any form of UTF-8 character only those characters listed in Appendix A, a collection of ABNF for URI.

sophokles73 commented 4 months ago

we need to follow RFC3968 which means it cannot be any form of UTF-8 character only those characters listed in Appendix A, a collection of ABNF for URI.

Well, for the long and short form URIs that is done by means of percent encoding any illegal characters. However, that also increases the number of ASCII characters in the URI.

For the micro form serialization I do not see why we would be bound by RFC3986. We are not producing a (textual) URI but a custom binary representation and as long as we can recreate the original UUri from it, we should be safe, shouldn't we?

Another approach, of course, would be to restrict UAuthortity.name to only contain ASCII characters in the first place, if that is what you mean.

stevenhartley commented 4 months ago

we need to follow RFC3968 which means it cannot be any form of UTF-8 character only those characters listed in Appendix A, a collection of ABNF for URI.

Well, for the long and short form URIs that is done by means of percent encoding any illegal characters. However, that also increases the number of ASCII characters in the URI.

For the micro form serialization I do not see why we would be bound by RFC3986. We are not producing a (textual) URI but a custom binary representation and as long as we can recreate the original UUri from it, we should be safe, shouldn't we?

Another approach, of course, would be to restrict UAuthortity.name to only contain ASCII characters in the first place, if that is what you mean.

Yes, lets restrict to ASCII characters

sophokles73 commented 4 months ago

@stevenhartley I have updated according to your feedback but also have added some questions to the comments ...

stevenhartley commented 4 months ago

@stevenhartley I have updated according to your feedback but also have added some questions to the comments ...

I answered your questions this morning.

sophokles73 commented 4 months ago

@stevenhartley I have updated according to your feedback but also have added some questions to the comments ...

I answered your questions this morning.

@stevenhartley It looks like some of them slipped through ... maybe you haven't seen them because they have been collapsed by GitHub?

sophokles73 commented 4 months ago

@stevenhartley I just had a little chat with @AnotherDaniel about his problems using UUris from within his uSubscription implementation which are mainly related to the duality of identifiers (name vs. id) :-) We ended up thinking out loud about trying to get rid of the numerical id properties in the same way as we did in UAuthority, i.e. leave the resolving of logical names to identifiers to the UTransport that actually requires it. Currently, this would only affect the SOME/IP client if I am not mistaken and we could imagine it maintaining a mapping of names <-> (SOME/IP) ids in very much the same way as we let it do the mapping of names <-> IP addresses.

Would it make sense to think a little more into that direction or is it totally out of the question from your point of view?

stevenhartley commented 4 months ago

@stevenhartley I just had a little chat with @AnotherDaniel about his problems using UUris from within his uSubscription implementation which are mainly related to the duality of identifiers (name vs. id) :-) We ended up thinking out loud about trying to get rid of the numerical id properties in the same way as we did in UAuthority, i.e. leave the resolving of logical names to identifiers to the UTransport that actually requires it. Currently, this would only affect the SOME/IP client if I am not mistaken and we could imagine it maintaining a mapping of names <-> (SOME/IP) ids in very much the same way as we let it do the mapping of names <-> IP addresses.

Would it make sense to think a little more into that direction or is it totally out of the question from your point of view?

No we cannot do this (leave it up to the transport layer). The source of truth for service ids is the proto and is known to the application layer code or uDiscovery (not the transport layer). UAuthority to ip address can be done using linux resolver or static IP address table.

AnotherDaniel commented 4 months ago

Hm, I hear what you're saying. So pushing ahead with uDiscovery, and making that the source of truth for this kind of lookup - based in the proto definitions - would that be an option?

I really don't like this information-duality between names and ids in the UURi, also the fact that we are redefining/extending what an uri is... feels off. Would much rather have uris just be uris, and have a well defined, separate way to translate to id representation and back (like url - DNS - ip)

stevenhartley commented 4 months ago

I'm open to other suggestions and agree when we only had long form URI it was much simpler. I haven't been able to think of a different solution.

AnotherDaniel commented 4 months ago

I'm open to other suggestions and agree when we only had long form URI it was much simpler. I haven't been able to think of a different solution.

This might become too big a discussion for this PR - take this out into a dedicated issue maybe?

I'm asking this from my usual clueless point of view - but my thinking would be: shouldn't the up-client/boundary-ustreamers adapters be able to perform name-to-id-to-name translation for the messages that flow by, with the relevant information retrieved from uDiscovery (DNS-style) services?

These uDiscovery entities, within a certain scope (i.e. vehicle), might then address the question of how to efficiently manage their translation tables as some form of distributed configuration, that can be managed as a separate problem to solve (so these tables might be just built from proto definitions, as default/semi-static config, but obviously some form of updatability might be desirable in the future)...

evshary commented 4 months ago

Another approach, of course, would be to restrict UAuthortity.name to only contain ASCII characters in the first place, if that is what you mean.

Yes, lets restrict to ASCII characters

Sorry for disrupting the discussion. I also like the restriction of UAuthority.name, but I'm not sure what the range of ASCII characters here is. Does it only limit to use alphabet and number, or it can also use something like !, +, @, etc? Also, is the UAuthority.name still case-insensitive? These are related to whether the character are valid when transforming into protocol topic / key.

sophokles73 commented 4 months ago

Sorry for disrupting the discussion. I also like the restriction of UAuthority.name, but I'm not sure what the range of ASCII characters here is. Does it only limit to use alphabet and number, or it can also use something like !, +, @, etc?

In the PR I am currently restricting the authority to be a host following the corresponding production in https://datatracker.ietf.org/doc/html/rfc3986#appendix-A

In particular, this also includes characters from the sub-delim production which also includes !, & etc

Also, is the UAuthority.name still case-insensitive?

Yes, FMPOV that is the plan.

evshary commented 4 months ago

In the PR I am currently restricting the authority to be a host following the corresponding production in https://datatracker.ietf.org/doc/html/rfc3986#appendix-A

In particular, this also includes characters from the sub-delim production which also includes !, & etc

OK, then I still need to do some transformation while generating Zenoh key. At least, $ and * are forbidden in Zenoh key.

Yes, FMPOV that is the plan.

Then, no matter MQTT topic or Zenoh key should transform the authority to either lowercase or uppercase, since they are case-sensitive.

Is there any reason why we stick to RFC3986 standard here? If not, how about defining our own UAuthority for our own purpose?

sophokles73 commented 4 months ago

Is there any reason why we stick to RFC3986 standard here? If not, how about defining our own UAuthority for our own purpose?

IMHO the biggest advantage is to be able to use tooling that is available for RFC3986 URIs, e.g. for parsing, comparing etc. Most programming languages have support for URIs in some way.

That said, as long as we make sure that a UUri is-a RFC3986 URI, we should be on the safe side. This should indeed allow us to e.g. restrict the supported characters in the host production if we feel that this would be beneficial. Do you have a concrete proposal for that?

evshary commented 4 months ago

IMHO the biggest advantage is to be able to use tooling that is available for RFC3986 URIs, e.g. for parsing, comparing etc. Most programming languages have support for URIs in some way.

Totally agree with you.

Do you have a concrete proposal for that?

No, but I'm thinking perhaps using unserved here in UAuthority is enough.

unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

Also, limiting the using lowercase (or uppercase) alphabets here might be simpler. But it's still ok to transform UAuthority into lowercase (or uppercase) in the transport layer (MQTT, Zenoh...).

stevenhartley commented 4 months ago

@sophokles73 (et. al.),

I want to summarize conversations over email so we can conclude on this topic. I really don't think we will get a single UUri definition that will map perfectly to all transports so need to to compromise, taking into consideration the ease of use, $ for transmission, and other factors. Given that we will have to statically map service/topic information for mechatronics anyways (for SOME/IP), I think the "good enough" solution shall be:

Go back to a single definition of a UUri that takes the best of long & short while removing micro all together
Make UAuthority mandatory: Removes ambiguity if present or not and allows for simplifies normalization of UUris
Constrain UAuthority::name and UEntity::name to "unreserved" defined above (and possibly limit size)
Represent UResource as an ID in lieu of name+instance+Message strings: This reduces the size of the UUri while still providing uniqueness for the UResource and flexibility for UEntity naming to number mapping. We keep the definition of id ranges to what we've already defined (i.e. 0 is repsonse, 0-0x7FFFF is request, 0x8000- is publish/notification)

If we settle on above, there is no more need for isResolved(), isLong(0, isShort(), isMicro() and all the other validator logic that is overly complicated. Serializers are simplified back to a single toString() and fromString() and the string format is now:

up://[UAUTHORITY_NAME]/[UENTITY_NAME]/[UENTITY_VERSION_MAJOR][.UENTITY_VERSION_MINOR]/[URESOURCE_ID]

I've captured these changes in #126 . The reason for keeping UAuthority, UEntity, and UResource is to minimize the amount of changes in all the existing code and not to re-invent UUri completely.

sophokles73 commented 4 months ago

... and the string format is now:

up://[UAUTHORITY_NAME]/[UENTITY_NAME]/[UENTITY_VERSION_MAJOR][.UENTITY_VERSION_MINOR]/[URESOURCE_ID]

So far, the minor version had not been part of any of UUri's serializations. Just want to make sure that we really need it in there. And if we do, I guess the inclusion of minor version should be dependent on the presence of major version

up://[UAUTHORITY_NAME]/[UENTITY_NAME]/[UENTITY_VERSION_MAJOR[.UENTITY_VERSION_MINOR]]/[URESOURCE_ID]

or be put into a separate path segment

up://[UAUTHORITY_NAME]/[UENTITY_NAME]/[UENTITY_VERSION_MAJOR/[UENTITY_VERSION_MINOR]/[URESOURCE_ID]

However, I'd prefer the former, if we really need minor version in it ...

sophokles73 commented 4 months ago

@stevenhartley I have adapted the UUri spec according to the discussed simplifications of the UUri data model. Some things I want to bring to your attention:

I propose to use base16 encoding for the identifiers in the URI which makes them easier to read FMPOV and will also save a few bytes compared to decimal encoding
The host production used for the authority name has not been adapted (yet) to use only unreserved characters as proposed by @evshary. We might want to consider this, though.
With the introduction of explicit wildcard values we no longer support URIs having empty path components. FMPOV this is actually good because it makes parsing much easier and should not have any real impact on message exchanges because the wildcards will only be used in the definition of forwarding rules in a streamer.
I have added OpenFastTrace specification item markup where appropriate. We can use these for testing the OpenFastTrace toolchain before rolling out to other spec documents.

stevenhartley commented 4 months ago

@stevenhartley I have adapted the UUri spec according to the discussed simplifications of the UUri data model. Some things I want to bring to your attention:

I propose to use base16 encoding for the identifiers in the URI which makes them easier to read FMPOV and will also save a few bytes compared to decimal encoding

The host production used for the authority name has not been adapted (yet) to use only unreserved characters as proposed by @evshary. We might want to consider this, though.

With the introduction of explicit wildcard values we no longer support URIs having empty path components. FMPOV this is actually good because it makes parsing much easier and should not have any real impact on message exchanges because the wildcards will only be used in the definition of forwarding rules in a streamer.

I have added OpenFastTrace specification item markup where appropriate. We can use these for testing the OpenFastTrace toolchain before rolling out to other spec documents.

I'll take a look. I started to code these things on the 1.5.8 branch of up-java and the serializers and deserializers were drastically simplified as well.

sophokles73 commented 4 months ago

@stevenhartley I have pushed changes and amended your comments. Would you mind taking another look?

stevenhartley commented 4 months ago

@sophokles73 not sure if my comment got lost but the diagram needs to have a white background (it was removed) otherwise I cannot view it when using dark mode in github.

sophokles73 commented 4 months ago

@stevenhartley I have added the white background and updated the terminology used for ue_id

sophokles73 commented 4 months ago

@stevenhartley any idea why the verify-pr check is not completing? Or what it actually does/is, for that matter?

stevenhartley commented 4 months ago

@stevenhartley any idea why the verify-pr check is not completing? Or what it actually does/is, for that matter?

It builds the proto to make sure there are no issues with syntax. @sophokles73 I can't seem to retrigger it on my side, if you push something new (single line change) it should retrigger the workflows

sophokles73 commented 4 months ago

@stevenhartley this PR seems to be stuck, I have tried to push some changes but that doesn't seem to resolve the situation. I will now close this PR and create a new one instead ...

sophokles73 commented 4 months ago

Reopening to test a theory ...

eclipse-uprotocol / up-spec

Make UUri specification more formal #121