AMWA-TV / is-13

AMWA IS-13 NMOS Annotation Specification [Work In Progress]
https://specs.amwa.tv/is-13
Apache License 2.0

Minimums for additional limitations #24

Closed garethsb closed 1 year ago

garethsb commented 1 year ago

https://github.com/AMWA-TV/is-13/blob/374c934932232e6cdf96a7dae7cb3558b45beaca/docs/Behaviour.md#additional-limitations identifies possible ways an implementation might be limited. Also discussed in https://github.com/AMWA-TV/is-13/issues/23#issuecomment-1572589318.

In Slack, @rbgodwin-nt gave a strawman for some minimum limitations:

from the standpoint of wanting to do some deployment on the NMOS Testbed on AWS and adding labels via Ansible, it would be very helpful if I knew that all the NMOS resources supported some minimum set of annotations. For example, all NMOS Resources (devices, nodes, senders, receivers, flows etc.) SHALL support a minimum of 5 user custom tags with at least 512-byte strings each. Then I can do my Ansible in a way that meets this and not have to worry about some device choking on my annotations.

@alabou responded:

such requirements may be too much for small devices (over 2KB per resource of persistent storage). I think we must be very careful with such a minimum requirement, as it may prevent devices from exposing the annotation API at all.

Can we reach consensus on appropriate minimum limitations, e.g.

garethsb commented 1 year ago

FWIW, Kubernetes annotations have a limit of 63 characters for the name (seems broadly equivalent to the part after the urn:x-nmos:tag:user: prefix for our user tag names). According to this post there isn't a per-value limit, only a limit per resource of 256 kB. The latter certainly isn't appropriate for resource-constrained devices!

alabou commented 1 year ago

An absolute minimum requirement would be for a Controller or Tool to be able to implement its own annotation scheme; i.e. set the label of a resource to some unique value, imposing a minimum number of characters like 64 for the label, and have a flag indicating if a resource has been annotated. From there a Controller / Tool can have its own database for annotations based on the unique label and flag.

I'm not a fan of putting too many requirements on the device. It seems that allowing the Controller to do the annotation job itself would be a better approach. Those Tools and Controllers could even persist the annotations using some global system registry to get cross-Tool/Controller interop.

garethsb commented 1 year ago

Um, that sounds like a completely different approach!

Obviously, resources already have unique identifiers that Controllers can associate with whatever external "annotations" they like.

The group previously discussed and rejected implementing an open API for this on the Registry to PATCH "overlays" into what's returned from the Query API. It has benefits - primarily, no waiting for support from device vendors - and some challenges to be worked out - e.g. version and lifecycle/persistence management.

That discussion may be worth reopening, but what's wanted here are minimum specs for devices if we go ahead with the proposed Annotation API spec.

alabou commented 1 year ago

Sorry to be off-topic, I was not aware that this was discussed.

timhall99 commented 1 year ago

Here is my take on the bullet point list from the original post:

garethsb commented 1 year ago

Thanks, Tim.

  • a minimum number of values per user tag that MUST be supported 10

I'm intrigued by this one, what is the use case for so many values for each tag?

garethsb commented 1 year ago

Sorry to be off-topic, I was not aware that this was discussed.

@alabou, personally I'm still interested in exploring this approach and seeing whether there are simple but effective lifecycle and update semantics.

garethsb commented 1 year ago

have we defined reset behaviour ?

From Behaviour - Resetting Values:

  • For labels and descriptions, the implementation MUST either restore an initial or configured default value, or set the value to the empty string.
  • For a named tag, the implementation MUST either restore an initial or configured default array of values, or remove the named tag from the resource.

timhall99 commented 1 year ago

Thanks @garethsb, I was sure we had discussed resetting, but it was late last night. I did mull over these minimum values; I felt that 24 chars for values might be too restrictive in some circumstances, hence the 32, used everywhere for consistency. For tag values, I started with a quantity of 3, but I thought such a low number would come back and bite us. Happy to discuss further: am I aiming too high, given the restricted resources of much of the equipment involved?

garethsb commented 1 year ago

32 characters for description, label, and 1+10 (name+values) x 4 tags = 32 x 46 characters per annotatable resource, minimum. If characters are stored in UTF-8, they take between 1 and 4 bytes. That makes the worst case minimum we'd be asking for 32 x 46 x 4 bytes per annotatable resource or approx 5.9kB. (I don't really want to put limits on supported code points, even though restricting to the first 128 code points (basically ASCII), which can each be UTF-8 encoded in one byte, would bring this down to 1.5kB... I guess we could at least point that out to clients though?)

Restricting to 3 values per tag takes the calculation from 5.9kB to 32 x 18 x 4B = 2.3kB. Just playing with numbers, if we went for a minimum of 10 tags with a minimum of 1 value each, that's 32 x 22 x 4B or 2.8kB. With 3 tags of 1 value each, that's 32 x 8 x 4B = 1kB.
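
For anyone who wants to play with the numbers, here is a minimal sketch of the arithmetic above. It assumes one label, one description, and a name plus N values per tag, every string at the 32-character minimum; the function name `worst_case_bytes` and its parameters are just illustrative, not anything from the spec.

```python
def worst_case_bytes(tags, values_per_tag, chars_per_string=32, bytes_per_char=4):
    """Worst-case storage per annotatable resource, in bytes.

    Counts one label, one description, and (1 name + N values) per tag,
    each string being chars_per_string characters long.
    """
    strings = 2 + tags * (1 + values_per_tag)
    return strings * chars_per_string * bytes_per_char


# The four scenarios discussed above: (tags, values per tag)
for tags, values in [(4, 10), (4, 3), (10, 1), (3, 1)]:
    print(f"{tags} tags x {values} values: {worst_case_bytes(tags, values)} bytes")

# ASCII-only variant of the first scenario (one byte per character)
print("ASCII only:", worst_case_bytes(4, 10, bytes_per_char=1), "bytes")
```

This gives 5888, 2304, 2816 and 1024 bytes for the four scenarios, and 1472 bytes for the ASCII-only case, matching the rounded figures above.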

Of course, these figures are all per resource. We haven't discussed a minimum limit on total annotations per device.

Need some low-memory device manufacturers to weigh in...

timhall99 commented 1 year ago

08/06/2023 The general conclusion from the call was :

  1. limits will be expressed in bytes rather than characters,
  2. limits are per resource. An overall limitation must not prevent every resource having the maximum number of characters and tag keys & values

Suggestion for the limits as follows (not sure that we were in total agreement on the SHOULD quantities of tags)

garethsb commented 1 year ago

We're going to have to be clear how a definition of string length in bytes is applied.

E.g.

The JSON strings "a\\b" and "a\u005Cb" are both three characters long according to the ABNF for JSON and three bytes long in UTF-8 (a\b).

The JSON "😃" (U+1F603) is 1 character long and 4 bytes in UTF-8 (0xF0 0x9F 0x98 0x83). The JSON "\uD83D\uDE03" is two characters according to the ABNF, but they are a surrogate pair that encodes 😃, and surrogate pairs are illegal in UTF-8; the correct encoding is the same 4 bytes in UTF-8 (0xF0 0x9F 0x98 0x83)...

(The two JSON strings "😃" and "\uD83D\uDE03" are equal.)
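
A quick way to check these cases is to parse each JSON text and measure the decoded string. This is only a sketch using Python's standard `json` module; the helper name `lengths` is made up for illustration. Note that Python's `len()` counts code points after decoding, so the escaped surrogate pair shows up as one code point rather than the two "characters" the JSON ABNF sees.

```python
import json


def lengths(json_text):
    """Return (code points, UTF-8 bytes) of the string encoded by json_text."""
    value = json.loads(json_text)
    return len(value), len(value.encode("utf-8"))


# Raw strings keep the JSON escapes intact for the parser.
backslash_escaped = r'"a\\b"'
backslash_unicode = r'"a\u005Cb"'
emoji_literal = '"\U0001F603"'
emoji_surrogates = r'"\uD83D\uDE03"'

for text in (backslash_escaped, backslash_unicode, emoji_literal, emoji_surrogates):
    print(text, lengths(text))

# Both spellings of the emoji decode to the same string and the same bytes.
same_string = json.loads(emoji_literal) == json.loads(emoji_surrogates)
emoji_bytes = json.loads(emoji_surrogates).encode("utf-8")
print(same_string, emoji_bytes.hex(" "))
```

So a byte limit measured on the decoded UTF-8 value gives the same answer however the client chose to escape the string in JSON.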

alabou commented 1 year ago

We could say in the spec that anything that is not in the Latin-1 Supplement block (U+0000 to U+00FF) is assumed to be 4 bytes wide ... The limit of 32 generic chars should be expressed in bytes as 32*4 bytes, i.e. 32*4 Latin-1 Supplement characters or 32 non-Latin-1 Supplement characters.

peterbrightwell commented 1 year ago

Update 2023-06-15: Group ok with the minimum to be specified as 64 bytes so up to 64 Latin-1 Supplement block characters (one byte), up to 32 Basic Multilingual Plane characters (two bytes), up to 21 Supplementary Plane characters (three bytes), up to 16 four-byte characters.

TODO: check the correct terminology for four-byte characters such as emoji and ancient Egyptian hieroglyphs.

garethsb commented 1 year ago

Why the increase to 64 Bytes? To get 21 CJK code points?

The later code points in the Basic Multilingual Plane need up to 3 Bytes per code point. The rest of the planes (U+10000 to U+10FFFF) need the fourth Byte.

See https://en.m.wikipedia.org/wiki/UTF-8#Encoding

The first 128 code points (ASCII) need one byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also IPA extensions, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for the remaining 61,440 code points of the Basic Multilingual Plane (BMP), including most Chinese, Japanese and Korean characters. Four bytes are needed for the 1,048,576 code points in the other planes of Unicode, which include emoji (pictographic symbols), less common CJK characters, various historic scripts, and mathematical symbols.

alabou commented 1 year ago

Why the increase to 64 Bytes?

There was no increase to 64 bytes but a decrease to 64 bytes ...

The limit was 32 characters ... which implied 128 bytes ... With UTF-8 ASCII (1 byte) 128 characters seems too much while 64 seems better than 32 ... So using a max of 64 bytes allows 64 ASCII characters and a minimum of 16 complex characters.

This seems to be a good compromise ... getting more ASCII characters, not wasting memory and have a reasonable footprint.

We get:

  • U+0000-U+007F => 1 byte => 64 characters (ASCII)
  • U+0080-U+07FF => 2 bytes => 32 characters (complete Latin and others)
  • U+0800-U+FFFF => 3 bytes => 21 characters (Japanese, Chinese, Korean)
  • U+10000-U+10FFFF => 4 bytes => 16 characters (Egyptian hieroglyphs)
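
The character counts for a 64-byte budget can be reproduced with a few lines of Python. This is just a sketch: the sample characters and the names `BUDGET_BYTES`, `SAMPLES` and `max_chars` are picked for illustration, one sample per UTF-8 width.

```python
BUDGET_BYTES = 64  # the per-string minimum under discussion

# One representative character for each UTF-8 encoded width
SAMPLES = {
    "ASCII": "A",                       # U+0041, 1 byte
    "Latin with diacritic": "\u00e9",   # U+00E9, 2 bytes
    "CJK (BMP)": "\u4e2d",              # U+4E2D, 3 bytes
    "emoji": "\U0001F603",              # U+1F603, 4 bytes
}


def max_chars(sample, budget=BUDGET_BYTES):
    """How many copies of `sample` fit in `budget` bytes of UTF-8."""
    return budget // len(sample.encode("utf-8"))


for name, sample in SAMPLES.items():
    width = len(sample.encode("utf-8"))
    print(f"{name}: {width} byte(s) each, up to {max_chars(sample)} characters")
```

A real string will usually mix widths, so a device would compare the total encoded length against the limit rather than count characters.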

garethsb commented 1 year ago

OK, I see. My recollection of the previous call was 32 ASCII chars and thus fewer of the 'bigger' chars.

If min of 64B x 3 storage per annotatable resource is OK for constrained devices, great. I'll update #26 accordingly and we can merge.

Do we need 64B for the user tag name (after the "urn:x-nmos:tag:user:" namespace prefix)? The simplicity of the same limit is nice but will it get used? Although the JSON Schema type of the tag name is just string, and is not required by IS-04 to be a URN, in this case we are discussing a URN so the character set defined for Namespace Specific Strings by RFC 8141 applies and that's an ASCII subset.

alabou commented 1 year ago

My understanding is that the limit of 64B corresponds to the complete tag name, not only what comes after "urn:x-nmos:tag:user:" so for ASCII there remain 44 characters after "urn:x-nmos:tag:user:".
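
As a sanity check on the 44 figure, here is a small sketch. It assumes the 64-byte limit applies to the complete tag name as described above; `LIMIT_BYTES`, `remaining` and `tag_name_fits` are illustrative names only, not part of any NMOS API.

```python
PREFIX = "urn:x-nmos:tag:user:"
LIMIT_BYTES = 64  # assumed to cover the complete tag name, prefix included

# Bytes left for the part of the name after the namespace prefix
remaining = LIMIT_BYTES - len(PREFIX.encode("utf-8"))
print(remaining)


def tag_name_fits(name, limit=LIMIT_BYTES):
    """True if the complete tag name fits in `limit` bytes of UTF-8."""
    return len(name.encode("utf-8")) <= limit


print(tag_name_fits(PREFIX + "location"))
print(tag_name_fits(PREFIX + "x" * 45))
```

The prefix is 20 ASCII bytes, so 44 bytes remain; a suffix of 45 ASCII characters is one byte over.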