context for measurements

eclipse-archived / unide

Eclipse Public License 1.0

context for measurements #35

Closed ameinhardt closed 5 years ago

ameinhardt commented 6 years ago

As a system integrator, I want to get context information alongside the measurements, in order to facilitate the interpretation of the data. I'm not so happy about sending redundant information and suggested a manifest/schema -link for that. Nevertheless, I understand also that for simplicity and in case of retooling (Umrüsten) machines, such context might change. In that case inline context does make sense. Maybe we could allow inline or a context reference like json-schema "$ref"?

This has also been requested by Balluff and Trumpf. Former discussion here: https://www.eclipse.org/forums/index.php/t/1084951/. Previously discussed example context:

"context": {
    "temperature": {
        "unit": "Fahrenheit", // --> "" default, only label
        "gradient": 1.8, // --> 1.0 default, only for NUMBER
        "offset": 32, // --> 0 default, only for NUMBER
        "dataType": "NUMBER" // --> NUMBER default; [BOOLEAN,NUMBER,STRING,BASE64]
    }
}
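
For illustration, a receiver might apply such a context entry roughly as follows. This is only a minimal sketch: it assumes the intended conversion is `value * gradient + offset` with the defaults noted in the comments above, and the function name `apply_context` is hypothetical, not part of PPMP.

```python
def apply_context(raw_values, field_context=None):
    """Scale raw NUMBER values with the assumed linear rule value * gradient + offset."""
    field_context = field_context or {}
    gradient = field_context.get("gradient", 1.0)  # default taken from the example
    offset = field_context.get("offset", 0)        # default taken from the example
    return [value * gradient + offset for value in raw_values]


# Context as in the example: raw values are scaled and shifted into Fahrenheit.
context = {"temperature": {"unit": "Fahrenheit", "gradient": 1.8, "offset": 32}}
converted = apply_context([0, 100], context["temperature"])
```

Without a context entry the values pass through unchanged, which matches the idea that the whole context is optional.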

bgusach commented 6 years ago

Hi @ameinhardt,

You mean allowing both possibilities, right? A URL to another description, or an inlined description.

fpatz commented 6 years ago

I'd vote for an (optional) external reference. Inlined data is overly redundant. Architecturally, a URL would normally not point to the device itself, as a PPMP source is not a server in most cases. So, this is somewhat outside the domain of a payload protocol. For the metadata we may also have some overlap with Vorto, but I am not an expert with that.

bgusach commented 6 years ago

@fpatz absolutely, it would be weird that the URL points to the device itself. But a URL/any kind of reference to some resource with the description could work, although still clunky in my opinion. I can imagine many scenarios in which the consumer of the messages does not have access to that "description server", or the description is missing, outdated, etc. Moreover, for very intensive data transfer scenarios, sending this reference over and over again hundreds of times per second could be undesirable.

And of course, inlining it would be an absolute waste of bandwidth.

At the end this description is something that would be used just for configuring data consumers, which should happen extremely seldom. I don't know... I think this feature doesn't look like a bright idea...

ameinhardt commented 6 years ago

Retooling (Umrüsten) is not so seldom. In that case, units and especially limits might change. The accuracy of a measurement could vary the more a machine heats up, etc. I agree with the redundancy problem and the architectural preference. On the other hand, in order to avoid complexity, PLCs might just send such information every time. So I propose to define a context field that

includes the current limits content
includes optional accuracy and offset as numbers
includes an optional unit. The unit value is not defined, but a recommendation is given, e.g. as in SenML, 12.1. Units Registry
each of these fields can be a static value or an array. The static value applies to all elements in measurements series, an array applies one-by-one to the series elements
can include other, not-standardized fields. See #36
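
The "static value or array" rule in the list above could be handled on the receiving side roughly like this. It is a sketch under assumptions: the helper name `per_element` is hypothetical, and rejecting arrays whose length differs from the series is one possible choice that the proposal does not prescribe.

```python
def per_element(field_value, series_length):
    """Expand a context value to one entry per element of the measurement series."""
    if isinstance(field_value, list):
        # An array applies one-by-one, so it has to line up with the series.
        if len(field_value) != series_length:
            raise ValueError("array context must match the series length")
        return list(field_value)
    # A static value applies to every element of the series.
    return [field_value] * series_length


static_accuracy = per_element(0.5, 3)
array_accuracy = per_element([0.1, 0.2, 0.3], 3)
```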

If needed, this context should be inlined. In case it is not inlined, it can be referenced. A reference would be similar to $ref:

An object schema with a "$ref" property MUST be interpreted as a "$ref" reference. The value of the "$ref" property MUST be a URI Reference. Resolved against the current URI base, [...] All other properties in a "$ref" object MUST be ignored. The URI is not a network locator, only an identifier. A schema need not be downloadable from the address if it is a network-addressable URL, and implementations SHOULD NOT assume they should perform a network operation when they encounter a network-addressable URI.

It's up to the receiver, if he accepts references. He decides whether to resolve them every time, via a cache or any other registry that is not part of the scope of PPMP (vorto etc.). Btw., I think vorto doesn't define the metadata, but rather relies on ipso, e.g. direction object is defined here
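
A receiver that accepts references without any network access could look them up in a local registry. The following is a minimal sketch; the function name `resolve_context`, the registry dictionary and the URN used as identifier are all assumptions made for illustration.

```python
def resolve_context(context, registry):
    """Return an inline context as-is, or look up a "$ref" identifier locally."""
    if "$ref" in context:
        # Per the quoted rule, all other properties next to "$ref" are ignored.
        # The URI is treated as an identifier only; no network request is made.
        return registry.get(context["$ref"])  # None if the receiver does not know it
    return context


# Hypothetical receiver-side registry, filled by configuration or a cache.
registry = {"urn:example:context:press-1": {"temperature": {"unit": "Cel"}}}

inline = resolve_context({"temperature": {"unit": "K"}}, registry)
referenced = resolve_context({"$ref": "urn:example:context:press-1"}, registry)
unknown = resolve_context({"$ref": "urn:example:context:missing"}, registry)
```

A receiver that does not accept references at all would simply treat every `$ref` like the unknown case.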

bgusach commented 6 years ago

@ameinhardt I assumed that this description is just a help for humans (the "integrators"?) to understand the messages. If this is not the case, please correct me. Under this assumption, and regarding your comment:

Retooling (Umrüsten) is not so seldom

Seldom is actually a very ambiguous term :). What I wanted to say is that this description is only meaningful when the integrator is deciding what to do with this very kind of message. Once that is working, he does not really need to know that the unit is e.g. Fahrenheit until the next time the machine is retrofitted or modified. How often could that happen? Once after millions of messages?

bf-bryants commented 6 years ago

Hi,

Given that I was one of the people who originally requested this change, I'd like to clarify why it's needed.

We often have analog sensors in use, which deliver only unqualified values in the range 0000-FFFF, so there's no obvious way of converting that to a useful value with a unit. Even IO-Link sensors can differ in their output, depending on the configured mode.

The system that handles collected data is typically managed by completely different people to those who set up the sensors. The additional coordination effort between groups is considerable, so it is simpler if the data qualification can be included as part of the PPMP message. It also automatically resolves the problem of how to handle configuration changes - especially with large numbers of sensors.

The additional data bytes are not a consideration in our case, as the local network is nowhere near capacity. However, external references would be a problem as external access is generally blocked. Using a reference to a locally available server only works until the message is passed to a different network segment, or even the cloud, where the reference again cannot be accessed.

Our requirement is to be able to have fully qualified self-contained packets of data, where the data can also be non-numeric.

I should also note that the entire context section was defined as optional on purpose. If left out, you effectively revert to the V2 format. Everybody is free to choose what fits their scenario best, and nobody has to waste any bandwidth they don't want to.

As far as units go: in our discussion (Bosch, Sick, Balluff), we came to the conclusion that treating the unit as a label would be more pragmatic, as use cases often arise where the unit ends up being something domain specific that won't convert to an SI form. That said, I do like the IETF SenML suggestion.

Steve

bf-bryants commented 6 years ago

About units:

The SenML IETF draft says this:

IANA will create a registry of SenML unit symbols.

To our knowledge, IANA has not yet done this, and it's not clear when this will become available. Until that time, SenML is effectively unusable unfortunately.

The method used by OPC/UA to define units might be worth looking at. Here is a short overview of their fields from OPC UA Part 8, S. 15-16:

namespaceUri - identifies the organisation
unitId - int32 identifier for programmatic evaluation, -1 means none available
displayName - localized string
description - localized string

The suggested unit field for PPMP is the equivalent of the OPC/UA display name. I would like that we consider allowing the other three as well. In particular, the namespace allows us to choose between SenML/IANA or something else.

The unit ID is only relevant when a numbered list of units is referenced by the URI. For example, the IO-Link consortium does that.

As before, the whole thing must remain optional. Those who need context should be able to select between either a reference or inline (just not both at the same time). That covers the use cases that have been mentioned here.

Should we use the same field names from the OPC/UA spec?

Steve

PS. @ameinhardt: I hope I'm not too late with this suggestion!

bgusach commented 6 years ago

Hi @bf-bryants ,

I think the OPC UA approach still has the problem you pointed out before: people managing the receiving part of the system may not be able to get the description/semantics referred by the namespaceUri. However, this is as good as it gets IMO.

I'd suggest a simplified schema:

namespace: optional. If not provided, the consumers can assume it is an "in-house" unit.
unitID: a string. It is the actual ID within the namespace (if any). For instance fahrenheit. Should be like a variable name, kind of readable for humans, but programming friendly as well.
description: optional, can provide extra details. This is very important if there is no official documentation of the IDs in the namespace, or if this message is going to be sent to other networks with no access to this documentation.

I would not care at all about the OPC UA displayable names (displayName, description), especially if they are localized.

Not sure about the dataType field. In the case of booleans, numbers and strings, the JSON format is enough by itself. In case it is a base64-encoded string, maybe it would suffice to have that in the description? It is not very elegant, but automatically converting a base64 string into a chunk of bytes does not help further if you still don't have a detailed description of what all those bytes mean. And maybe somebody wants to use another encoding.

Another detail would be how important it is for us to have those gradients and offsets that were suggested at the beginning. My first idea would be that the device sending this information should send the values already corrected instead of sending them along with the correction factors... but I don't know if this is possible in all cases.

What do you guys think?

ameinhardt commented 6 years ago

@bgusach: your suggestions regarding the unit make sense, in my opinion. @bf-bryants: do you agree with the 'in-house' default and simplification of displayName? In a simple case, all that is left would be the id as in:

context: {
  temperature: {
    unit: {
      id: 'C'
    }
  }
}

Should we offer either a simplified form of:

context: {
  temperature: {
    unit: 'C'
  }
}

or a complex object with mandatory id & namespace URI:

context: {
  temperature: {
    unit: {
      id: 'C',
      namespace: 'https://eclipse.org/customOrPublicDefinition'
    }
  }
}

bgusach commented 6 years ago

@ameinhardt , that'd work, but what about always having a flat object for each dimension?

No namespace:

context: {
  temperature: {
    unitID: "C" 
  }
}

Namespace and possibly other stuff:

context: {
  temperature: {
    unitID: "C",
    namespace: "...",
    otherStuff: "..."   
  }
}

I think that makes parsing and validating easier.

bf-bryants commented 6 years ago

I agree with @bgusach:

localised texts are not needed.
unit namespace should be optional (default empty, meaning anonymous/unknown namespace).
the flatter structure is easier for parsing.

However, the previous two comments have inadvertently made a case for allowing a numeric identifier where the namespace has them (ie: in addition to the label):

You've both named the unit C (Coulomb) for a field called "temperature" - I guess you meant °C. If a numeric ID is available, this problem can be avoided and the label can be skipped too.

This leaves me with the following per measurement field:

unitNamespace or unitUri [string]
unit or unitName or similar [string]
unitId or unitID [number]

The last two of these could be merged if we allow the unit to be a string or a number - but I don't know if the schema will support that.

While I'm here: would it be sensible to allow a per-message default unit namespace? If a namespace is used at all, it's likely that multiple fields will use the same namespace, so we could avoid some text duplication.

The dataType field was primarily created so that small amounts of binary data could be transported. It is correct that the JS-native types are implicitly handled. If we assume that strings are just plain strings unless otherwise marked, then it could indeed be reduced to "BASE64" or nothing!

Base 64 was chosen because it's the de facto standard, with MIME et al. Is there something else that should be included here?

It should still be explicitly stated that all of a field's values must be of the same type for implicitly typed values. Thus you could never have this:

    "temperature": [ 1.0, false, "3" ]

Steve
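
The same-type rule from the comment above could be checked along these lines. This is a sketch, with hypothetical helper names `json_type` and `is_homogeneous`; it assumes the payload has already been parsed by a JSON library, so that integers and floats both count as the JSON type "number".

```python
def json_type(value):
    """Map a parsed JSON value to the name of its JSON type."""
    # bool has to be tested first, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def is_homogeneous(values):
    """True if all values of a series share one JSON type (an empty series passes)."""
    return len({json_type(value) for value in values}) <= 1


mixed_ok = is_homogeneous([1.0, False, "3"])
numbers_ok = is_homogeneous([45.4231, 46, 44.2432])
```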

bgusach commented 6 years ago

Hi @bf-bryants ,

However, the previous two comments have inadvertently made a case for allowing a numeric identifier where the namespace has them (ie: in addition to the label): You've both named the unit C (Coulomb) for a field called "temperature" - I guess you meant °C. If a numeric ID is available, this problem can be avoided and the label can be skipped too.

I didn't think too much about that and copied/modified @ameinhardt's example (that C). I agree that something like celsius would be way more appropriate as an ID, but at the end of the day, users of this protocol are free to choose within their namespace whether C stands for degrees Celsius, coulombs or carrots 😄

I'd say it is a bad idea to have two possible identifiers (unitName, unitID) for the engineering units, because you have to specify (and implementations must follow) what should happen if both IDs are available. Not a big deal, but complexity slowly creeps in this way. We should stick to only one identifier, and in my opinion string IDs win over numbers since they may or may not be descriptive, but numeric values are never very descriptive. If somebody insists on using the number 76 for degrees Celsius, that's fine, he or she can use the string "76". I think it is not worth modifying the schema to allow strings and numbers.

Side note: we should restrict the string ID to something like [a-zA-Z0-9]+
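
Such a restriction is cheap to enforce. A minimal sketch, using the character class from the side note; the names `UNIT_ID_PATTERN` and `is_valid_unit_id` are hypothetical:

```python
import re

# Character class taken from the side note: ASCII letters and digits only.
UNIT_ID_PATTERN = re.compile(r"[a-zA-Z0-9]+")


def is_valid_unit_id(unit_id):
    """True if the whole ID consists of one or more allowed characters."""
    return isinstance(unit_id, str) and UNIT_ID_PATTERN.fullmatch(unit_id) is not None


checks = {unit_id: is_valid_unit_id(unit_id) for unit_id in ["celsius", "76", "km/h", "° C"]}
```

With this pattern, `celsius` and `76` pass, while `km/h` and `° C` are rejected because of the slash, the degree sign and the space.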

The dataType field was primarily created so that small amounts of binary data could be transported. It is correct that the JS-native types are implicitly handled. If we assume that strings are just plain strings unless otherwise marked, then it could indeed be reduced to "BASE64" or nothing!

I'd say it is not very elegant to have type information both in the JSON format and within the payload. Moreover from my limited point of view, binary data is not very useful if you don't have a proper description of what it is (e.g. "it is a jpeg", or "the first byte means this, the second one that", etc) , so I'd personally stick to using the description (either inlined or in the namespace documentation) and saying something like base64 encoded bytes, meaning blablabla....

However, I don't really know all the use cases from the real world, and if it is a must for you that base64 strings are automatically converted to binary data on the consumer side, I guess there is no way around using an extra field (or maybe using some kind of prefix like base64:i2uk2398ah89f9h8qw4... ??... but meh...). It's your call.

Base 64 was chosen because it's the de facto standard, with MIME et al. Is there something else that should be included here?

Technically, other encodings could be used to embed binary data in JSON, such as base85 or base91 (which are more bandwidth-efficient), but you are right there: base64 is the de facto standard and the improvements of other encodings are probably not worth the hassle.

While I'm here: would it be sensible to allow a per-message default unit namespace? If a namespace is used at all, it's likely that multiple fields will use the same namespace, so we could avoid some text duplication.

That's a good idea. Default, and if some units want to use another namespace, they're free to override the default one. Probably the context object is the right place to define this default.
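
A default with per-field override could be resolved like this. The sketch assumes a key named `defaultNamespace` directly in the context object and a `namespace` key per field; neither name was agreed on in this discussion, and the URNs are made up.

```python
def effective_namespace(context, field):
    """Return the field's own namespace, else the message default, else empty."""
    field_context = context.get(field, {})
    return field_context.get("namespace", context.get("defaultNamespace", ""))


context = {
    "defaultNamespace": "urn:example:units:si",  # hypothetical per-message default
    "temperature": {"unit": "Cel"},
    "pressure": {"unit": "psi", "namespace": "urn:example:units:imperial"},
}

temperature_ns = effective_namespace(context, "temperature")
pressure_ns = effective_namespace(context, "pressure")
```

One design consequence of putting the default inside the context object is that `defaultNamespace` can no longer be used as a measurement field name.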

bf-bryants commented 6 years ago

Hi,

After letting it bounce about in my head for a couple of days, I think you're right about using only one field for a unit. As you point out, putting a number into a string is an option; it's easy to detect.

I'm not so sure about restricting content. I feel that it should be possible to specify at least SI units directly, which means you'd also need superscript numbers, the degree and slash symbols, eg: it should be possible to write "km/h" etc.

Here's an example of a binary data use case: sending a current tag ID from an RFID reader along with other measurement data such as conveyor speed etc. There's no format or meaning to the data, other than that it's an ID. We know it's a tag ID from the combination of the field name and the PPMP message's device ID (we use UUIDs); we use that to look up configuration information.

Note that the conversion to PPMP/JSON is often done by a little embedded field device (eg: IO-Link master), with limited context and resources. While it can read configuration information about its connected devices, interpretation of data is generally not possible.

My suggestion would be to use an additional property (which is outside of the standard's scope). An alternative would be to add an optional field for the MIME type to the context. Neither is suitable for complex data descriptions though. As it's not clear how far we should go with data description details, I am inclined not to do it at all; I therefore expect that the receiver either ignores such fields, or they know how to deal with the data somehow.

I think it would be good to omit explicit type information if it's implicitly and non-ambiguously available. Whether something is a number or a string is clear, but interpreting string content can be a source of problems. I would avoid using prefixes inside string data, as we then have a lot more effort to ensure that we're not looking at a string that happened to start with the same characters. The same problem occurs if a string happens to start with the same characters that base64 uses (or bases 85, 91 or 122). I must assume a string is just a string unless explicitly marked as being something else.

Note that this can be optional (in my opinion). If we can mark a specific string field as using a certain encoding by using out-of-band configuration, then the PPMP message can omit that information. As above, anybody who can't interpret the string content can ignore it.
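
The "a string is just a string unless explicitly marked" rule could look like this on the consumer side. It is a sketch: the context key `encoding` and the function name `decode_series` are assumptions, and only base64 is handled.

```python
import base64


def decode_series(values, field_context=None):
    """Decode strings to bytes only if the field is explicitly marked as BASE64."""
    encoding = (field_context or {}).get("encoding")  # hypothetical context key
    if encoding == "BASE64":
        return [base64.b64decode(value, validate=True) for value in values]
    # No marker: strings stay plain strings, whatever they happen to look like.
    return list(values)


tag_ids = decode_series(["3q2+7w=="], {"encoding": "BASE64"})
plain = decode_series(["temp"])

# Guessing from content is unreliable: the ordinary word "temp" is also valid base64.
accidental = base64.b64decode("temp", validate=True)
```

The last line is why content sniffing is not a substitute for an explicit marker: `"temp"` decodes to three bytes without any error.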

BTW: the only reason for this is because JSON has no binary data type of its own. :-(

TL;DR:

Units: use a text ID and a namespace (default empty). Alternative unit namespace default can be set in context.
Binary data content: content type outside of spec.
Data types (excluding binary data in strings): use JSON type, no need for explicit context.
Encoding binary data in strings: explicitly name encoding type, otherwise it's a non-binary string!

Steve

bgusach commented 6 years ago

Hi @bf-bryants

I'm not so sure about restricting content. I feel that it should be possible to specify at least SI units directly, which means you'd also need superscript numbers, the degree and slash symbols, eg: it should be possible to write "km/h" etc.

I think the unit IDs should be analogous to variable names in a programming language: among other things, they should be readable and hard to confuse. You suggested using numbers as IDs for the engineering units, and it makes sense to me from the "hard to confuse" point of view, 11 (although not very readable). If we allow any character, we may end up with things like ° C and °C which are nice and readable, but very easy to confuse. I'd go for a very restricted ASCII set, where slash could be allowed like in km/h, but only lower case (or upper, I don't mind at all), and no white spaces.

My suggestion would be to use an additional property (which is outside of the standard's scope). An alternative would be to add an optional field for the MIME type to the context. Neither is suitable for complex data descriptions though. As it's not clear how far we should go with data description details, I am inclined not to do it at all; I therefore expect that the receiver either ignores such fields, or they know how to deal with the data somehow.

I'm not sure I understand what you meant in that paragraph. Could you try to explain it in another way and maybe give examples?

I think it would be good to omit explicit type information if it's implicitly and non-ambiguously available. Whether something is a number or a string is clear, but interpreting string content can be a source of problems. I would avoid using prefixes inside string data, as we then have a lot more effort to ensure that we're not looking at a string that happened to start with the same characters. The same problem occurs if a string happens to start with the same characters that base64 uses (or bases 85, 91 or 122). I must assume a string is just a string unless explicitly marked as being something else.

Yup, that with the prefixes was rather a dummy "brainstormy" idea. Could work, but to make it fast we should prefix all the strings to know if they are strings, base-xx or some other exotic stuff. Meh... :)

BTW: the only reason for this is because JSON has no binary data type of its own. :-(

That is true. In the FAQ it is stated that PPMP happens to be JSON, but could be changed to something else if necessary. Although that was said regarding size, this problem could be a reason to move to some other standard (something like messagepack, protobuf, BSON, ...?). What do you think @ameinhardt ?

Thanks, Bor

bf-bryants commented 6 years ago

Hi,

I'd go for a very restricted ASCII set, where slash could be allowed like in km/h, but only lower case (or upper, I don't mind at all), and no white spaces.

That won't work because SI units are case sensitive - for example with mega (M) and milli (m). I see only two choices here - either we use a correct set of rules for validating the content, or we don't validate it and live with the fact some people will put rubbish in that field. I am currently leaning towards the second, as getting the first one right will delay the V3 release too long. :-)
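
The case-sensitivity point can be shown in a few lines. This sketch only covers the two prefixes named above and assumes symbols of the form "prefix + A"; the names `SI_PREFIX_FACTOR` and `factor` are hypothetical.

```python
SI_PREFIX_FACTOR = {"m": 1e-3, "M": 1e6}  # milli and mega differ only by case


def factor(unit_symbol, base_symbol="A"):
    """Return the scale factor of a prefixed symbol such as "mA" or "MA"."""
    if not unit_symbol.endswith(base_symbol):
        raise ValueError("unexpected base unit")
    prefix = unit_symbol[: -len(base_symbol)]
    return SI_PREFIX_FACTOR[prefix] if prefix else 1.0


# Forcing lower case collapses two different units into one ID.
distinct_before = {"mA", "MA"}
distinct_after = {symbol.lower() for symbol in distinct_before}
```

After lower-casing, milliampere and megaampere, which differ by a factor of 10^9, share the single ID `ma`.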

I'm not sure I understand what you meant in that paragraph.

That was about how to describe binary content sent with measurement data. Short answer: "Don't!"

It's also possible to use an "additional property" (aka custom field), but such a field is not part of the PPMP spec by virtue of it being a custom field. If the receiver doesn't know what's in the binary data field, they should ignore it.

[...] this problem could be a reason to move to some other standard (something like messagepack, protobuf, BSON, ...?)

We're currently looking at binary formats for lower-level direct data exchanging where performance is more important. I like how protobuf has an explicit definition layer. We're also looking at CBOR, which is a strict JSON superset but explicitly supports various number types and binary data.

However, as a general interchange format, JSON is very widely accepted and is human readable. I would actually be surprised if PPMP starts using something else.

Best regards,

Steve

bgusach commented 6 years ago

Hi,

That won't work because SI units are case sensitive - for example with mega (M) and milli (m). I see only two choices here - either we use a correct set of rules for validating the content, or we don't validate it and live with the fact some people will put rubbish in that field. I am currently leaning towards the second, as getting the first one right will delay the V3 release too long. :-)

You suggested using unitID as an integer because it would be easy to programmatically identify engineering units, which I also think is an important thing. However, if we allow for something more than a small boring set of lowercase plus underscore chars or similar, you start getting messy IDs like coolIdea or meters per second (add a tab there), which are very error prone. I don't see any problem writing mili_amper instead of mA.

bf-bryants commented 6 years ago

Yes - that was an idea from the OPC/UA spec. A namespace URI gives us a complete enumerated set of possibilities, and the unit ID refers to one of them. It seemed likely that devices will already have this information (eg: for OPC/UA, IO-link etc), so re-using it for PPMP would make life easier for them.

The problem is what to do when no namespace is used, as the numeric ID has no meaning. A string allows at least something to be set - but as you quite rightly point out, people will put all sorts of rubbish in there. If we're going to go with a strict validation, I'd prefer a numeric ID from a specific namespace.

I don't see any problem writing mili_amper instead of mA.

You may find that other people do have a problem with that, given that the symbol mA is part of an international standard, whereas mili_amper is not. I would prefer to use the existing standard rather than invent a new one.

As I mentioned previously, I'd prefer not to validate strings at all. A numeric ID from an external namespace can't get messy. My opinion is that we are either very strict (namespace+number) or we leave it open (unvalidated string).

Steve

ameinhardt commented 6 years ago

Hi @bf-bryants, @bgusach, thanks for the valuable discussion! I would like to finalize the v3 proposal by the end of this month, so we should reach a conclusion. Keep in mind that v3 is supposed to be a sound basis (for extensions) and doesn't need to be an all-defined standard.

I keep proposing: The unit id is a not-clearly defined label if no namespace is defined. If a namespace is given, it shall be treated as nonambiguous id in that namespace. Basically as outlined in https://github.com/eclipse/unide/issues/35#issuecomment-385910371

I understand that flat maps could be easier to parse by computers. I prefer to keep a grouping, though. Parsers need to handle blocks in other areas as well. We should be consistent. Instead, if someone doesn't want to transfer objects, he could flatten with some kind of dot notation (costs some 10 ms to flatten/unflatten):

flatten({
  "content-spec": "urn:spec://eclipse.org/unide/measurement-message#v3",
  "device": {
    "id": "a4927dad-58d4-4580-b460-79cefd56775b"
  },
  "measurements": [
    {
      "ts": "2018-05-28T07:41:31.603Z",
      "series": {
        "time": [
          0,
          23,
          24
        ],
        "temp.1": [
          45.4231,
          46.4222,
          44.2432
        ],
        "temp.2": [
          42
        ]
      }
    }
  ]
})

results in

{
  "content-spec": "urn:spec://eclipse.org/unide/measurement-message#v3",
  "device.id": "a4927dad-58d4-4580-b460-79cefd56775b",
  "measurements.0.ts": "2018-05-28T07:41:31.603Z",
  "measurements.0.series.time.0": 0,
  "measurements.0.series.time.1": 23,
  "measurements.0.series.time.2": 24,
  "measurements.0.series.temp\\.1.0": 45.4231,
  "measurements.0.series.temp\\.1.1": 46.4222,
  "measurements.0.series.temp\\.1.2": 44.2432,
  "measurements.0.series.temp\\.2.0": 42
}
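
A dot-notation flattening with this behaviour can be sketched in a few lines. The function below is an illustration, not the tool used for the example above; it assumes that dots inside keys are escaped with a backslash and that array positions become numeric path segments, as the result suggests.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict with dot-separated paths."""
    if isinstance(value, dict):
        # Escape literal dots in keys so that "temp.1" stays one path segment.
        items = [(str(key).replace(".", "\\."), child) for key, child in value.items()]
    elif isinstance(value, list):
        items = [(str(index), child) for index, child in enumerate(value)]
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else key
        flat.update(flatten(child, path))
    return flat


message = {
    "device": {"id": "a4927dad-58d4-4580-b460-79cefd56775b"},
    "measurements": [
        {"ts": "2018-05-28T07:41:31.603Z", "series": {"time": [0, 23], "temp.1": [45.4231, 46.4222]}}
    ],
}
flat_message = flatten(message)
```

Note that in this sketch empty objects and arrays disappear from the output, so it is not fully reversible for such payloads.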

In the same way, one could apply additional transformation like cbor, gzip or other to the standard payload.

@bf-bryants, @bgusach, can you give a (preferably final) example, taking the discussion into account?

bgusach commented 6 years ago

@bf-bryants I stand by my opinion: a restricted string has the same reliability as an integer, and offers "good enough" readability. A free string offers great readability but loses all reliability for programming purposes. Just an example, the following two strings are different:

And I still think something like milli_ampere or micro_ampere is perfectly readable for everybody.

That's my opinion, I guess it's up to our BDFL @ameinhardt to decide 😄

ameinhardt commented 6 years ago

in my opinion that's up to the namespace. The ids in a custom namespace could be numbers as well as clearly defined strings. If no namespace is given, it's a "free string" which merely serves as a label or needs additional (outside PPMP payload) agreements between sender & receiver.
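
Read as receiver logic, this position could be sketched as follows. The function name `interpret_unit`, the shape of the returned dictionaries and the example namespace are assumptions for illustration only.

```python
def interpret_unit(unit_id, namespace, registries):
    """Treat the unit as a strict ID inside a namespace, or as a mere label without one."""
    if not namespace:
        # No namespace: a free string that only serves as a label.
        return {"kind": "label", "text": unit_id}
    known_ids = registries.get(namespace)
    if known_ids is None or unit_id not in known_ids:
        # Namespace given, but the receiver cannot match the ID exactly.
        return {"kind": "unknown", "text": unit_id}
    return {"kind": "id", "namespace": namespace, "definition": known_ids[unit_id]}


# Hypothetical namespace known to the receiver; IDs may be numbers written as strings.
registries = {"urn:example:units": {"76": "degree Celsius"}}

as_label = interpret_unit("°C", "", registries)
as_id = interpret_unit("76", "urn:example:units", registries)
as_unknown = interpret_unit("° C", "urn:example:units", registries)
```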

bgusach commented 6 years ago

@ameinhardt

I understand that flat maps could be easier to parse by computers. I prefer to keep a grouping, though. Parsers need to handle blocks in other areas as well. We should be consistent.

The most important aspect for me is not about grouping or not, but having a variable schema, i.e. having either a string or an object for the unit key. I think it is unnecessary complexity.

Then, almost as a matter of taste, I don't see the benefit of the grouping. Having unit and namespace directly under context.temperature looks good enough to me. But having it grouped in a subobject is also ok. I think it is not worth arguing over that little detail :)

In other words, my first choice would be:

context: {
  temperature: {
    unitID: "C", 
    namespace: "cool-namespace"  // optional
  }
}

And then this:

context: {
  temperature: {
    unit: { 
        id: "C", 
        namespace: "cool-namespace"  // optional
    } 
....

But I'm against allowing both a string or an object under the unit key, depending on whether there is a namespace or not.
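
The extra work a variable schema puts on every consumer can be seen in a small normalization helper. This is a sketch with a hypothetical function name `normalize_unit`; it assumes the grouped form with `id` and an optional `namespace`.

```python
def normalize_unit(unit):
    """Accept either a plain string or an {id, namespace} object and return one shape."""
    if isinstance(unit, str):
        # Simplified form: only the ID, no namespace.
        return {"id": unit, "namespace": ""}
    if isinstance(unit, dict) and "id" in unit:
        return {"id": unit["id"], "namespace": unit.get("namespace", "")}
    raise TypeError("unit must be a string or an object with an id")


from_string = normalize_unit("C")
from_object = normalize_unit({"id": "C", "namespace": "cool-namespace"})
```

With a fixed schema, the type checks and the error branch would not be needed at all.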

bgusach commented 6 years ago

in my opinion that's up to the namespace. The ids in a custom namespace could be numbers as well as clearly defined strings. If no namespace is given, it's a "free string" which merely serves as a label or needs additional (outside PPMP payload) agreements between sender & receiver.

I don't agree with that. If you allow a "free string", you allow it for every case: with or without namespace. And free strings are terrible IDs.

bf-bryants commented 6 years ago

Hi,

I'll combine my replies into one message.

I understand that flat maps could be easier to parse by computers.

I'm not aware of JSON parsers having this problem. On the contrary, the structure (as in V2) allowed us to directly reference sub-elements as complete objects. I also prefer the grouped structure.

Transformations such as flattening, CBOR, Gzip etc would be better left separate from V3 in my opinion - so that we can get the V3 content field definitions concluded in the very close future!

It would, however, be an interesting discussion point for V3.1 or V4.

I keep proposing: The unit id is a not-clearly defined label if no namespace is defined. If a namespace is given, it shall be treated as nonambiguous id in that namespace.

This is still my position.

It doesn't matter that the key is a free string, as it must still exactly match a key in the namespace. We use 'free strings' as key names in other places in PPMP, and that's not causing any problems.

A unit ID without a namespace is not a useful definition, and is at best only of use as an advisory label; I see no point in adding restrictions to it.

BTW: I have no preference as to whether a unit definition is in its own object or flat.

Steve

ameinhardt commented 5 years ago

Probably this discussion is not yet concluded, so I would second

It would, however, be an interesting discussion point for V3.1 or V4.

For v3, the current context object is some progress though.