Question: Is there a standard signal serialization format?

COVESA / vehicle_signal_specification

Vehicle Signal Specification - standardized way to describe automotive data

Mozilla Public License 2.0

Question: Is there a standard signal serialization format? #627

Open sophokles73 opened 1 year ago

sophokles73 commented 1 year ago

I was wondering if there is a standard mechanism/format for serializing VSS data to a byte array/stream. This could be used to transfer signal data from a vehicle to a back end application using e.g. MQTT and/or HTTP. Ideally, such a format would not be overly verbose. In particular, VSS data entry names like Vehicle.Powertrain.CombustionEngine.DieselExhaustFluid.Level could increase the payload size dramatically, so some form of metadata-based serialization like protobuf comes to mind. However, I haven't (yet) found a corresponding protobuf definition, or am I mistaken?

erikbosch commented 1 year ago

There is a vspec2protobuf.py tool in vss-tools which I believe was created for that purpose. There have also been some ideas to use the UUID concept to get short unique identifiers. Existing downstream protocols like VISS and KUKSA.val gRPC all use the full name to refer to signals, but the representation of data differs (JSON in VISS, native Proto types in KUKSA.val).

sophokles73 commented 1 year ago

Thank you for the pointer, @erikbosch

My understanding of the vspec2protobuf.py script is that it simply iterates over the Data Entry definitions and creates corresponding protobuf Message definitions. I am a little concerned regarding backwards compatibility when e.g. a new minor version of VSS is being released which maybe changes the ordering of the Data Entry defs and/or introduces new entries. My current feeling is that it will not necessarily be possible to use the resulting proto file to de-serialize a protobuf message that had been created using the original protobuf definition because of the property identifiers being shuffled, overridden etc.

FMPOV it would be helpful to introduce fixed identifiers for the VSS Data Entries which remain constant over time and cannot be reused. You mentioned the use of UUIDs being discussed for that matter. That could be one way of doing it. Another option with a smaller footprint might be to use a scheme similar to the one used in e.g. SNMP, where you assign a simple integer (counting from 1 up) to each node/leaf in each (sub-)tree. These identifiers could then be used as the property IDs in the protobuf Message definitions and could also be used as a more compact identifier in other serialization formats.

erikbosch commented 1 year ago

@adobekan - what is the status of your ideas to refactor UUID handling? Is it still just a thought? It seems to be related to the comment from Kai above.

adobekan commented 1 year ago

Proto could be used, but the problem would be how to exchange and manage the schema between integration points. What we were thinking of is related to what @sophokles73 mentions: a short UUID element of e.g. 3 bytes (1 byte for version/layers/source, e.g. what is public vs. private, and 2 bytes for a fixed id of each element, which stays with the leaf after creation).

@erikbosch Still on my ToDo list, soon we will start working on this.

sophokles73 commented 1 year ago

Here's an example of what I have in mind:

#
# The vehicle branch for highlevel vehicle signals and attributes.
#
Vehicle:
  type: branch
  id: 1
  description: High-level vehicle data.

# Include the Vehicle/Vehicle.vspec file and attach all its signals under the
# Vehicle branch created above.

#include Vehicle/Vehicle.vspec Vehicle

# Now define the VehicleIdentification subtree

VehicleIdentification:
  type: branch
  id: 1
  description: Attributes that identify a vehicle.

VehicleIdentification.VIN:
  datatype: string
  type: attribute
  id: 1
  description: 17-character Vehicle Identification Number (VIN) as defined by ISO 3779.

VehicleIdentification.WMI:
  datatype: string
  type: attribute
  id: 2
  description: 3-character World Manufacturer Identification (WMI) as defined by ISO 3780.

VehicleIdentification.Brand:
  datatype: string
  type: attribute
  id: 3
  description: Vehicle brand or manufacturer.

VehicleIdentification.Model:
  datatype: string
  type: attribute
  id: 4
  description: Vehicle model.

VehicleIdentification.Year:
  datatype: uint16
  type: attribute
  id: 5
  description: Model year of the vehicle.

Now the Vehicle.VehicleIdentification.Year data entry could also be referred to by 1.1.5 (either as a String or a sequence of uint16).

In a protobuf definition, this could also be used:

message Vehicle {
  VehicleVehicleIdentification VehicleIdentification = 1;
  ....
}

message VehicleVehicleIdentification {
  string VIN = 1;
  string WMI = 2;
  string Brand = 3;
  string Model = 4;
  uint32 Year = 5;
  ...
}

The IDs used in the message definitions are the values of the corresponding Data Entry definitions' id properties. These cannot be changed over time and if a new property is being added, a new id value is being defined in the vspec. Similarly, if a property is being removed, its id value will not be reused.

This way, it should be quite simple to make sure that protobuf message definitions generated from the vspec files remain backward compatible. The ids could also be used in other serialization formats like JSON in order to increase the payload vs. meta data ratio. It makes a big difference if I use Vehicle.VehicleIdentification.Year or just 1.1.5 and if I have a message that contains dozens of data points it adds up quite substantially ...

erikbosch commented 1 year ago

I like the idea, but some thoughts:

* Do we want the "final" identifier to be a string or a uint? It would be no problem for the Year-example above for vss-tools to generate (in the resulting CSV/Json/Yaml) both a string identifier `global_id_str: "1.1.5"` and/or a global numeric id `global_id: 0x010105` (one byte for each position, supporting at most 8 levels of identifiers so that an identifier always will fit in a uint64?)
* We must also consider instances. There it is more difficult to specify explicit identifiers; we either need to let the tool "invent" identifiers or have more complex statements like `instance_id: [Row1.DriverSide=1, Row1.PassengerSide=2, ...]`
* We may need a mechanism to block identifiers that already have been used, but where the signal/branch has been deleted
* We must agree on how much you are allowed to change a signal before it shall get a new identifier. Like if you change type/datatype/unit - do you then need to change identifier?
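
The numeric packing mentioned in the first point can be sketched as follows. This is only an illustration under the stated assumptions (one byte per level, at most 8 levels, level ids counting from 1); the function names `pack_id` and `unpack_id` are hypothetical and not part of vss-tools.

```python
def pack_id(dotted: str) -> int:
    """Pack a dotted identifier such as "1.1.5" into one integer, one byte per level."""
    parts = [int(p) for p in dotted.split(".")]
    if not 1 <= len(parts) <= 8:
        raise ValueError("only 1 to 8 levels fit into a uint64")
    value = 0
    for part in parts:
        # Assumption: level ids count from 1, so 0 never appears as a level id.
        if not 1 <= part <= 0xFF:
            raise ValueError(f"level id {part} does not fit into one byte")
        value = (value << 8) | part
    return value


def unpack_id(value: int) -> str:
    """Inverse of pack_id; relies on no level using the id 0."""
    parts = []
    while value:
        parts.append(value & 0xFF)
        value >>= 8
    return ".".join(str(p) for p in reversed(parts))


print(hex(pack_id("1.1.5")))  # 0x10105
print(unpack_id(0x010105))    # 1.1.5
```

One byte per level caps every branch at 255 direct children, which may or may not be acceptable for the real tree.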

adobekan commented 1 year ago

I like your proposal with the id tag, but I think it has to be a bit more unique. I would say that each leaf needs a short unique number that stays with that leaf and even allows us to trace the leaf, plus a value for overlays. Then we can identify whether a leaf comes from the main repo, is a new concept layer, or is a private modification.

Where I see challenges:

* What about scenarios where we move a leaf, e.g. we decide to move a leaf one branch up or down, or we do some reorganization in a new version? How shall we handle this? For example, 1.1.5 could become 1.2.5.
* If you use protobuf as you have proposed, how can we allow dynamic combination of leaves in each message? E.g. my system does not always update all leaves of VehicleVehicleIdentification at the same time. Moreover, we can consider a scenario where we also combine leaves from different parts of the VSS tree; call it MyCustomMessage.

adobekan commented 1 year ago

I would prefer that we go for a hex value of at least 4 bytes: byte 0 for layer concepts and bytes 1-3 for a generated id, which should be enough to cover us for the next few decades. :) Handling instances will be really challenging. I guess we should discuss this and see what would be the easiest way.

sophokles73 commented 1 year ago

@erikbosch

> We may need a mechanism to block identifiers that already have been used, but where the signal/branch has been deleted

When making incompatible changes like deleting/renaming a Data Entry or changing its type in an incompatible way, then we will need to create a new major version of the VSS spec, won't we? An application that was built using, say, VSS version 3 can (in general) not be expected to work with VSS version 4 without any alterations, right? Consequently, I would assume that it would be ok to change the numeric identifiers in between major version changes in an incompatible way as well, or am I mistaken?

IMHO this means that we can only uniquely identify a Data Entry by means of the combination of the VSS (major) version and the path identifier (e.g. 1.1.5). We could thus also include the VSS (major) version in the path itself, e.g. we could prefix the path with the major version: 3.1.1.5.

> We must agree on how much you are allowed to change a signal before it shall get a new identifier. Like if you change type/datatype/unit - do you then need to change identifier?

IMHO this will be analogous to how much you can change before you need to assign a new name (and thus need to do it in a new major version).

erikbosch commented 1 year ago

@sophokles73 - for transport purposes I believe you are correct, but if we want to use the identifier also for backend purposes it might be relevant, for example if a server either supports multiple VSS versions or needs to migrate stored historical data from version X to version Y. If signal X.Y changes its type (and meaning) from bool to int, then old historical values do not make sense; in the backend database they must be treated as "different signals". On the other hand, if we move/rename "Vehicle.Speed" to "Vehicle.Status.Speed", we could theoretically reuse/keep/migrate the old values, if their meaning has not changed.

SebastianSchildt commented 1 year ago

I think one major question is: do you want identifiers to save bytes/processing for serialisation, or is it important that they also represent the underlying model?

The first case is easy, and may be all that is needed for many applications: you just hash the path name with a robust hash and use however many bytes you are comfortable with (with a static model you can even check for collisions, so very few bytes are OK). That way the identifier for Vehicle.Speed has the same length as the one for Vehicle.Cabin.Seat.Row1.x.y.v. However, it requires that in your deployment you make sure both ends refer to the same VSS model, as you cannot verify that the model/metadata under a given ID is the same.
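
A minimal sketch of this path-hashing idea, assuming SHA-256 as the "robust hash" and a 3-byte truncation. Both choices, and the helper names `path_id` and `build_id_table`, are illustrative assumptions rather than anything defined by VSS.

```python
import hashlib


def path_id(path: str, num_bytes: int = 3) -> int:
    """Truncate the SHA-256 digest of a signal path to num_bytes bytes."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:num_bytes], "big")


def build_id_table(paths, num_bytes: int = 3) -> dict:
    """Map id -> path for a static model and fail loudly on a collision."""
    table = {}
    for path in paths:
        ident = path_id(path, num_bytes)
        if ident in table and table[ident] != path:
            raise ValueError(f"collision between {table[ident]} and {path}")
        table[ident] = path
    return table


paths = [
    "Vehicle.Speed",
    "Vehicle.VehicleIdentification.VIN",
    "Vehicle.VehicleIdentification.Year",
]
table = build_id_table(paths)
for ident, path in table.items():
    print(f"{ident:06x} -> {path}")
```

Because the model is static, the collision check can run once at build time; if it fails, the table can be rebuilt with a larger `num_bytes`.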

The other extreme is some "Merkle-style" hashing where you also hash all the VSS metadata and all children. That way the same hash on a given branch means exactly the same model beneath it. That would be good for seeing that "hey, the Vehicle.ADAS.* model really is 100% the same", but for practical purposes just adding one signal below destroys similarity all the way up.

Maybe more practical is just doing it on a leaf basis: VSS metadata and Path in a hash.
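
The difference between the branch-level and the leaf-level variant can be illustrated with a small sketch. The nested-dict tree layout, the choice of SHA-256, and the function name `node_hash` are assumptions made for this example; a leaf-only scheme as described above would feed the full path instead of the node name.

```python
import hashlib


def node_hash(name: str, node: dict) -> str:
    """Hash a node's name and metadata together with the hashes of all its children."""
    h = hashlib.sha256()
    h.update(name.encode("utf-8") + b"\n")
    # Own metadata first, in a stable key order.
    for key in sorted(k for k in node if k != "children"):
        h.update(f"{key}={node[key]}\n".encode("utf-8"))
    # Then the hashes of the children, which makes the scheme Merkle-style.
    children = node.get("children", {})
    for child_name in sorted(children):
        h.update(node_hash(child_name, children[child_name]).encode("utf-8") + b"\n")
    return h.hexdigest()


tree = {
    "type": "branch",
    "children": {
        "VIN": {"type": "attribute", "datatype": "string"},
        "Year": {"type": "attribute", "datatype": "uint16"},
    },
}
branch_before = node_hash("VehicleIdentification", tree)
year_before = node_hash("Year", tree["children"]["Year"])

# Adding one signal changes the branch hash but leaves other leaf hashes alone.
tree["children"]["Brand"] = {"type": "attribute", "datatype": "string"}
branch_after = node_hash("VehicleIdentification", tree)
year_after = node_hash("Year", tree["children"]["Year"])

print(branch_before != branch_after)  # True
print(year_before == year_after)      # True
```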

Tracking "movements" of data in the tree, however, is really hard with this. I am not sure there is a better option for that than really having a kind of "id database" that you ship with the spec and where you could do such things manually, if you really want to. I don't see a good way to do this automatically, because obviously, going the hashing way, you cannot include the path, but then you also do not want the id to change if you e.g. fix a typo in a comment or description. If you leave all those volatile things out, suddenly the system would determine that everything that is "uint8 with min 0 max 100" is really "the same". But as pointed out already, maybe there is no real use case for that either, because if the model is changed that way, you up the version number or mark it as different.

I think there is no silver bullet here, but for the OP's request of "using it for more efficient serialisation/addressing", I feel hashing paths and making sure via deployment/tech stack that both sides are on the "same" VSS model is best. That would even be more robust than the "numbering" scheme in cases where there are composite models, where things are added via e.g. overlays, or left out, because as long as you don't CHANGE the metadata of a datapoint, it can still be reliably referenced.

sophokles73 commented 1 year ago

My original concern in this issue was:

How do we make sure that a message's property numbers do not change in an incompatible way during minor version changes of the VSS spec?

I do not really understand how the discussion about moving signals across the tree is related to this problem as FMPOV doing something like that will always result in a breaking change which would result in a major version change. So I wonder if automagical migration of data across major version changes actually is a use case/requirement? So far I haven't read anything about that in the context of VSS ...

However, the problem I have stated above is a real world issue/concern that I ran into as soon as I started transmitting any VSS data between components that have not been implemented as part of the same project/system.

erikbosch commented 1 year ago

Warning - very long comment! Feedback if this would be a reasonable approach is welcome!

I came up with a possible idea for managing unique identifiers and handling version control. What about having a file id.csv (or similar) with lines containing <id>,<path>,<hash>, where id is a numeric identifier and hash is something that represents important characteristics of the signal like datatype/unit and possibly also description? I.e. having a list like this:

1, Vehicle.A,0x783487
2, Vehicle.B,0x932765
3, Vehicle.C,0x178333
4, Vehicle.C.A,0x437230
5, Vehicle.C.B,0x947232

If a new signal is added to the standard catalog the list needs to be extended. Tooling could help with that.

1, Vehicle.A,0x783487
2, Vehicle.B,0x932765
3, Vehicle.C,0x178333
4, Vehicle.C.A,0x437230
5, Vehicle.C.B,0x947232
6, Vehicle.D,0x555555

If a signal is renamed but its semantic meaning and hash remain the same, then you can just add a new line with the same identifier as before. For example, if Vehicle.A is to be renamed to Vehicle.AA, one could just add a line for id 1:

1, Vehicle.A,0x783487
1, Vehicle.AA,0x783487 // Second instance of 1, same hash as important fields are unchanged
2, Vehicle.B,0x932765
3, Vehicle.C,0x178333
4, Vehicle.C.A,0x437230
5, Vehicle.C.B,0x947232

That would practically mean that Vehicle.A and Vehicle.AA could be treated as synonyms. If backward compatibility is not needed the line for Vehicle.A could be removed.

On the other hand, if the meaning of a signal changes, for example through a new unit or a new description, the new hash must be assigned to a new id. For example, if Vehicle.A changes its unit from km/h to m/s:

1, Vehicle.A,0x783487
2, Vehicle.B,0x932765
3, Vehicle.C,0x178333
4, Vehicle.C.A,0x437230
5, Vehicle.C.B,0x947232
6, Vehicle.A,0x343487 // Second instance of Vehicle.A, using new id

But if the change affects the hash while the semantics stay the same, we could just add the new hash with the old id.

* Example 1: We have added unit: percent, but it has always been implicit that the signal describes a percentage value.
* Example 2: We have added min/max values, but this should not have any practical implications.

1, Vehicle.A,0x783487
1, Vehicle.A,0x343487 // Second instance of Vehicle.A, keeping same id
2, Vehicle.B,0x932765
3, Vehicle.C,0x178333
4, Vehicle.C.A,0x437230
5, Vehicle.C.B,0x947232

This would work for instances as well, like the PassengerSide/DriverSide example. They would have the same hash, but PassengerSide and DriverSide would have different ids. We could even define Left/Right as "aliases", possibly in a different file as a vehicle-specific overlay:

1, Vehicle.Seat.DriverSide.Position,0x783487 // First occurrence of id shows "official" name
2, Vehicle.Seat.PassengerSide.Position,0x783487
1, Vehicle.Seat.Left.Position,0x783487 // Works as an alias, but needs to be customized depending on if car is LHD or RHD
2, Vehicle.Seat.Right.Position,0x783487

One could even think of id-ranges so that any custom signals added must have ID>0xFFFF to avoid possible collision with future VSS standard signals.

A file like this could potentially be useful also in cases where you do not need the id as an identifier for write/read/transmit. A tool like KUKSA.val could do a lookup in the file, and if someone requests Vehicle.Seat.Left.Position then KUKSA.val could replace it with a call for Vehicle.Seat.DriverSide.Position. We could also include/embed the information when we generate JSON/Yaml, so that tools like KUKSA.val can easily find id, hash and synonyms in the generated and expanded JSON/Yaml.
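
The alias lookup described above can be sketched in a few lines. This is an illustration only: the file format is the proposed `<id>,<path>,<hash>` layout, trailing `//` comments are left out of the sample data, and the helper names `load_ids` and `resolve` are hypothetical (they are not KUKSA.val or vss-tools functions).

```python
import csv
import io

ID_FILE = """\
1, Vehicle.Seat.DriverSide.Position,0x783487
2, Vehicle.Seat.PassengerSide.Position,0x783487
1, Vehicle.Seat.Left.Position,0x783487
2, Vehicle.Seat.Right.Position,0x783487
"""


def load_ids(text: str):
    """Return (path -> id, id -> official path); the first line seen for an id is official."""
    path_to_id = {}
    official = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue
        ident, path = int(row[0]), row[1].strip()
        path_to_id[path] = ident
        official.setdefault(ident, path)
    return path_to_id, official


def resolve(path: str, path_to_id: dict, official: dict) -> str:
    """Replace an alias with the official name that shares its id."""
    return official[path_to_id[path]]


path_to_id, official = load_ids(ID_FILE)
print(resolve("Vehicle.Seat.Left.Position", path_to_id, official))
# Vehicle.Seat.DriverSide.Position
```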

adobekan commented 1 year ago

@erikbosch

I started scribbling something similar. I wanted to use yaml here, and then attach IDs to the tree with an overlay. In this case even instances would not be too complicated to handle.

As you mention, additional checks are needed when a leaf is moved but its ID no longer fits, or when its datatype has changed. We can check this in the tooling. In a yaml structure it would also be easy to append leaf changes and comments.

class UidGenerator:  # class name assumed; the posted snippet starts at __init__
    def __init__(self, offset=0):
        self.offset = offset
        self.layer_bits = 7  # Number of bits reserved for layers
        self.incremental_bits = 24  # Number of bits for incremental value

        # Calculate the maximum values for each part
        self.max_layer_value = 2 ** self.layer_bits - 1
        self.max_incremental_value = 2 ** self.incremental_bits - 1

        # Initialize the current values
        self.current_layer = 0
        self.current_incremental = self.offset % (self.max_incremental_value + 1)

    def generate_uid(self):
        # Increment the incremental value, wrapping around at the maximum
        self.current_incremental = (self.current_incremental + 1) % (self.max_incremental_value + 1)

        # Build the UID by combining the layer and incremental values
        uid = (self.current_layer << self.incremental_bits) | self.current_incremental

        # Encode the UID as an 8-digit hexadecimal string
        return f"{uid:08x}"

    def set_layer(self, layer):
        if layer < 0 or layer > self.max_layer_value:
            raise ValueError(f"Layer value should be between 0 and {self.max_layer_value}")

        if layer < 64:
            print("Note: The first 64 values of layer_bits are reserved for COVESA public repo.")

        self.current_layer = layer

    def set_offset(self, offset):
        if offset < 0 or offset > self.max_incremental_value:
            raise ValueError(f"Offset value should be between 0 and {self.max_incremental_value}")

        self.offset = offset
        self.current_incremental = offset % (self.max_incremental_value + 1)

UlfBj commented 1 year ago

Another alternative, implemented in the VISSv2 reference implementation as an experimental compression, is to create an array of all leaf node paths in the tree and then sort it. The index into the array can then be used to uniquely represent the path of each leaf node. This can be extended to include all nodes, not only the leaf nodes. Encoding/decoding is quite efficient.

The hashing operation proposed in other alternatives is here replaced by a sorting operation. A uint16 (two bytes) is sufficient for trees with max 65535 leaf nodes. The problem of making sure that the same tree version is used at both ends remains here, as I believe it does in most other alternatives.
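
A minimal sketch of the sorted-array idea, written in Python for illustration. It is not taken from the VISSv2 reference implementation; the path list and the function names `encode` and `decode` are assumptions.

```python
from bisect import bisect_left

# Both ends must build this list from the same tree version.
PATHS = sorted([
    "Vehicle.Speed",
    "Vehicle.VehicleIdentification.VIN",
    "Vehicle.VehicleIdentification.Year",
    "Vehicle.VehicleIdentification.Brand",
])


def encode(path: str) -> int:
    """Return the index of path in the sorted list (binary search)."""
    i = bisect_left(PATHS, path)
    if i == len(PATHS) or PATHS[i] != path:
        raise KeyError(path)
    return i


def decode(index: int) -> str:
    """Return the path stored at index."""
    return PATHS[index]


index = encode("Vehicle.VehicleIdentification.Year")
print(index)                     # 3
print(index.to_bytes(2, "big"))  # b'\x00\x03'
print(decode(index))             # Vehicle.VehicleIdentification.Year
```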

adobekan commented 1 year ago

@UlfBj

Could you please provide a link or an example? If I follow the explanation, wouldn't this already cause issues if vehicles are not configured with exactly the same number of leaves with the same names? E.g.:

* Vehicle A has 900 leaves (random)
* Vehicle B has 600 leaves (random)
* Vehicle C has a name change in one leaf out of 700 (random)

I agree that 2 bytes would be enough, especially if you combine it with layer mapping.

sophokles73 commented 1 year ago

@UlfBj

> Another alternative, implemented in the VISSv2 reference implementation as an experimental compression, is to create an array of all leaf node paths in the tree and then sort it. The index into the array can then be used to uniquely represent the path of each leaf node. This can be extended to include all nodes, not only the leaf nodes. Encoding/decoding is quite efficient.

What about adding a new signal to an existing node? This represents a backward compatible change to the VSS tree but would most likely screw up the array index, wouldn't it? If we were using the array index as the property IDs in the protobuf file this would lead to a non-backward compatible protobuf definition, wouldn't it?
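
The concern can be made concrete with a small sketch, assuming indices are simply positions in the sorted path list; the helper name `index_table` and the two example trees are illustrative.

```python
def index_table(paths):
    """Assign each path its position in the sorted list of paths."""
    return {path: i for i, path in enumerate(sorted(paths))}


v1 = [
    "Vehicle.VehicleIdentification.Brand",
    "Vehicle.VehicleIdentification.VIN",
    "Vehicle.VehicleIdentification.Year",
]
# A backward compatible change: one new signal is added.
v2 = v1 + ["Vehicle.VehicleIdentification.Model"]

old, new = index_table(v1), index_table(v2)
changed = {p: (old[p], new[p]) for p in old if old[p] != new[p]}
print(changed)
# {'Vehicle.VehicleIdentification.VIN': (1, 2), 'Vehicle.VehicleIdentification.Year': (2, 3)}
```

Every path that sorts after the new entry gets a new index, so the indices are only meaningful together with the tree version they were computed from.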

UlfBj commented 1 year ago

@adobekan The solution requires that both the server and the client have access to the same version of the tree/path array. A typical scenario could be that a client initially sends a request to get the tree version data from the server. This interaction does not use path compression. The client then needs to make sure it has access to the tree/path array of that version. So it requires a version synchronization between client and server, as a path index is valid only for a specific version.

Regarding an example, the client at this link implements it as the protobuf compression, one of a few different compression experiments: https://github.com/w3c/automotive-viss2/tree/master/client/client-1.0/compress_client The encoding/decoding used by client and server is done in this file: https://github.com/w3c/automotive-viss2/blob/master/utils/pbutils.go

UlfBj commented 1 year ago

@sophokles73 If a new node is added to an existing tree, the tree should also get a version update. Assuming that this new version of the tree is accessible by both end points, a version synchronization as described above should fix it. There is no reason to use the index as the property ID in protobuf; it should rather be treated as data in the message. That is how it is done here: https://github.com/w3c/automotive-viss2/blob/master/protobuf/VISSv2messages.proto

adobekan commented 1 year ago

message GetRequestMessage {
    string Path = 1;
    optional FilterExpressions Filter = 2;
    optional string Authorization = 3;
    optional string RequestId = 4;
}

message SetRequestMessage {
    string Path = 1;
    string Value = 2;
    optional string Authorization = 3;
    optional string RequestId = 4;
}

Looking at the proto file, you basically have something like a hashmap, but you are not using the benefits of protobuf when it comes to reducing message size. You are still using the path as identifier, and the payload is always a string, which can be quite dangerous on version changes. In this approach I would suggest at least defining Value as a oneof, which is supported in proto.

The other challenge with the array, sorting and compression is related to the number of leaves: we cannot assume that each vehicle will support all leaves. This has nothing to do with the VSS version. A SeatHeating status might not exist in every single vehicle in the fleet; the feature might simply not be there, and this might cause additional challenges. Of course one can always think about ways to handle this, keep 10k different variations for 30 million vehicles, and involve some handshake process.

This is why I would prefer to have small static IDs of 2-3 bytes assigned to leaves, not directly in the vspec files. Then you can get close to the numbers of static binary serialization when it comes to message size, while also keeping historical tracking of each leaf.

UlfBj commented 1 year ago

If you look at the DataPackages message below, which is what is sent back in the response, the path can there be an int32. The same can of course be done in the request message, I just did not implement it. Most paths are likely to be found in response messages anyway.

message DataPackages {
    message DataPackage {
        optional string Path = 1;
        optional int32 PathC = 2;

        message DataPoint {
            string Value = 1;
            optional string Ts = 2;
            optional int32 TsC = 3;
        }
        repeated DataPoint Dp = 3;
    }
    repeated DataPackage Data = 1;
}