Open Lance-Drane opened 4 months ago
Note that most of this discussion applies only to workflows where we send the data directly through the message instead of through MINIO or another data mechanism. However, we still need to address the "Only use Pydantic for serializing and deserializing application/json Content-Types in our own library" bullet point for MINIO or other data mechanisms to fully work.
Currently, internal INTERSECT logic does not utilize protocol headers at all, but serializes metadata and the message "payload" (the actual scientific data) into a single JSON blob. So an actual "message payload" is just JSON of one of the following three types:
One immediate problem this presents is sending binary data over the wire, as not everything serializes into JSON (quick example: raw PNG images). The common solution is to encode the binary data as Base64, but this is not a viable approach for larger data, given that Base64 inflates the size of the data by roughly a third (every 3 bytes of input become 4 characters of output).
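To make the Base64 overhead concrete, here is a quick measurement. The 3 MB random payload is an arbitrary stand-in for binary scientific data, and `base64_overhead` is a hypothetical helper name, not part of INTERSECT.

```python
import base64
import os


def base64_overhead(raw: bytes) -> float:
    """Return the ratio of Base64-encoded size to raw size."""
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)


# Arbitrary stand-in for a binary blob such as a raw PNG image.
payload = os.urandom(3_000_000)
ratio = base64_overhead(payload)
print(f"raw: {len(payload)} bytes, encoded/raw ratio: {ratio:.3f}")
```

Wrapping the Base64 string in a JSON document adds quoting and metadata on top of this ratio.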
This is where allowing users to specify the Content-Type of their events comes in handy. We can directly pass this as a protocol header in actual messages. We can also use it to auto-reject messages when their Content-Type header does not match the expected Content-Type of the request operation.
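A minimal sketch of the auto-reject check described above, assuming headers arrive as a plain dictionary and the operation declares a single expected Content-Type. The function names and the dictionary shape are hypothetical and not taken from the INTERSECT-SDK.

```python
from typing import Mapping, Optional


def normalize_media_type(content_type: str) -> str:
    """Drop parameters such as '; charset=utf-8' and lowercase the media type."""
    return content_type.split(";", 1)[0].strip().lower()


def content_type_matches(header_value: Optional[str], expected: str) -> bool:
    """Return True only if the header is present and names the expected media type."""
    if header_value is None:
        return False
    return normalize_media_type(header_value) == normalize_media_type(expected)


def should_reject(headers: Mapping[str, str], expected: str) -> bool:
    """Decide whether a message should be rejected before its payload is touched."""
    return not content_type_matches(headers.get("Content-Type"), expected)


print(should_reject({"Content-Type": "image/png"}, "application/json"))
```

Checking the header before looking at the payload means a mismatched message can be dropped without attempting to deserialize it.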
Note that if the user is sending back a complex object (lists, BaseModels, dataclasses...), but the complex object uses binary data, it will probably be necessary to encode the binary value as a Base64 string, and use the contentMediaType property in the schema to represent the true MIME type. Official JSON Schema docs here. An alternative users could implement for events would be to first emit a custom "metadata" typed INTERSECT event, then emit the associated "data" event. The "metadata" event would need to be able to reference the "data" event. (I think that the Base64-encoding approach will probably still be necessary for response messages, though.) There are some Pydantic classes which can help with this.
Protocols
We have a full list of protocols we want to support here. Here is a list of protocols supported by AsyncAPI officially; note that the protocols specification in AsyncAPI is extensible and not limited to their definitions.
Protocols which support protocol-level headers
Not a complete list, may be inaccurate.
With all of these, it makes sense to first try to use established headers; Content-Type is a common header. If we can't find a common header, we can use an X-Intersect-SDK- prefix value in the header.
pika.BasicProperties
paho.mqtt.properties.Properties
Protocols which do NOT support protocol-level headers
Not a complete list, may be inaccurate.
Note that while it's not impossible to communicate across these protocols with non-JSON data, we would need to create our own header encoding logic. Since all metadata in headers can already be expressed as printable UTF-8 strings, the only binary data should be in the message payload itself. We can prohibit non-printable characters in header keys and values, and use specific control characters as separators for the various parts of the message.
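One possible shape for the custom header encoding described above, as a sketch only. The choice of the ASCII unit, record, and group separators (0x1F, 0x1E, 0x1D) is an assumption for illustration; the issue does not specify which control characters to use. The function names are also hypothetical.

```python
from typing import Dict, Tuple

UNIT_SEP = "\x1f"     # assumed separator between a header key and its value
RECORD_SEP = "\x1e"   # assumed separator between header entries
GROUP_SEP = b"\x1d"   # assumed separator between the header block and the payload


def encode_message(headers: Dict[str, str], payload: bytes) -> bytes:
    """Prefix a binary payload with printable UTF-8 headers."""
    for key, value in headers.items():
        # Printable-only keys and values guarantee no separator appears in the headers.
        if not key or not key.isprintable() or not value.isprintable():
            raise ValueError(f"non-printable or empty header entry: {key!r}")
    block = RECORD_SEP.join(f"{k}{UNIT_SEP}{v}" for k, v in headers.items())
    return block.encode("utf-8") + GROUP_SEP + payload


def decode_message(raw: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split a framed message back into headers and the untouched payload."""
    # Only the FIRST group separator is structural; the payload may contain any bytes.
    block, sep, payload = raw.partition(GROUP_SEP)
    if not sep:
        raise ValueError("missing header/payload separator")
    headers: Dict[str, str] = {}
    if block:
        for record in block.decode("utf-8").split(RECORD_SEP):
            key, unit, value = record.partition(UNIT_SEP)
            if not unit:
                raise ValueError("malformed header entry")
            headers[key] = value
    return headers, payload


framed = encode_message({"Content-Type": "image/png"}, b"\x89PNG\x1d\x00raw")
print(decode_message(framed))
```

Because header keys and values are restricted to printable characters, the separators can never occur inside the header block, so the payload needs no escaping and is passed through byte for byte.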
Protocols with limited support for protocol-level headers
Not a complete list, may be inaccurate
event: metadata). The associated data with this can look like whatever we want. If we want to send the metadata WITH the data, we must use custom encoding logic.
Proposed action items
Only use Pydantic for serializing and deserializing application/json Content-Types in our own library. Otherwise, we just verify that the output value is in bytes/bytearray format.
For Content-Types other than application/json, we require the input/output fields to be either bytes or bytearray (note that str assumes a UTF-8 encoding, and valid UTF-8 objects should always be serializable as JSON already). Users will need to perform the appropriate conversions with their preferred library. I do not think it would be a good idea to include tons of different libraries in the INTERSECT-SDK for binary formats. This still allows us to generate a valid JSON schema.
Do not allow users to specify non-printable characters in any Content-Type definition. (This is an interesting discussion regarding a media type regex, if we want to further restrict Content-Types.)
Note that these changes should be considered breaking.
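The checks in the action items above could look roughly like the following. This is a sketch under stated assumptions: the standard-library json module stands in for Pydantic so the example is self-contained, and both function names are hypothetical rather than existing INTERSECT-SDK APIs.

```python
import json
from typing import Any

JSON_CONTENT_TYPE = "application/json"


def validate_content_type_definition(content_type: str) -> str:
    """Reject empty Content-Type definitions or ones with non-printable characters."""
    if not content_type or not content_type.isprintable():
        raise ValueError(f"invalid Content-Type definition: {content_type!r}")
    return content_type


def serialize_output(value: Any, content_type: str) -> bytes:
    """Serialize JSON content ourselves; only type-check everything else."""
    validate_content_type_definition(content_type)
    if content_type == JSON_CONTENT_TYPE:
        # Stand-in for Pydantic serialization in the real library.
        return json.dumps(value).encode("utf-8")
    # For any other Content-Type the user must already have produced raw bytes.
    if not isinstance(value, (bytes, bytearray)):
        raise TypeError(
            f"{content_type} output must be bytes or bytearray, got {type(value).__name__}"
        )
    return bytes(value)


print(serialize_output({"x": 1}, "application/json"))
print(serialize_output(bytearray(b"\x89PNG"), "image/png"))
```

Keeping format-specific conversion in user code is what lets the library avoid depending on a separate package for every binary format.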