Rework messaging handling to directly use protocol headers, support multiple content-types

Currently, internal INTERSECT logic does not utilize protocol headers at all, but serializes metadata and the message "payload" (the actual scientific data) into a single JSON blob. So an actual "message payload" is just JSON of one of the following three types:

UserspaceMessage
EventMessage
LifecycleMessage

One immediate problem arises when we want to send binary data over the wire, as not everything serializes into JSON (quick example: raw PNG images). The common solution for this is to encode the binary data as Base64, but this is not a viable approach for larger data, given that Base64 inflates the size of the data by roughly a third (every 3 bytes of input become 4 characters of output).
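A quick way to see the Base64 overhead, using only the standard library. The 3000-byte buffer is an arbitrary stand-in for real binary data:

```python
import base64

# Arbitrary stand-in for binary data such as a PNG image.
raw = bytes(3000)
encoded = base64.b64encode(raw)

# Base64 maps every 3 input bytes to 4 output characters.
overhead = len(encoded) / len(raw) - 1
print(len(raw), len(encoded), f"{overhead:.0%}")
```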

This is where allowing users to specify the Content-Type of their events comes in handy. We can directly pass this as a protocol header in actual messages. We can also use it to auto-reject messages when their Content-Type header does not match the expected Content-Type of the request operation.
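A minimal sketch of what the auto-reject check could look like. The function name is hypothetical and not part of the SDK. It assumes media types are compared case-insensitively, that parameters such as charset are ignored, and that a missing header counts as a mismatch:

```python
from typing import Optional


def content_type_matches(received: Optional[str], expected: str) -> bool:
    """Return True if the received Content-Type names the expected media type.

    Hypothetical helper: ignores case and parameters (e.g. "; charset=utf-8").
    A missing header is treated as a mismatch, which is an assumption.
    """
    if received is None:
        return False

    def media_type(value: str) -> str:
        # Keep only the "type/subtype" part, before any ";" parameters.
        return value.split(";", 1)[0].strip().lower()

    return media_type(received) == media_type(expected)


# A request operation expecting JSON would reject a PNG payload.
print(content_type_matches("image/png", "application/json"))
```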

Note that if the user is sending back a complex object (lists, BaseModels, dataclasses...) that contains binary data, it will probably be necessary to encode the binary value as a Base64 string and use the contentMediaType property in the schema (alongside contentEncoding) to represent the true MIME type. Official JSON Schema docs here. An alternative users could implement for events would be to first emit a custom "metadata"-typed INTERSECT event, then emit the associated "data" event. The "metadata" event would need to be able to reference the "data" event. (I think that the Base64-encoding approach will probably still be necessary for response messages, though.) There are some Pydantic classes which can help with this.
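As an illustration of the Base64 approach for complex objects, here is a standard-library sketch of a JSON document carrying a binary field, together with a schema fragment describing it. The field names "label" and "image" are made up for this example, and the byte string is only the PNG signature rather than a full image:

```python
import base64
import json

# Only the 8-byte PNG signature; stands in for a real image.
png_bytes = b"\x89PNG\r\n\x1a\n"

# Schema fragment: the binary field is a string whose true type is
# described by the JSON Schema contentEncoding / contentMediaType keywords.
schema_fragment = {
    "type": "object",
    "properties": {
        "label": {"type": "string"},
        "image": {
            "type": "string",
            "contentEncoding": "base64",
            "contentMediaType": "image/png",
        },
    },
    "required": ["label", "image"],
}

# Sender: Base64-encode the bytes so the whole object is valid JSON.
message = json.dumps(
    {"label": "sample", "image": base64.b64encode(png_bytes).decode("ascii")}
)

# Receiver: parse the JSON, then decode the field back to raw bytes.
decoded = base64.b64decode(json.loads(message)["image"])
```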

Protocols

We have a full list of protocols we want to support here. Here is a list of protocols officially supported by AsyncAPI; note that the protocols specification in AsyncAPI is extensible and not limited to their definitions.

Protocols which support protocol-level headers

Not a complete list, may be inaccurate.

With all of these, it makes sense to first try to use established headers - Content-Type is a common header. If we can't find a common header, we can use an X-Intersect-SDK- prefix value in the header.

AMQP 0.9.1 - see the section on basic fields at the reference, for Python check pika.BasicProperties
MQTT 5.0.0 - see section 3.1.2.11.8 in the MQTT v5 specification, in Python check paho.mqtt.properties.Properties
HTTP - check Mozilla docs. Note that support for Server-Sent Events will be more limited.
Pulsar - it looks like a Message object has a 'properties' field (API)
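For the protocols above, the header naming rule (established header first, X-Intersect-SDK- prefix otherwise) could be sketched as follows. The metadata field names and the mapping table are assumptions for illustration, not the SDK's actual fields:

```python
# Hypothetical mapping from internal metadata fields to established headers.
ESTABLISHED_HEADERS = {"content_type": "Content-Type"}
CUSTOM_PREFIX = "X-Intersect-SDK-"


def header_name(field: str) -> str:
    """Map a snake_case metadata field name to a protocol header name."""
    if field in ESTABLISHED_HEADERS:
        return ESTABLISHED_HEADERS[field]
    # No common header exists, so fall back to the prefixed custom form.
    return CUSTOM_PREFIX + "-".join(part.capitalize() for part in field.split("_"))


print(header_name("content_type"))
print(header_name("message_id"))
```

The resulting names would then be placed wherever each protocol keeps its headers, for example the objects referenced in the list above.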

Protocols which do NOT support protocol-level headers

Not a complete list, may be inaccurate.

Note that while it's not impossible to communicate across these protocols with non-JSON data, we would need to create our own header encoding logic. Since all metadata in headers can already be expressed as printable UTF-8 strings, the only binary data should be in the message payload itself. We can prohibit non-printable characters in header keys and values, and use specific control characters as separators for the various parts of the message.

MQTT 3.x - user-defined properties were only introduced in MQTT v5
Redis
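For protocols like these, the custom header encoding described above could look roughly like the following. This is a sketch under stated assumptions, not a proposed wire format: the choice of the ASCII unit, record, and group separators and the function names are illustrative only. Because header keys and values may contain printable characters only, the first group separator in a frame always marks the end of the header block, even if the binary payload contains the same byte.

```python
from typing import Dict, Tuple

UNIT_SEP = b"\x1f"    # between a header key and its value
RECORD_SEP = b"\x1e"  # between individual headers
GROUP_SEP = b"\x1d"   # between the header block and the payload


def _encode_header_text(text: str) -> bytes:
    # Reject control characters so separators can never appear in headers.
    if not text.isprintable():
        raise ValueError(f"non-printable character in header text: {text!r}")
    return text.encode("utf-8")


def encode_message(headers: Dict[str, str], payload: bytes) -> bytes:
    header_block = RECORD_SEP.join(
        _encode_header_text(key) + UNIT_SEP + _encode_header_text(value)
        for key, value in headers.items()
    )
    return header_block + GROUP_SEP + payload


def decode_message(frame: bytes) -> Tuple[Dict[str, str], bytes]:
    # Split on the FIRST group separator only; the payload is left untouched.
    header_block, sep, payload = frame.partition(GROUP_SEP)
    if not sep:
        raise ValueError("frame has no header/payload separator")
    headers: Dict[str, str] = {}
    if header_block:
        for record in header_block.split(RECORD_SEP):
            key, unit_sep, value = record.partition(UNIT_SEP)
            if not unit_sep:
                raise ValueError("malformed header record")
            headers[key.decode("utf-8")] = value.decode("utf-8")
    return headers, payload


frame = encode_message({"Content-Type": "image/png"}, b"\x89PNG\x1d\x1e\x1f")
headers, payload = decode_message(frame)
```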

Protocols with limited support for protocol-level headers

Not a complete list, may be inaccurate.

HTTP Server-Sent Events - the server would first need to send back a custom event type (e.g. event: metadata). The data associated with this event can look like whatever we want. If we want to send the metadata WITH the data, we must use custom encoding logic.
WebSockets - the server will send headers back with the initial handshake, but will not send headers per message.
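For Server-Sent Events, a "metadata" event followed by a "data" event could be formatted along these lines. The helper name and the metadata keys are hypothetical. Note that an SSE stream is UTF-8 text, so this sketch only covers textual payloads:

```python
import json


def format_sse(event: str, data: str) -> str:
    """Format one Server-Sent Event; multi-line data becomes several data: lines."""
    lines = [f"event: {event}"]
    lines.extend(f"data: {line}" for line in data.split("\n"))
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


# Hypothetical metadata keys; the metadata event is sent before the data event.
metadata_event = format_sse("metadata", json.dumps({"content_type": "text/csv"}))
data_event = format_sse("data", "a,b\n1,2")
stream_chunk = metadata_event + data_event
```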

Proposed action items

[x] Only use Pydantic for serializing and deserializing application/json Content-Types in our own library. Otherwise, we just verify that the output value is in bytes/bytearray format.
[ ] Rework how we use Pydantic message classes. These are still okay for validating and serializing protocol-level messages, but there needs to be custom logic for each protocol we support.
[ ] Either drop support for protocols which don't support protocol-level headers, or write our own encoder/decoder (do NOT use JSON to do this).
[x] Add some custom validation logic for Content-Types - this is currently the only message header field where we need to allow complete flexibility. I would generally suggest that for any Content-Type other than application/json, we require the input/output fields to be either bytes or bytearray (note that str assumes a UTF-8 encoding, and valid UTF-8 objects should always be serializable as JSON already). Users will need to perform the appropriate conversions with their preferred library - I do not think it would be a good idea to include tons of different libraries in the INTERSECT-SDK for binary formats. This still allows us to generate a valid JSON schema. Do not allow users to specify non-printable characters in any Content-Type definition. (This is an interesting discussion regarding a media type regex, if we want to further restrict Content-Types.)
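The Content-Type validation described in the last item could be sketched as follows. The function names are hypothetical, and JSON values are simply passed through here, on the assumption that Pydantic handles them elsewhere:

```python
from typing import Any


def validate_content_type(content_type: str) -> str:
    """Reject empty Content-Types and ones containing non-printable characters."""
    if not content_type or not content_type.isprintable():
        raise ValueError(f"invalid Content-Type: {content_type!r}")
    return content_type


def check_payload(content_type: str, value: Any) -> None:
    """Raise if the value is not acceptable for the declared Content-Type."""
    validate_content_type(content_type)
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json":
        # Assumption: JSON values are validated by Pydantic, not here.
        return
    if not isinstance(value, (bytes, bytearray)):
        raise TypeError(
            f"{content_type} requires bytes or bytearray, got {type(value).__name__}"
        )


check_payload("image/png", b"\x89PNG")            # accepted
check_payload("application/json", {"x": 1})       # accepted, left to Pydantic
```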

Note that these changes should be considered breaking.

INTERSECT-SDK / python-sdk