INTERSECT-SDK / python-sdk

Interconnected Science Ecosystem - Software Development Kit (INTERSECT-SDK)
https://intersect-python-sdk.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Rework messaging handling to directly use protocol headers, support multiple content-types #8

Open Lance-Drane opened 4 months ago

Lance-Drane commented 4 months ago

Currently, internal INTERSECT logic does not utilize protocol headers at all, but serializes metadata and the message "payload" (the actual scientific data) into a single JSON blob. So an actual "message payload" is just JSON of one of the following three types:

One immediate problem this presents arises when we want to send binary data over the wire, since not everything serializes into JSON (quick example: raw PNG images). The common solution is to encode the binary data as Base64, but this is not a viable approach for larger data, because Base64 inflates the data by roughly a third (every 3 bytes of input become 4 characters of output).
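To make the size cost concrete, here is a quick sketch using only the Python standard library. The 3 MB random buffer is an arbitrary stand-in for a binary payload such as a PNG image, not something taken from INTERSECT itself.

```python
import base64
import os

# Arbitrary stand-in for a binary payload (e.g. a raw PNG image).
raw = os.urandom(3_000_000)

# Base64 maps every 3 input bytes to 4 output characters.
encoded = base64.b64encode(raw)

overhead = len(encoded) / len(raw)
print(f"raw: {len(raw)} bytes, base64: {len(encoded)} bytes, ratio: {overhead:.3f}")
```

The ratio stays at about 4/3 regardless of payload size, so the cost grows linearly with the data being sent.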

This is where allowing users to specify the Content-Type of their events comes in handy. We can directly pass this as a protocol header in actual messages. We can also use it to auto-reject messages when their Content-Type header does not match the expected Content-Type of the request operation.
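A minimal sketch of what the auto-reject check could look like, assuming headers arrive as a plain string dictionary. The function names `media_type` and `should_accept` are hypothetical and not part of the current SDK; the comparison ignores parameters such as `charset` and treats header names case-insensitively, which is an assumption about how the check should behave.

```python
from typing import Dict


def media_type(header_value: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    return header_value.split(";", 1)[0].strip().lower()


def should_accept(headers: Dict[str, str], expected: str) -> bool:
    """Accept a message only if its Content-Type matches the expected one.

    Hypothetical helper: header names are compared case-insensitively,
    and a missing Content-Type header is treated as a mismatch.
    """
    normalized = {key.lower(): value for key, value in headers.items()}
    actual = normalized.get("content-type")
    if actual is None:
        return False
    return media_type(actual) == media_type(expected)


# An operation that expects a PNG rejects a JSON message.
print(should_accept({"Content-Type": "application/json"}, "image/png"))
```

Whether a missing header should be rejected or fall back to a default such as `application/json` is a design decision this sketch does not settle.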

Note that if the user is sending back a complex object (lists, BaseModels, dataclasses...) which contains binary data, it will probably be necessary to encode the binary value as a base64 string and use the contentMediaType property in the schema to represent the true MIME type (see the official JSON Schema docs). An alternative users could implement for events would be to first emit a custom "metadata" typed INTERSECT event, then emit the associated "data" event. The "metadata" event would need to be able to reference the "data" event. (I think that the base64-encoding approach will probably still be necessary for response messages, though.) There are some Pydantic classes which can help with this.
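The base64-in-a-complex-object approach could look roughly like the sketch below. It uses a standard-library dataclass rather than Pydantic to stay self-contained; the `ImageResult` class, its field names, and the hand-written schema are all hypothetical. The `contentEncoding` and `contentMediaType` keywords are the JSON Schema annotations for string-encoded content.

```python
import base64
import json
from dataclasses import dataclass

# Hypothetical schema for a response carrying an image plus metadata.
# The "image" property is a string on the wire, annotated with its true MIME type.
IMAGE_RESULT_SCHEMA = {
    "type": "object",
    "properties": {
        "label": {"type": "string"},
        "image": {
            "type": "string",
            "contentEncoding": "base64",
            "contentMediaType": "image/png",
        },
    },
    "required": ["label", "image"],
}


@dataclass
class ImageResult:
    label: str
    image: bytes  # raw binary in memory, base64 text on the wire

    def to_json(self) -> str:
        encoded = base64.b64encode(self.image).decode("ascii")
        return json.dumps({"label": self.label, "image": encoded})

    @classmethod
    def from_json(cls, text: str) -> "ImageResult":
        data = json.loads(text)
        # validate=True rejects characters outside the base64 alphabet.
        return cls(label=data["label"], image=base64.b64decode(data["image"], validate=True))


# Round trip using the 8-byte PNG signature as a tiny binary value.
original = ImageResult(label="detector-frame", image=b"\x89PNG\r\n\x1a\n")
restored = ImageResult.from_json(original.to_json())
print(restored == original)
```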

Protocols

We have a full list of protocols we want to support here. Here is a list of protocols officially supported by AsyncAPI; note that the protocols specification in AsyncAPI is extensible and not limited to their definitions.

Protocols which support protocol-level headers

Not a complete list, may be inaccurate.

With all of these, it makes sense to first try to use established headers - Content-Type is a common header. If we can't find a common header, we can use an X-Intersect-SDK- prefix value in the header.
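One way to express that naming rule, as a rough sketch: the mapping of metadata keys to established headers and the `message_id` key are hypothetical placeholders, and only the `X-Intersect-SDK-` prefix comes from the proposal above.

```python
# Hypothetical mapping from internal metadata keys to established header names.
ESTABLISHED_HEADERS = {
    "content_type": "Content-Type",
    "content_length": "Content-Length",
}

CUSTOM_PREFIX = "X-Intersect-SDK-"


def header_name(metadata_key: str) -> str:
    """Prefer an established header; otherwise build a prefixed custom one."""
    established = ESTABLISHED_HEADERS.get(metadata_key)
    if established is not None:
        return established
    suffix = "-".join(part.capitalize() for part in metadata_key.split("_"))
    return CUSTOM_PREFIX + suffix


print(header_name("content_type"))
print(header_name("message_id"))
```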

Protocols which do NOT support protocol-level headers

Not a complete list, may be inaccurate.

Note that while it's not impossible to communicate across these protocols with non-JSON data, we would need to create our own header encoding logic. Since all metadata in headers can already be expressed as printable UTF-8 strings, the only binary data should be in the message payload itself. We can prohibit non-printable characters in header keys and values, and use specific control characters as separators for the various parts of the message.
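A minimal sketch of such a framing scheme, assuming the ASCII unit, record, and group separators are the chosen control characters (the actual choice is undecided, and the function names are hypothetical). Because header keys and values are restricted to printable characters, the separators can never appear inside the header block, so the first group separator always marks where the binary payload begins.

```python
from typing import Dict, Tuple

UNIT_SEP = b"\x1f"    # between a header key and its value
RECORD_SEP = b"\x1e"  # between header entries
GROUP_SEP = b"\x1d"   # between the header block and the payload


def encode_message(headers: Dict[str, str], payload: bytes) -> bytes:
    fields = []
    for key, value in headers.items():
        # Control characters are not printable, so this also rules out the separators.
        if not key or not key.isprintable() or not value.isprintable():
            raise ValueError(f"non-printable or empty header: {key!r}")
        fields.append(key.encode("utf-8") + UNIT_SEP + value.encode("utf-8"))
    return RECORD_SEP.join(fields) + GROUP_SEP + payload


def decode_message(data: bytes) -> Tuple[Dict[str, str], bytes]:
    # Split on the FIRST group separator only; the payload may contain any bytes.
    header_block, sep, payload = data.partition(GROUP_SEP)
    if not sep:
        raise ValueError("missing header/payload separator")
    headers: Dict[str, str] = {}
    if header_block:
        for field in header_block.split(RECORD_SEP):
            key, unit, value = field.partition(UNIT_SEP)
            if not unit:
                raise ValueError("malformed header entry")
            headers[key.decode("utf-8")] = value.decode("utf-8")
    return headers, payload


# The payload deliberately contains the separator bytes to show they are safe there.
frame = encode_message({"Content-Type": "image/png"}, b"\x89PNG\x1d\x1e\x1f\x00")
print(decode_message(frame))
```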

Protocols with limited support for protocol-level headers

Not a complete list, may be inaccurate.

Proposed action items

Note that these changes should be considered breaking.

Lance-Drane commented 3 months ago

Note that most of this discussion applies only to the workflows where we send the data directly through the message instead of MINIO or another data mechanism. However, we still need to address the "Only use Pydantic for serializing and deserializing application/json Content-Types in our own library" bullet point for MINIO or other data mechanisms to fully work.