nicolaferraro closed this issue 1 year ago
Maybe we can do something like this:
steps:
  - marshal: "{{format}}"
This should be a reference to an in/out schema; the operator can then create the properties needed to configure the data format.
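For illustration, a minimal sketch of how a Kamelet could expose such a templated marshal step (the property name and flow are assumptions, not part of the proposal above):

spec:
  definition:
    properties:
      format:
        title: Format
        type: string
        default: json
  flow:
    from:
      uri: timer:tick
      steps:
        # the operator would resolve the chosen format into a concrete
        # data format library plus its configuration properties
        - marshal: "{{format}}"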
Yes, this would avoid having to write the implementation at the runtime side, also leaving room for the user to implement custom transformations in the flow.
I like this idea and would add the possibility for:
- references to external schemas (e.g. in a schema registry)
- schemas attached to the message in transit (e.g. in a header)
Yeah, these are concerns we need to address now as well. The schema prop is set to JSONSchema currently, but we need to address other kinds of schemas, including schemas located elsewhere.
I think it's a good time to deprecate spec -> types and provide something like spec -> dataTypes, just to build a different tree and provide out-of-the-box migration from old Kamelets.
For the "schema in header" idea, I think it's a good fit for sources. We can make sure the operator passes the location of the schema in a configuration property and, in case the schema is inline, it also mounts it as a file in the pod, so that the header can always be a URL. The Kamelet runtime may also bind that property into a header. The destination (or an intermediate step) can then use that URL to do stuff.
Wdyt @lburgazzoli ?
Maybe we can use "schemes" instead of "dataTypes".
For the "schema in header", I think it's a good idea for sources. We can make sure the operator passes the location of the schema in a configuration property and, in case the schema is inline, it also mount it as a file in the pod, so that the header can always be an URL. The Kamelet runtime may also bind that property into a header. The destination (or an intermediate step) can the use that URL to do stuff.
I think we can improve data formats in general. As an example, we can define a specific scheme like:
avro:
  media-type: application/avro
  schema:
    # the avro schema inline|reference
  data-format:
    # optional, if not provided use the scheme id
    id: "avro"
    properties:
      class-name: org.apache.camel.xxx.MyClass
      compute-schema: true|false
      # ...
  dependencies:
    - camel-avro
    - mvn:org.acme/my-artifact/1.0.0
Let me have the layman in me try summarising the discussion (at least for my brain to wrap this up 🤯):
* We integrate the DataFormat concept into the Kamelet model,
* The Kamelet writer can specify a set of DataFormats, and it's up to her/him to declare a property in the Kamelet schema, and use that property in the Kamelet Flow to do the IO conversion,
* The Camel K runtime provides (auto-)configuration of DataFormats, so that the Kamelet writer can reference them by name in the Kamelet Flow,
* We need to revisit how to associate schemas to DataFormats.
Is my understanding correct?
looks correct :)
+1 then 😄!
Thinking a little bit more, I wonder if this new schema/data-format thing is something we can define as a dedicated custom resource (which we can eventually embed in the Kamelet), but it could also be something we could use for the dynamic computation of schemas.
Eventually, schema registries can watch and duck-type those resources to automatically load them.
Interesting ideas. I like the concept, and I think we should understand if/how it is possible to externalize such data formats. If the data formats are an external entity, they could be reusable and keep the Kamelet design simpler. Finally, if you think about it, a format is just a view of the data, so it makes sense for it not to be part of the logic.
Yeah, good idea to have support for multiple data types, especially since it's common in Kafka land to have Avro and JSON types.
For Kamelets it would also be good if we could generate documentation (AsciiDoc files) to use for the website / Kamelet repository. And in that documentation we can then easily grab the data types and prominently show in the docs what types are supported.
Btw, do we have any thoughts on schema-less Kamelets? For example if you just use a Kamelet to route data from one messaging system to another between queues, and don't really want/need to specify any schema, as the data is just "raw".
Yep, a schema is not always required and, to be honest, for Camel it may not even be needed (except for some components like Kafka), so it is mainly tooling-related information.
Let's do another iteration on this...
I'm thinking about your comments, and I like the idea of having stuff also as CRs. I remember some brainstorming with @lburgazzoli about how dynamic schemas may work in this model. The idea was to let Kamelets define their schemas, if known in advance, but also let KameletBindings redefine them, if needed.
DataFormats are generic in Camel, but when talking about connectors (a.k.a. Kamelets), I think it's better for the Kamelet to enumerate all the possible data formats it supports. E.g. @davsclaus was talking about sources that can only produce binary data (i.e. no data format), but there are many other examples: a "hello world" string cannot be transformed into FHIR data by simply plugging in the FHIR JSON data format, just as not all data is suitable for CSV encoding.
I also see that we're talking about formats and schemas as if they were the same thing, but even if they are related (i.e. dataFormat + Kamelet [+ Binding Properties] may imply a Schema), maybe we can do a better job in treating them as separate entities.
I think the following model may be good for the in-Kamelet specification of a "format":
kind: Kamelet
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-source
  # ...
spec:
  definition:
    properties:
      format:
        title: Format
        type: string
        enum:
          - JSON
          - Avro
        default: JSON
  # ...
  formats:
    - name: JSON
      # optional, useful in case of in/out Kamelets
      scope: out
      schema:
        mediaType: "application/json"
        data: # the JSON schema inline
        url: # alternative link to the schema
        ref: # alternative Kubernetes reference to the schema (see below)
          name: # ...
      # the source produces JSON by default, no libs or transformations needed
    - name: Avro
      schema:
        type: avro-schema
        mediaType: "application/avro"
        data: # the avro schema inline
        url: # alternative link to the schema
        ref: # alternative Kubernetes reference to the schema (see below)
          name: # ...
      dataFormat:
        # optional, but if not provided "no format" is assumed
        id: "avro"
        properties: # only if "id" is present
          class-name: org.apache.camel.xxx.MyClass
          compute-schema: true|false
          # ...
      dependencies:
        - camel:jackson
        - camel:avro
        - mvn:org.acme/my-artifact/1.0.0
You can notice the scope property, which allows defining the specific details of transformations for the input and output of a particular format. I'd not complicate life and assume that users will choose only one format using the standard format property (not an inputFormat and an outputFormat). So if I choose CSV, the Kamelet will consume and produce CSV. Anyway, the shape (schema) of the input CSV can be different from that of the output CSV (and that's described in the Kamelet).
The schema here is declared inline in the Kamelet, to make it self-contained, but we can also create a Schema CR:
kind: Schema
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: my-avro-schema
spec:
  type: avro-schema
  mediaType: application/avro
  data: # the avro schema inline
  url: # alternative URL reference
  # no, ref is forbidden here
Structure is almost the same as the inline version.
The binding can use the predefined schema:
kind: KameletBinding
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-to-channel
spec:
  source:
    kind: Kamelet
    apiVersion: camel.apache.org/v1alpha1
    name: chuck-source
    properties:
      # may have been omitted, since it's the default
      format: JSON
  sink:
    # ...
The binding above will produce objects in JSON format with the inline definition of the schema. The one below is using a custom schema:
kind: KameletBinding
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-to-channel
spec:
  source:
    kind: Kamelet
    apiVersion: camel.apache.org/v1alpha1
    name: chuck-source
    properties:
      # since there's no inline format named "my-avro", it refers to the external one
      format: Avro
      schema:
        # since it's a source, we assume this is the schema of the output
        ref:
          name: my-avro-schema
        # or alternatively also inline
        data: # ...
        url: # ...
  sink:
    # ...
This mechanism may be used also in cases where the schema can be computed dynamically before running the integration. In this case, an external entity saves the schema in a CR and references it in the KameletBinding.
For the use case of using the Schema CR to sync external entities (like registries), it's possible, but we should think more about that because of edge cases: sometimes the schema is known only at runtime and sometimes it varies from message to message. In those cases, it's the integration itself that needs to update the registries. Probably it would be cleaner if it's always the integration that updates the registry.
I think we could also have a case where we want the data format to automatically compute the schema, e.g. from a POJO, so basically a format without the schema section.
Yep, we don't need to publish each schema up-front, but for pre-computed schemas (either because they are known up-front or because they are computed before running the integration), we should store them as CRs so others can eventually consume them.
I guess there may be some confusion from a user PoV, as you can define multiple in and multiple out schemes: how do we validate that? Having an in/out format separation would allow defining such semantics and validation at the CRD level.
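Purely illustrative, such an in/out separation could look like this (key names are made up for the sake of the example):

spec:
  input:
    formats:
      - name: Avro
        # ...
  output:
    formats:
      - name: JSON
        # ...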
This may also work the other way around: if an external tool creates a CR with the schema, then Camel K can consume it without the need to generate it.
But I agree, this is low priority.
Yeah, the schema associated inline with the format was intended to be optional, present only if known in advance.
When we think about sources I think there's no confusion: the user chooses a format during binding and that's it. In general, we should validate that there's only one dataFormat per <scope, format> pair.
The problem arises when you think about sinks: a Telegram sink may accept an image, a video, a text or a structured JSON.
We can let the user choose the format at binding time for the moment. But we can also think in the future about allowing all of them, or just a subset, to be selected. In general a sink may have multiple input types, which can be disambiguated at runtime via the mediaType.
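Building on the formats section proposed above, such a sink could declare several in-scoped formats, leaving the mediaType for runtime disambiguation. A sketch with illustrative values:

formats:
  - name: Image
    scope: in
    schema:
      mediaType: "image/png"
  - name: Text
    scope: in
    schema:
      mediaType: "text/plain"
  - name: JSON
    scope: in
    schema:
      mediaType: "application/json"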
Seeking help on improving the Kamelet model before going full on the Kamelet catalog effort.
Currently the model expects that one declares the default input/output of a Kamelet in the spec -> types -> out field (for sources), and in the in field for Kamelets that consume an input. The meaning of those types is simply stated: out describes the data the Kamelet produces, in describes the data it consumes.
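A sketch of the current structure for a source (the values are illustrative):

spec:
  types:
    out:
      mediaType: application/json
      schema: # a JSON schema object describing the payload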
That has unfortunately some drawbacks, one of which is that a Kamelet must have a single data type as output (for sources) and/or a single data type as input. Many implementations of Kamelets that produce JSON data, in fact, have so far had a route snippet in the flow part that hard-codes the marshalling.
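A minimal sketch of such a snippet (assuming a plain JSON marshal step; the exact snippet may differ per Kamelet):

steps:
  - marshal:
      json: {}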
So e.g., if we go full with the Kamelet catalog and add support for them in camel-kafka-connector, I expect soon to have a salesforce-source-json and a salesforce-source-avro to overcome this limitation. But it's not ideal.
I think we should allow a Kamelet to have a default input/output format, without forcing users to use that one: they may have choices.
I was thinking of something like this.
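A minimal sketch of the idea (the dataFormat key placement and its value are assumptions based on the description that follows):

spec:
  types:
    out:
      mediaType: application/json
      # assumed key and value: tells the operator to add camel:jackson
      # and a marshalling step when used in a KameletBinding
      dataFormat: json-jackson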
The dataFormat option tells the operator to automatically add camel:jackson and the marshalling step when the Kamelet is used in a KameletBinding.
For in, this translates into the corresponding unmarshalling step, which should unmarshal to a specific (optional) class.
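A sketch of what that generated step could look like (the DSL details and the target class are assumptions):

steps:
  - unmarshal:
      json:
        library: Jackson
        unmarshal-type-name: org.acme.MyClass   # hypothetical optional target class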
In case we want this behavior to be common to KameletBinding and standard integrations, it would be better implemented at the runtime level.
Now the question is how to deal with the case of multiple input/output data types.
A possibility would be to add another level of description.
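A sketch of what that extra level could look like (key names are illustrative; the discussion above converged on a similar formats section):

spec:
  types:
    out:
      default: json
      formats:
        json:
          mediaType: application/json
          dataFormat: json-jackson
        avro:
          mediaType: application/avro
          dataFormat: avro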
That would break the current schema a bit, but it would provide more options in the future.
Having the possibility to choose, a user can specify the format option in a KameletBinding (a property name we're going to reserve, like we did for id) to select an input/output format that is different from the default (maybe including none, to obtain the original data in advanced use cases).
In case this should also work in a standard integration, we may use a dedicated syntax there too.
From the operator side, the required libraries for Avro will be added, but the runtime should enhance the route with a loader/customizer.
Wdyt @lburgazzoli, @astefanutti, @davsclaus, @squakez?