nicolaferraro closed this issue 1 year ago
Maybe we can do something like this:
steps:
  - marshal: "{{format}}"
This should be a reference to an in/out schema; the operator can then create the properties needed to configure the data format.
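For illustration, a minimal sketch of how a Kamelet could expose such a templated marshal step (the property name and flow are assumptions, not part of the proposal above):

spec:
  definition:
    properties:
      format:
        title: Format
        type: string
        default: json
  flow:
    from:
      uri: timer:tick
      steps:
        # the operator would resolve the chosen format into a concrete
        # data format library plus its configuration properties
        - marshal: "{{format}}"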
Yes, this would avoid having to write the implementation at the runtime side, also leaving room for the user to implement custom transformations in the flow.
I like this idea and would add the possibility for:
- references to external schemas (e.g. in a schema registry)
- schemas attached to the message in transit (e.g. in a header)
Yeah, these are concerns we need to address now as well. The schema prop is set to JSONSchema currently, but we need to address other kinds of schemas, including schemas located elsewhere.
I think it's a good time to deprecate spec -> types and provide something like spec -> dataTypes, just to build a different tree and provide out-of-the-box migration from old Kamelets.
For the "schema in header" idea, I think it's a good fit for sources. We can make sure the operator passes the location of the schema in a configuration property and, in case the schema is inline, it also mounts it as a file in the pod, so that the header can always be a URL. The Kamelet runtime may also bind that property into a header. The destination (or an intermediate step) can then use that URL to do stuff.
Wdyt @lburgazzoli ?
Maybe we can use "schemes" instead of "dataTypes".
For the "schema in header", I think it's a good idea for sources. We can make sure the operator passes the location of the schema in a configuration property and, in case the schema is inline, it also mount it as a file in the pod, so that the header can always be an URL. The Kamelet runtime may also bind that property into a header. The destination (or an intermediate step) can the use that URL to do stuff.
I think we can improve data formats in general. As an example, we can define a specific scheme like:
avro:
  media-type: application/avro
  schema:
    # the avro schema inline|reference
  data-format:
    # optional, if not provided use the scheme id
    id: "avro"
    properties:
      class-name: org.apache.camel.xxx.MyClass
      compute-schema: true|false
      # ...
  dependencies:
    - camel-avro
    - mvn:org.acme/my-artifact/1.0.0
Let me have the layman in me try summarising the discussion (at least for my brain to wrap this up 🤯):
* We integrate the DataFormat concept into the Kamelet model,
* The Kamelet writer can specify a set of DataFormats, and it's up to her/him to declare a property in the Kamelet schema, and use that property in the Kamelet Flow to do the IO conversion,
* The Camel K runtime provides (auto-)configuration of DataFormats, so that the Kamelet writer can reference them by name in the Kamelet Flow,
* We need to revisit how to associate schemas to DataFormats.
Is my understanding correct?
looks correct :)
+1 then 😄!
Thinking a little bit more, I wonder if this new schema/data-format thing is something we can define as a dedicated custom resource (which we can eventually embed in the Kamelet), but it could also be something we could use for the dynamic computation of schemas.
Eventually, schema registries can watch and duck-type those resources to automatically load them.
Interesting ideas. I like the concept, and I think we should understand if/how it is possible to externalize such data formats. If the data formats are an external entity, they could be reusable and keep the Kamelet design simpler. Finally, if you think about it, a format is just a view of the data, so it makes sense for it not to be part of the logic.
Yeah, good idea to have support for multiple data types, especially since it's common in Kafka land to have Avro and JSON types.
For Kamelets it would also be good if we could generate documentation (AsciiDoc files) to use for the website / Kamelet repository. And in that documentation we can then easily grab the data types and prominently show in the docs what types are supported.
Btw, do we have any thoughts on schema-less Kamelets? For example if you just use a Kamelet to route data from one messaging system to another between queues, and don't really want/need to specify any schema, as the data is just "raw".
Yep, a schema is not always required and, to be honest, for Camel it may not even be needed (except for some components like Kafka), so it is mainly tooling-related information.
Let's do another iteration on this...
I'm thinking about your comments, and I like the idea of having stuff also as CRs. I remember some brainstorming with @lburgazzoli about how dynamic schemas may work in this model. The idea was to let Kamelets define their schemas, if known in advance, but also let KameletBindings redefine them, if needed.
DataFormats are generic in Camel, but when talking about connectors (a.k.a. Kamelets), I think it's better for the Kamelet to enumerate all the possible data formats it supports. E.g. @davsclaus was talking about sources that can only produce binary data (i.e. no data format), but there are many other examples: a "hello world" string cannot be transformed into FHIR data by simply plugging in the FHIR JSON data format, just as not all data is suitable for CSV encoding.
I also see that we're talking about formats and schemas as if they were the same thing, but even if they are related (i.e. dataFormat + Kamelet [+ Binding Properties] may imply a Schema), maybe we can do a better job in treating them as separate entities.
I think the following model may be good for the in-Kamelet specification of a "format":
kind: Kamelet
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-source
  # ...
spec:
  definition:
    properties:
      format:
        title: Format
        type: string
        enum:
          - JSON
          - Avro
        default: JSON
  # ...
  formats:
    - name: JSON
      # optional, useful in case of in/out Kamelets
      scope: out
      schema:
        mediaType: "application/json"
        data: # the JSON schema inline
        url: # alternative link to the schema
        ref: # alternative Kubernetes reference to the schema (see below)
          name: # ...
      # the source produces JSON by default, no libs or transformations needed
    - name: Avro
      schema:
        type: avro-schema
        mediaType: "application/avro"
        data: # the avro schema inline
        url: # alternative link to the schema
        ref: # alternative Kubernetes reference to the schema (see below)
          name: # ...
      dataFormat:
        # optional, but if not provided "no format" is assumed
        id: "avro"
        properties: # only if "id" is present
          class-name: org.apache.camel.xxx.MyClass
          compute-schema: true|false
          # ...
      dependencies:
        - camel:jackson
        - camel:avro
        - mvn:org.acme/my-artifact/1.0.0
You can notice the scope property, which allows defining the specific details of transformations for the input and output of a particular format. I'd not complicate life and assume that users will choose only one format using the standard format property (not an inputFormat and an outputFormat). So if I choose CSV, the Kamelet will consume and produce CSV. Anyway, the shape (schema) of the input CSV can be different from that of the output CSV (and that's described in the Kamelet).
The schema here is declared inline in the Kamelet, to make it self-contained, but we can also create a Schema CR:
kind: Schema
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: my-avro-schema
spec:
  type: avro-schema
  mediaType: application/avro
  data: # the avro schema inline
  url: # alternative URL reference
  # no, ref is forbidden here
Structure is almost the same as the inline version.
The binding can use the predefined schema:
kind: KameletBinding
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-to-channel
spec:
  source:
    kind: Kamelet
    apiVersion: camel.apache.org/v1alpha1
    name: chuck-source
    properties:
      # may have been omitted, since it's the default
      format: JSON
  sink:
    # ...
The binding above will produce objects in JSON format with the inline definition of the schema. The one below is using a custom schema:
kind: KameletBinding
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-to-channel
spec:
  source:
    kind: Kamelet
    apiVersion: camel.apache.org/v1alpha1
    name: chuck-source
    properties:
      # since there's no inline format named "my-avro", it refers to the external one
      format: Avro
      schema:
        # since it's a source, we assume this is the schema of the output
        ref:
          name: my-avro-schema
        # or alternatively also inline
        data: # ...
        url: # ...
  sink:
    # ...
This mechanism may be used also in cases where the schema can be computed dynamically before running the integration. In this case, an external entity saves the schema in a CR and references it in the KameletBinding.
For the use case of using the Schema CR to sync external entities (like registries), it's possible, but we should think more about that because of edge cases: sometimes the schema is known only at runtime and sometimes it varies from message to message. In those cases, it's the integration itself that needs to update the registries. Probably it would be cleaner if it's always the integration that updates the registry.
I think we could also have a case where we want the data format to automatically compute the schema, e.g. from a POJO, so basically a format without the schema section.
Yep, we don't need to publish each schema up-front, but for pre-computed schemas (either because they are known up-front or because they are computed before running the integration), we should store them as CRs so others can eventually consume them.
I guess there may be some confusion from a user PoV, as you can define multiple in and multiple out schemes: how do we validate that? Having an in/out format separation would allow defining such semantics and validation at the CRD level.
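Purely illustrative, such an in/out separation could look like this (key names are made up for the sake of the example):

spec:
  input:
    formats:
      - name: Avro
        # ...
  output:
    formats:
      - name: JSON
        # ...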
This may also work the other way around: if an external tool creates a CR with the schema, then Camel K can consume it without the need to generate it.
But I agree, this is low priority.
Yeah, the schema associated inline with the format was intended to be optional, present only if known in advance.
When we think about sources I think there's no confusion: the user chooses a format during binding and that's it. In general, we should validate that there's only one dataFormat per <scope, format> pair.
The problem arises when you think about sinks: a Telegram sink may accept an image, a video, a text or a structured JSON.
We can let the user choose the format at binding time for the moment. But we can also think in the future about allowing all of them, or just a subset, to be selected. In general a sink may have multiple input types, which can be disambiguated at runtime via the mediaType.
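Building on the formats section proposed above, such a sink could declare several in-scoped formats, leaving the mediaType for runtime disambiguation. A sketch with illustrative values:

formats:
  - name: Image
    scope: in
    schema:
      mediaType: "image/png"
  - name: Text
    scope: in
    schema:
      mediaType: "text/plain"
  - name: JSON
    scope: in
    schema:
      mediaType: "application/json"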
Seeking help on improving the Kamelet model before going full on the Kamelet catalog effort.
Currently the model expects that one declares the default input/output of a Kamelet in the spec -> types -> out field (for sources), and in the in field for Kamelets that consume an input. The meaning of those types is simply stated: out describes the data the Kamelet produces, in describes the data it consumes.
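A sketch of the current structure for a source (the values are illustrative):

spec:
  types:
    out:
      mediaType: application/json
      schema: # a JSON schema object describing the payload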
That has unfortunately some drawbacks, one of which is that a Kamelet must have a single data type as output (for sources) and/or a single data type as input. Many implementations of Kamelets that produce JSON data, in fact, have so far had a route snippet in the flow part that hard-codes the marshalling.
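A minimal sketch of such a snippet (assuming a plain JSON marshal step; the exact snippet may differ per Kamelet):

steps:
  - marshal:
      json: {}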
So e.g., if we go full with the Kamelet catalog and add support for them in camel-kafka-connector, I expect soon to have a salesforce-source-json and a salesforce-source-avro to overcome this limitation. But it's not ideal.
I think we should allow a Kamelet to have a default input/output format, without forcing users to use that one: they may have choices.
I was thinking of something like this.
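A minimal sketch of the idea (the dataFormat key placement and its value are assumptions based on the description that follows):

spec:
  types:
    out:
      mediaType: application/json
      # assumed key and value: tells the operator to add camel:jackson
      # and a marshalling step when used in a KameletBinding
      dataFormat: json-jackson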
The dataFormat option tells the operator to automatically add camel:jackson and the marshalling step when the Kamelet is used in a KameletBinding.
For in, this translates into the corresponding unmarshalling step, which should unmarshal to a specific (optional) class.
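A sketch of what that generated step could look like (the DSL details and the target class are assumptions):

steps:
  - unmarshal:
      json:
        library: Jackson
        unmarshal-type-name: org.acme.MyClass   # hypothetical optional target class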
In case we want this behavior to be common to KameletBinding and standard integrations, it would be better implemented at the runtime level.
Now the question is how to deal with the case of multiple input/output data types.
A possibility would be to add another level of description.
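A sketch of what that extra level could look like (key names are illustrative; the discussion above converged on a similar formats section):

spec:
  types:
    out:
      default: json
      formats:
        json:
          mediaType: application/json
          dataFormat: json-jackson
        avro:
          mediaType: application/avro
          dataFormat: avro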
That would break the current schema a bit, but it would provide more options in the future.
Having the possibility to choose, a user can specify the format option in a KameletBinding (a property name we're going to reserve, like we did for id) to select an input/output format that is different from the default (maybe including none, to obtain the original data in advanced use cases).
In case this should also work in a standard integration, we may use a dedicated syntax there too.
From the operator side, the required libraries for Avro will be added, but the runtime should enhance the route with a loader/customizer.
Wdyt @lburgazzoli, @astefanutti, @davsclaus, @squakez?