apache / camel-k

Apache Camel K is a lightweight integration platform, born on Kubernetes, with serverless superpowers
https://camel.apache.org/camel-k
Apache License 2.0

Review integration digest computation and projection to the integration container as env var #5192

Open lburgazzoli opened 7 months ago

lburgazzoli commented 7 months ago

What happened?

Since the integration digest ends up being projected as a container env var, including the traits when computing the digest has a number of side effects: for example, enabling or disabling the creation of a route or service leads to the container being restarted, which is not expected.

Steps to reproduce

No response

Relevant log output

No response

Camel K version

No response

squakez commented 7 months ago

More than a bug, I think it's the way it is actually designed; I'd mark this issue as an enhancement instead. In any case, I'm not sure how we should behave, as most trait configuration changes must trigger a reconciliation cycle. Maybe we should mark the traits which should not be taken into consideration when calculating the digest?

lburgazzoli commented 7 months ago

Triggering a reconciliation is fine, but it should not lead to a disruptive change in any resource, so maybe we need two different digests, or we need to rethink how the reconciliation is re-triggered.

squakez commented 7 months ago

Then maybe this is not affecting the Pod execution at all. Do you have any failing example, i.e., adding or removing a service without anything else being affected?

lburgazzoli commented 7 months ago

If you have an integration like this one (flow omitted because it does not matter much):

apiVersion: camel.apache.org/v1
kind: Integration
metadata:
  name: r
spec:
  flows:
    ...
  traits:
    service:
      enabled: true
      auto: false
    ingress:
      enabled: true
      auto: false

And then:

  1. deploy it
  2. change ingress.enabled to false
  3. deploy it

Then what happens is that the pod gets restarted:

➜ k get pods                                   
NAME                                READY   STATUS        RESTARTS   AGE
camel-k-operator-6897f796dd-5znkm   1/1     Running       0          22m
r-677dcbf7fc-h7wx6                  0/1     Terminating   0          22s
r-6976c6b54b-8lccc                  1/1     Running       0          2s

Then, ignoring some of the things that change as a side effect of the pod restart (uuid, dates, etc.), the only difference is in the digest:

[screenshot: diff of the generated resources, where the only change is the digest value]

In this case my expectation would be that the ingress resource would have been created without affecting the pod execution.

squakez commented 7 months ago

Okay. Indeed, it's because the env template var in the pod changes. I've checked and it seems this var is not really used anywhere. I'll try to remove it altogether and run some tests to see whether that's enough to solve this problem.

lburgazzoli commented 7 months ago

The CAMEL_K_DIGEST env var was added to force a re-deploy of the pod in case of a config map change: if you only change the flow in an Integration, the pod spec won't actually change, only the config map will, so we had to add this variable to be sure the latest routes are taken into account.
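For illustration, a minimal Go sketch of that projection (the helper name is hypothetical; only the CAMEL_K_DIGEST name comes from this thread). Because the env var is part of the Deployment's pod template, any change to the digest value forces a new ReplicaSet and therefore a pod restart, even when only a ConfigMap changed:

package example

import (
	corev1 "k8s.io/api/core/v1"
)

// withDigestEnv projects the computed integration digest onto the integration
// container as an env var (hypothetical sketch, not the actual camel-k code).
func withDigestEnv(container corev1.Container, digest string) corev1.Container {
	container.Env = append(container.Env, corev1.EnvVar{
		Name:  "CAMEL_K_DIGEST", // name taken from the discussion above
		Value: digest,
	})
	return container
}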

squakez commented 6 months ago

I had a look at this, and the "problem" is that we use the digest as a means to compare the previous status of the Integration with the actual one in order to trigger a reconciliation cycle. Ideally we should not use it at all, and we could reconcile every time an Integration changes, as the regeneration of the resources belonging to the Integration is idempotent (the digest would stay the same either way). However, we watch a lot of other objects, and the monitoring action could be called so many times that it would degrade the operator performance.

The most immediate solution would be to store and compare two digests. One (holding all the configuration we have now) would be "responsible" for triggering the reconciliation cycle when a sensitive change happens. The new one would be a digest accounting only for the configuration that also requires a Pod restart (likely sources only).

@lburgazzoli wdyt?
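For illustration, a minimal Go sketch of the two-digest idea; all names and types here are hypothetical, not the actual camel-k implementation:

package example

import (
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
)

// IntegrationSpec is a deliberately simplified stand-in for the real spec.
type IntegrationSpec struct {
	Flows  []string                     `json:"flows,omitempty"`
	Traits map[string]map[string]string `json:"traits,omitempty"`
}

func hash(v interface{}) (string, error) {
	data, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return base64.RawURLEncoding.EncodeToString(sum[:]), nil
}

// reconcileDigest covers the whole configuration: any change here should
// re-trigger the reconciliation cycle (traits, sources, ...).
func reconcileDigest(spec IntegrationSpec) (string, error) {
	return hash(spec)
}

// restartDigest covers only what actually requires a pod restart (likely
// sources only), so toggling e.g. the ingress trait does not change it.
func restartDigest(spec IntegrationSpec) (string, error) {
	return hash(spec.Flows)
}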

lburgazzoli commented 6 months ago

I wonder if we have to store them at all:

* we can optionally compute the digest of the previous and current object in the event filter (i.e. when the generation or phase have changed we don't have to compute it)

Not sure this would cover all the cases, WDYT?

squakez commented 6 months ago
> * we can optionally compute the digest of the previous and current object in the event filter (i.e. when the generation or phase have changed we don't have to compute it)

This one was an option I had evaluated but we need to discard it, because we have other objects besides the Integration. Effectively, the main problem, as I mentioned above, is that we watch a lot of other resources besides the Integration, so the only possible way to know the previous status is to have this information stored in the same resource for later comparison. When an Integration spec changes we have a certain degree of control, but when the change happens in any resource bound to the Integration (i.e. the Deployment), we need to recover such a value and make a comparison. Alternatively, we may recalculate from scratch every time, as I commented previously: "[...] we could reconcile every time an Integration changes, as the regeneration of the resources belonging to the Integration is idempotent (the digest would stay the same either way). However, we watch a lot of other objects, and the monitoring action could be called so many times that it would degrade the operator performance."

I am not able to forecast such a degradation though.
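For illustration, a sketch of what that event-filter option could look like as a controller-runtime predicate; computeDigest is a hypothetical helper and this is not existing camel-k code. As noted above, it only helps for the Integration itself, since update events coming from other watched resources carry no previous Integration state to compare against:

package example

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// computeDigest is a placeholder for hashing the relevant parts of the object.
func computeDigest(obj client.Object) string {
	return ""
}

// digestChangedPredicate skips update events whose digest did not change,
// avoiding a reconciliation when the spec is effectively the same.
var digestChangedPredicate = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		return computeDigest(e.ObjectOld) != computeDigest(e.ObjectNew)
	},
}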

lburgazzoli commented 6 months ago
> > * we can optionally compute the digest of the previous and current object in the event filter (i.e. when the generation or phase have changed we don't have to compute it)
>
> This one was an option I had evaluated but we need to discard it, because we have other objects besides the Integration. Effectively, the main problem, as I mentioned above, is that we watch a lot of other resources besides the Integration, so the only possible way to know the previous status is to have this information stored in the same resource for later comparison. When an Integration spec changes we have a certain degree of control, but when the change happens in any resource bound to the Integration (i.e. the Deployment), we need to recover such a value and make a comparison. Alternatively, we may recalculate from scratch every time, as I commented previously: "[...] we could reconcile every time an Integration changes, as the regeneration of the resources belonging to the Integration is idempotent (the digest would stay the same either way). However, we watch a lot of other objects, and the monitoring action could be called so many times that it would degrade the operator performance."
>
> I am not able to forecast such a degradation though.

Is it true that if e.g. a deployment changes we go through the same filter as the Integration resource? It seems odd.

squakez commented 6 months ago

> Is it true that if e.g. a deployment changes we go through the same filter as the Integration resource? It seems odd.

Yes. If you change a value in any Integration-managed object, the monitoring loop starts and it applies the traits for the given Integration phase (Running). In the case of a Deployment, it would bring the values back to the ones calculated by the operator.

lburgazzoli commented 6 months ago

I know that a reconciliation is triggered, but does the integration filter apply? Because if so, the previous and current integration object would have been the same (i.e. same checksum), so the filter would "skip" the event, wouldn't it?

squakez commented 6 months ago

> I know that a reconciliation is triggered, but does the integration filter apply? Because if so, the previous and current integration object would have been the same (i.e. same checksum), so the filter would "skip" the event, wouldn't it?

Yes, what would happen is that the Integration would stick in the Running phase, so only the traits expected for that execution would be applied. Conversely, if there is a change in the digest, the Integration would start again from the Initialization phase, passing through all the phases (rebuilding, if the kit no longer matches the expected configuration). If we remove the digest "optimization", we'd need to run this second flow every time.
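A very rough illustration of the behaviour described above, with hypothetical names (this is not the actual camel-k state machine):

package example

// nextPhase sketches the effect of a digest change: a mismatch sends the
// Integration back to Initialization (full flow, possibly rebuilding the kit),
// while a matching digest keeps it in its current phase, typically Running,
// where only the traits for that phase are re-applied.
func nextPhase(storedDigest, currentDigest, currentPhase string) string {
	if storedDigest != currentDigest {
		return "Initialization"
	}
	return currentPhase
}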

lburgazzoli commented 6 months ago

> > I know that a reconciliation is triggered, but does the integration filter apply? Because if so, the previous and current integration object would have been the same (i.e. same checksum), so the filter would "skip" the event, wouldn't it?
>
> Yes, what would happen is that the Integration would stick in the Running phase, so only the traits expected for that execution would be applied. Conversely, if there is a change in the digest, the Integration would start again from the Initialization phase, passing through all the phases (rebuilding, if the kit no longer matches the expected configuration). If we remove the digest "optimization", we'd need to run this second flow every time.

I don't think I follow 100%, but it's been a long time since I had a look at all those internals and I may be completely wrong, so please bear with me:

Now I understand that this may require some more radical changes, but in general I'd really like us to move toward reducing our dependency on the integration status, as it can quickly get out of date.

So @squakez, for the time being I'm fine with your proposed solution.

squakez commented 4 months ago

While working on #5491 I realized there is a very important downside of this way of detecting a drift: we cannot change the digest calculation algorithm itself without creating a disruption when upgrading from one version to another. This is because the older digest was calculated with an older version of the algorithm, and it would differ from the one produced by the new algorithm, which may have changed how the calculation is done, regardless of the Integration configuration.