Argo CD support needs some thought:- ec_service, enabled and removed flags are confusing and don't quite work

gilesknap commented 1 month ago

We have a number of issues with the current state of our deployment repo.

ec_service is a flag that tells us if ec will show this service - this means not all services are managed by ec but it is currently required to flag services that support reading global.enabled to scale down to 0.
as pointed out above enabled only works in services that support it.
enabled overrides are scattered around in the individual apps so that there is not one place to look for them
enabled overrides in sub apps have been demonstrated to disappear under some root app refresh scenarios
removed is a little confusing, appears to overlap with enabled and gives us no way to monitor what stopped services there are (without looking in the repo)

I propose the following for review:-

we drop ec_service and removed and only support enabled
we support enabled as a flag to each service in the apps.yaml file which defaults to true
we contrive to make enabled show up in the list of parameters to the root app so you may if you wish override them with the GUI
when enabled=false the app is not rendered in the apps helm template so the whole app disappears from the cluster
we make the ec tool aware of the git repo p47-deployment and have it parse the values.yaml
this means ec can show ALL services and show them as STOPPED when enable=false
- it means that the argo cd gui does not show stopped services but I think that is OK
- in exchange, ec is restored to being useful for managing all of the services in the services repo

We would still use the argo cd cli for the rest of the commands - just add the git repo for the list of stopped services.

BUT: now ec has access to that repo I see no reason not to add write capability to that repo so that we can restore the deploy and delete functions.

@coretl @marcelldls @DiamondJoseph @ZohebShaikh Please have a look at this and let me know your thoughts. Thanks.

gilesknap commented 1 month ago

@marcelldls regarding the appearance and override of enabled flag in the root app. I believe I had it working at some point as described in the 1st 3 'proposed' bullets above so happy to try and reproduce that.

DiamondJoseph commented 1 month ago

I think DAQ's use case is to only make changes to ixx-deployment when adding a new service. Our services should always point to ixx-services main, and that repository alone can represent the state of the services. Each of our charts will have an enabled in their values.yaml and we manage it there. Just trying to have as little bespokeness as possible.

marcelldls commented 1 month ago

@gilesknap Your proposal is that any services that use our repos must implement a global.enabled?

gilesknap commented 1 month ago

@gilesknap Your proposal is that any services that use our repos must implement a global.enabled?

No. The proposal is the enabled is understood by apps/template to mean don't render that app into the sub-apps manifest, so argocd will remove it. However because ec looks at the repo it can see and show the stopped service.

gilesknap commented 1 month ago

So this means any service can be stopped (removed from the cluster) by setting enabled=false and control of stopping / starting services is entirely controlled centrally. @DiamondJoseph if you want the deployment repo to represent the state of the cluster why not let it do the starting and stopping - your approach is only needed if you really want to have the other resources deployed but scale to zero, I believe.

marcelldls commented 1 month ago

@gilesknap Your proposal is that any services that use our repos must implement a global.enabled?

No. The proposal is the enabled is understood by apps/template to mean don't render that app into the sub-apps manifest, so argocd will remove it. However because ec looks at the repo it can see and show the stopped service.

I see. This is already what removed does? Your proposal is to get rid of enabled, ec-service? In short, we drop scaling?

gilesknap commented 1 month ago

Yes we drop scaling as its a chart specific feature and I'd like to support any chart.

marcelldls commented 1 month ago

Yes we drop scaling as its a chart specific feature and I'd like to support any chart.

Im happy with the proposed change - the iteration before what we have now the replica count would be set by Override root app > Override app > app values > app chart default > service values > root service values > service chart default which was even more "scattered" then now.

The impact is pvcs must be outside of your service chart? And start up performance?

Can we revert the global in ec-services helm chart for a fixed replica count?

coretl commented 1 month ago

we make the ec tool aware of the git repo p47-deployment and have it parse the values.yaml

Does it need to know about git? Couldn't it get this info from the argo cli? If it knew from argo cli what the enabled value is, and if it is overridden, then this would tell it the difference between:

Enabled
Temporarily Enabled
Disabled
Temporarily Disabled

gilesknap commented 1 month ago

This seems like a fine idea. But.

I can't find any way to get the value of a parameter from the CLI only to set an override for one
There is still the question of enabling 'deploy' from ec

@marcelldls you know the CLI better than I, can you get a parameter value?

re the second point. I'm struggling with the sense that ec is crippled and the workflow is more laborious when you switch to argocd. Particularly noticeable when I'm intending to write a helm based tutorial first and then layer an argo cd one on top. It feels like you have hit the sweet spot with helm and then gone too far with argo cd.

If we give ec / argo cd parity of function with ec / helm, then:

we do require access to the git repo (including write access for people with permissions)
we have a quick and less error prone way to update the deployment repo
we still retain the features of argo cd brings to the mix
- complete record of the state of the beamline
- version control managed explicitly by a FOSS tool instead of inside our ec helm backend.
- nice UI which can be used for overrides as well as ec
- completely in control using just the deployment repo if you prefer not to use ec

coretl commented 1 month ago

re the second point. I'm struggling with the sense that ec is crippled and the workflow is more laborious when you switch to argocd. Particularly noticeable when I'm intending to write a helm based tutorial first and then layer an argo cd one on top. It feels like you have hit the sweet spot with helm and then gone too far with argo cd.

Good point, we like the synergy with configure-ioc, which would mean needing ec to be able to write to the p47-deployment git repo to upgrade/downgrade versions from the commanline

epics-containers / p45-deployment

Argo CD support needs some thought:- ec_service, enabled and removed flags are confusing and don't quite work #11