GoogleCloudPlatform / pubsec-declarative-toolkit

The GCP PubSec Declarative Toolkit is a collection of declarative solutions to help you on your Journey to Google Cloud. Solutions are designed using Config Connector and deployed using Config Controller.
Apache License 2.0

The gatekeeper solution deployment by kpt won't start #761

Open jacyang2010 opened 11 months ago

jacyang2010 commented 11 months ago

Describe the bug

When deploying the gatekeeper solution with kpt by following the guide below, the deployment does not start and fails with errors saying that the required CRDs are not installed.

https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/tree/main/docs/landing-zone-v2

To Reproduce

Step 1: Install and configure the KCC Config Controller by following the guide above.

Step 2: Run the commands below to download and deploy the gatekeeper solution.

URL="https://raw.githubusercontent.com/GoogleCloudPlatform/pubsec-declarative-toolkit/main/.release-please-manifest.json"
PACKAGE="solutions/gatekeeper-policies"
VERSION=$(curl -s $URL | jq -r ".\"$PACKAGE\"")
kpt pkg get https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit.git/${PACKAGE}@${VERSION}
kpt live init gatekeeper-policies --namespace config-control
kpt fn render gatekeeper-policies
kpt live apply gatekeeper-policies --reconcile-timeout=2m --output=table

Step 3: The gatekeeper solution deployment does not start because of missing CRDs, as shown below.

Screenshot 2023-11-30 at 10 46 11 PM

Expected behavior

A reliable solution and guide should be provided so that the gatekeeper solution can be installed without errors.

cartyc commented 11 months ago

This happens because the Gatekeeper ConstraintTemplates need to be installed in the Config Controller instance before the constraint objects can reference them. kpt does not continually reconcile, so this error blocks the deployment. With config-sync, the package eventually reconciles once the ConstraintTemplates are loaded.

To work around this issue, comment out the constraint objects and run kpt live apply. Once that has succeeded, uncomment the constraints and re-run kpt live apply. The fix is being tracked in #414. That PR separates the ConstraintTemplates and the constraints into their own packages to prevent the error you are experiencing.
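
A minimal sketch of that two-step workaround as a script, instead of commenting and uncommenting by hand. It assumes the constraint objects live in their own YAML files, separate from the ConstraintTemplates; the function name `two_phase_apply` and the `APPLY_CMD` variable are hypothetical helpers, not part of kpt or of this toolkit.

```shell
#!/usr/bin/env bash
# Command used to apply the package. Overridable so the script can be
# dry-run with a stub (assumption: the same flags as in the guide).
APPLY_CMD="${APPLY_CMD:-kpt live apply --reconcile-timeout=2m --output=table}"

# two_phase_apply PKG_DIR
# Phase 1: move every file that declares a constraint object
#          (apiVersion: constraints.gatekeeper.sh/...) out of the package
#          and apply what is left, i.e. the ConstraintTemplates.
# Phase 2: put the files back and apply the full package.
two_phase_apply() {
  local pkg="$1" stash f i=0 rc=0
  local -a held=()
  stash="$(mktemp -d)"

  # Collect the files that contain constraint objects.
  while IFS= read -r f; do
    held+=("$f")
  done < <(grep -rlE --include='*.yaml' \
             '^apiVersion:[[:space:]]*constraints\.gatekeeper\.sh/' "$pkg" || true)

  # Phase 1: hide the constraint files, then apply the rest.
  for f in "${held[@]}"; do
    mv "$f" "$stash/$i"
    i=$((i + 1))
  done
  $APPLY_CMD "$pkg" || rc=$?

  # Always restore the files, even if phase 1 failed.
  i=0
  for f in "${held[@]}"; do
    mv "$stash/$i" "$f"
    i=$((i + 1))
  done
  rmdir "$stash"
  [ "$rc" -eq 0 ] || return "$rc"

  # Phase 2: apply the complete package, constraints included.
  $APPLY_CMD "$pkg"
}
```

Usage would be `two_phase_apply gatekeeper-policies`. If the second apply still reports missing types, the CRDs that Gatekeeper generates from the templates may not be ready yet, and re-running it after a short wait may help.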

jacyang2010 commented 11 months ago

A better solution might be to keep the current YAML files together and use the config.kubernetes.io/depends-on annotation to specify dependencies between resources, so that kpt knows which resources to apply first. The code would look like the example below.

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: DataLocation
metadata: # kpt-merge: /datalocation
  name: datalocation
  annotations:
    internal.kpt.dev/upstream-identifier: 'constraints.gatekeeper.sh|DataLocation|default|datalocation'
    config.kubernetes.io/depends-on: templates.gatekeeper.sh/ConstraintTemplate/datalocation # apply the ConstraintTemplate first
...

See the kpt documentation for more details: https://kpt.dev/book/06-deploying-packages/03-handling-dependencies

jacyang2010 commented 11 months ago

Oops! I just realized that when we run kpt live apply, kpt checks, right before the actual deployment, that every resource type used in the local YAML files can be found either in the cluster or as a CRD among the resources being applied. So for now we still have to move the CRDs into a separate package or another upstream package and apply them in advance, as @cartyc mentioned above.

That is to say, the depends-on solution suggested above is not well implemented and/or supported by kpt. Hopefully kpt will be enhanced in the future to support package reconciliation during kpt live apply.
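
A rough way to see in advance which types that check would trip over is to list the constraint kinds used in the local package and ask the cluster whether a matching CRD exists. The sketch below assumes that Gatekeeper names the generated CRD `<lowercased kind>.constraints.gatekeeper.sh`; the function names and the `KUBECTL` variable are hypothetical, and the YAML parsing is a simple line-based scan, not a real YAML parser.

```shell
#!/usr/bin/env bash
# kubectl binary to use; overridable for testing.
KUBECTL="${KUBECTL:-kubectl}"

# list_constraint_kinds DIR
# Prints the unique kinds of all objects under DIR whose apiVersion is in
# the constraints.gatekeeper.sh group. Only top-level "apiVersion:" and
# "kind:" lines are inspected; documents are split on "---".
list_constraint_kinds() {
  find "$1" -type f \( -name '*.yaml' -o -name '*.yml' \) -exec awk '
    function emit() {
      if (api ~ /^constraints\.gatekeeper\.sh\// && kind != "") print kind
      api = ""; kind = ""
    }
    FNR == 1      { emit() }        # new file: finish the previous document
    /^---/        { emit(); next }  # new YAML document
    /^apiVersion:/ { api = $2 }
    /^kind:/       { kind = $2 }
    END           { emit() }
  ' {} + | sort -u
}

# missing_constraint_crds DIR
# Prints the CRD names that are not present in the cluster and returns 1
# if any are missing, 0 otherwise.
missing_constraint_crds() {
  local kind crd rc=0
  for kind in $(list_constraint_kinds "$1"); do
    crd="$(printf '%s' "$kind" | tr '[:upper:]' '[:lower:]').constraints.gatekeeper.sh"
    if ! $KUBECTL get crd "$crd" >/dev/null 2>&1; then
      echo "$crd"
      rc=1
    fi
  done
  return "$rc"
}
```

Running `missing_constraint_crds gatekeeper-policies` before `kpt live apply` would then print the CRDs that still have to be created by applying the ConstraintTemplates first.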

jacyang2010 commented 11 months ago

By the way, the same problem occurs when using RootSync to deploy the gatekeeper solution, as shown below.

Screenshot 2023-12-13 at 5 58 05 PM

cartyc commented 11 months ago

Yes, that is accurate. The main difference between config-sync and using kpt directly is that config-sync eventually reconciles once the ConstraintTemplates are applied. The application shows errors for a few minutes while those objects are applied, but it ends up in a healthy state within a few minutes.

The changes proposed in #414 will help reduce this issue as well.

jacyang2010 commented 11 months ago

I waited for around an hour and the issue did not go away on its own. I had to comment out constraints.yaml to let the RootSync become healthy first, and then uncomment the constraints. It seems to me that, for a kpt package containing CRDs, Config Sync (RootSync) and kpt have the same problem with installing CRDs before other resources.

cartyc commented 11 months ago

That's odd; any time I've run this, it typically finishes in a few minutes. The only time I've run into the situation you describe with config-sync is when I've forgotten to uncomment the ConstraintTemplates. Can you confirm that the template.yaml files are present and uncommented in the git repo you are syncing with?

cartyc commented 11 months ago

The core issue with kpt has to do with some of the underlying kubectl libraries that it uses, either kyaml or client-go. I'll try to dig up the issue for better visibility for you.

cartyc commented 11 months ago

For reference, this issue is also present in other GitOps tools like Flux. It's sadly a chicken-and-egg situation, and there's no elegant way to deal with it at this time other than to order the packages or rely on eventual reconciliation to do its magic. It might be possible to write a kpt function to help handle this use case, though.
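
When ordering the packages by hand, one way to bridge the gap between the two applies is a small retry loop that waits until the generated CRD shows up before the constraints are applied. This is only a sketch: the `retry` helper is hypothetical, and the CRD name in the usage comment assumes Gatekeeper's `<lowercased kind>.constraints.gatekeeper.sh` naming.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, sleeping DELAY_SECONDS between tries.
# Returns 1 if COMMAND still fails on the last attempt.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (not executed here): after applying the ConstraintTemplates,
# wait up to about a minute for the CRD, then apply the constraints.
#   retry 10 6 kubectl get crd datalocation.constraints.gatekeeper.sh
```

Retrying `kpt live apply` on the full package would not help here, since the type check described earlier in the thread fails before anything is applied; the loop is meant for the wait between the two ordered applies.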

jacyang2010 commented 11 months ago

It finally worked after I commented out the constraints, waited until the RootSync became healthy, and then uncommented the same constraints. That technically works, but it is not an elegant or professional solution. The approach in #414 (moving the CRDs into an upstream package or an initialization package) looks like the best one so far.