PaloAltoNetworks / prisma-cloud-compute-operator


Better Operator Failure Handling #11

Open mjnagel opened 3 years ago

mjnagel commented 3 years ago

DISCLAIMER: Wasn't sure where to file this, it somewhat seems like a bug but may be more of a feature request.

Current behavior

There are a couple of things at play here, all revolving around the status of a deployed resource and the operator's behavior on failure. I refer to ConsoleDefender in most of this issue since that is what I am working with, but I assume the same issues exist (and the same improvements could be made) on the other objects as well.

When I deploy a ConsoleDefender and it fails on one of the tasks (for example, "Create Defender YAML file"), the operator outputs the failure log, then starts running the tasks again from the start. This makes the log output difficult to parse and follow.

Additionally, there is no way (that I can see) to get a status from the ConsoleDefender object. For most tasks, the only indication of failure is in the logs, which, as mentioned, are tricky to read.

Finally, there appears to be poor handling of required values. I have dug myself into a lot of holes while trying to get an initial deployment going because I was missing one of the required pieces of the spec, but there was no validation done at "kubectl apply" time that blocked me from applying it.

Steps to reproduce

A couple scenarios to try:

The "poor validation" issue:

  1. Deploy the operator
  2. Deploy a ConsoleDefender that is missing the orchestrator value
  3. Notice that on the surface from a view of your cluster everything looks fine: The ConsoleDefender got created despite having missing pieces of the spec.
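
To illustrate the kind of pre-apply check that is missing here, the following is a minimal sketch in Python. It is not part of the operator. The key names `orchestrator` and `accessToken` are taken from this issue, but treating them as flat, required keys directly under `spec` is an assumption; the real resource may nest or name them differently.

```python
# Hypothetical list of required spec keys. Only the two values named in this
# issue are included; the real CRD may require more, or nest them differently.
REQUIRED_SPEC_KEYS = ("orchestrator", "accessToken")


def missing_required(spec, required=REQUIRED_SPEC_KEYS):
    """Return the required keys that are absent or empty in a spec dict."""
    return [key for key in required if not spec.get(key)]


def validate_console_defender(manifest):
    """Raise ValueError if the manifest's spec lacks a required value."""
    spec = manifest.get("spec") or {}
    missing = missing_required(spec)
    if missing:
        raise ValueError("ConsoleDefender spec is missing: " + ", ".join(missing))


if __name__ == "__main__":
    manifest = {"kind": "ConsoleDefender", "spec": {"accessToken": "example"}}
    try:
        validate_console_defender(manifest)
    except ValueError as err:
        print(err)
```

A check like this could run locally before "kubectl apply". Enforcing the same rule server-side would normally be done by marking the fields as required in the CRD's schema.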

The "lack of status" issue:

  1. Deploy the operator
  2. Deploy a ConsoleDefender that is missing the accessToken value
  3. Notice that again on the surface everything looks fine. This time the console pod goes to running as well, which might mislead someone into thinking that everything is working fine despite the admin account/license not being set up.
  4. Validate that there is no way to check status aside from the logs (I did a yaml dump of the ConsoleDefender and there is no status field).
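
For comparison, this is roughly what checking a status could look like if the resource exposed one. The sketch below is in Python and assumes the resource has been dumped to a dict (for example, from YAML) and that the operator would publish the common Kubernetes `status.conditions` list with `type`, `status`, and `reason` entries. The function name `summarize_status` is hypothetical.

```python
def summarize_status(resource):
    """Summarize status.conditions of a dumped resource, if there are any."""
    # Assumption: conditions follow the usual Kubernetes shape, a list of
    # dicts with "type", "status" and an optional "reason".
    conditions = (resource.get("status") or {}).get("conditions") or []
    if not conditions:
        return "no status reported"
    return "; ".join(
        "{}={} ({})".format(c.get("type"), c.get("status"), c.get("reason", "no reason"))
        for c in conditions
    )


if __name__ == "__main__":
    # A resource dumped without any status field, as described in step 4.
    print(summarize_status({"kind": "ConsoleDefender", "spec": {}}))
```

With the behavior described in this issue, the dump has no status field, so the helper can only report that nothing is available.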

The "messy logs" issue:

  1. Follow steps 1-4 of the "lack of status" scenario above
  2. Follow the logs of the operator pod
  3. You should see that it continually loops repeating all of the tasks despite failure.
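
Until the logging improves, one workaround is to filter the operator log down to the tasks that failed. The Python sketch below assumes the log contains standard Ansible stdout, where each task starts with a `TASK [name]` line and a failure is reported on a later line beginning with `fatal:`. Whether the operator's log matches this shape exactly is an assumption, and `failed_tasks` is a hypothetical helper.

```python
import re

# Matches the task header Ansible prints, e.g. "TASK [Create Defender YAML file] ***".
TASK_RE = re.compile(r"TASK \[(?P<name>.+?)\]")


def failed_tasks(log_lines):
    """Yield the name of each task that is followed by a 'fatal:' line."""
    current = None
    for line in log_lines:
        match = TASK_RE.search(line)
        if match:
            # Remember the most recent task header seen so far.
            current = match.group("name")
        elif line.lstrip().startswith("fatal:") and current is not None:
            yield current


if __name__ == "__main__":
    sample = [
        "TASK [Create Defender YAML file] ***",
        "fatal: [localhost]: FAILED! => {}",
        "TASK [Another task] ***",
        "ok: [localhost]",
    ]
    for name in failed_tasks(sample):
        print(name)
```

Because the operator repeats the whole task list after a failure, the same task name will be yielded once per loop; collecting the results into a set gives the distinct failing tasks.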

Possible solutions

A couple suggestions:

ctrought commented 2 years ago

I have the same thoughts. I am not sure whether some of these pain points result from it being an Ansible-based operator, with less flexibility available within that framework than in a full golang one (which are much more common from what I have observed), or whether it is just in its early days of development. We definitely look forward to seeing continued development activity and improvements made to this operator.