Better Operator Failure Handling

DISCLAIMER: Wasn't sure where to file this, it somewhat seems like a bug but may be more of a feature request.

Current behavior

There's a couple things at play here that all revolve around the status of a deployed resource and the behavior on failure. I refer to ConsoleDefender in most of this issue since that is what I am working with, but I assume these issues/improvements exist/could be made on the other objects as well.

When I deploy a ConsoleDefender and it fails on one of the tasks, let's say for example "Create Defender YAML file": the operator outputs the failure log, then starts running the tasks again from the start. This can make it difficult to parse/follow the log output.

Additionally there is no ability (that I can see) to grab a status from the ConsoleDefender object, the only indication I have of failure for most tasks is through viewing the logs which as mentioned is tricky to do.

Finally, there appears to be poor handling of required values. I have dug myself into a lot of holes while trying to get an initial deployment going because I was missing one of the required pieces of the spec, but there was no validation done at "kubectl apply" time that blocked me from applying it.

Steps to reproduce

A couple scenarios to try:

The "poor validation" issue:

Deploy the operator
Deploy a ConsoleDefender that is missing the orchestrator value
Notice that on the surface from a view of your cluster everything looks fine: The ConsoleDefender got created despite having missing pieces of the spec.

The "lack of status" issue:

Deploy the operator
Deploy a ConsoleDefender that is missing the accessToken value
Notice that again on the surface everything looks fine. This time the console pod goes to running as well, which might mislead sometime to think that everything is working fine despite the admin account/license not being set up.
Validate that there is no way to check status aside from the logs (I did a yaml dump of the ConsoleDefender and there is no status field).

The "messy logs" issue:

Follow the above steps 1-4
Follow the logs of the operator pod
You should see that it continually loops repeating all of the tasks despite failure.

Possible solutions

A couple suggestions:

Exposing a status field on the ConsoleDefender that tells the current task and when necessary the most recent failed task + error "log"
Extend the current validation on CRDs to include the required construct on any fields that are required and will cause creation to fail when they don't exist
Implement a better way of handling failure for the operator: The operator should have some knowledge of what failures can be fixed with a retry and what failures should result in a pause until the ConsoleDefender is updated. Example of this: If the operator fails to deploy the license because it is invalid, it should not retry. On other failures where a retry might work, the operator should be smarter in how it retries - rather than repeating all tasks it should (in general) just retry the failed task, unless the ConsoleDefender has changed.

PaloAltoNetworks / prisma-cloud-compute-operator