cruise-automation / rbacsync

Automatically sync groups into Kubernetes RBAC
Apache License 2.0
240 stars 33 forks source link

RBACSync should not remove rolebindings when G Suite API fails #7

Closed dilyevsky closed 5 years ago

dilyevsky commented 5 years ago

Current Behavior:

When controller fails to obtain group members from group.Grouper due an error (e.g G Suite is down or unreachable), it will remove all associated rolebindings from all namespaces. Relevant line is https://github.com/cruise-automation/rbacsync/blob/50c6f451d448ad7ab26254c6fee70ab4f8ef5f61/pkg/controller/controller.go#L358 This failure mode was also observed in the wild.

Expected Behavior:

G Suite API errors should NOT cause removal of rolebindings that already exist for the members. They should only be removed if we can positively confirm the member is no longer in the group. Otherwise this may cause production outages when G Suite API suddenly becomes unavailable due to misconfiguration/connectivity issue/etc.

@stevvooe raised the concern that this may prevent self-healing in case attacker somehow escalates their privileges and creates an "out-of-band" rolebinding. Then they can disable G Suite API and prevent RBACSync from restoring order. I'd argue that we should tradeoff potentially heightened security for reliability because:

Proposed Solution:

Only evict rolebindings when we can positively confirm the associated members aren't in the group members list anymore (by issuing successful API call and examining membership list). Export Prometheus metrics for G Suite errors.

dustin-decker commented 5 years ago

I think the proposed solution is a good trade-off of security for reliability. There will certainly be blips in availability of the API in the future. There is definitely lower hanging fruit than the update+DOS attack.

stevvooe commented 5 years ago

Thanks for the report!

Another possibility for a fix would be a cache to protect against momentary upstream unavailability. We can have it stay operational for a period then eventually timeout the cache.

I also feel like we should report the recency of group updates on the status field of the config objects so that it can be monitored.

dilyevsky commented 5 years ago

For discoverability, the error that was reported by RBACSync pod:

Event(v1.ObjectReference{Kind:"RBACSyncConfig", Namespace:"default", Name:"namespace-bindings", UID:"b76e9626-0255-11e9-bfd1", APIVersion:"rbacsync.example.com/v1alpha", ResourceVersion:"37596907", FieldPath:""}): type: 'Warning' reason: 'BindingError' group gke-user@example.com lookup failed: unknown error fetching group members: googleapi: Error 403: Not Authorized to access this resource/api, forbidden
stevvooe commented 5 years ago

Do we have a proposed error behavior for the set of error codes? We need to differentiate permanent errors from temporary access errors. For example, what would be the behavior of a 404 vs a 403? Are there are cases where the upstream wants to remove access to the group information and have that reflected in rbacsync? We don't want to create cases where we can't control the behavior.

stevvooe commented 5 years ago

Note that we saw this behavior under timeouts, as well:

Event(v1.ObjectReference{Kind:"RBACSyncConfig", Namespace:"default", Name:"namespace-bindings", UID:"ff3f6641-0253-11e9-9985-42010ad4000b", APIVersion:"rbacsync.getcruise.com/v1alpha", ResourceVersion:"34555077", FieldPath:""}): type: 'Warning' reason: 'BindingError' group gke-group@getcruise.com lookup failed: unknown error fetching group members: context deadline exceeded
stevvooe commented 5 years ago

Closed by #14.