[RFC] Use Update Driven Refresh for Pods

agrare commented 6 years ago

Problem

Currently, only full refresh is supported for container providers (Kubernetes/Openshift). In sufficiently large environments this refresh can take over 2 hours, which is long enough for pods/containers to be created and deleted while a refresh is running, so ManageIQ misses them completely.

Without a record of every pod that was created, policy actions cannot be run and metrics cannot be collected for chargeback.

Proposed Solution

Kubernetes supports a streaming update mechanism, /watch, which delivers changes to a registered client. There is an example in the kubeclient repo: https://github.com/abonas/kubeclient#receive-entity-updates

We propose adding a new worker (InventoryCollectorWorker) which registers for these WatchStreams, specifically for pods, and sends ManagerRefresh::Target targets with the payload to the RefreshWorker for parsing and saving. Since all updates are persisted in the queue and handled by the refresh worker, no pod will be missed.
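A rough sketch of the notice-to-target flow described above, not the actual worker code. `Target`, `target_for`, `collect_pod_updates` and the `:container_groups` association name are hypothetical stand-ins for illustration; the real ManagerRefresh::Target and MiqQueue interfaces may look different. The stream here is a plain array of hashes so the example runs without a cluster; in practice it would be whatever the kubeclient pod watch yields.

```ruby
# Hypothetical stand-in for ManagerRefresh::Target (assumption, not the real class).
Target = Struct.new(:association, :manager_ref, :payload, keyword_init: true)

# Convert one watch notice (type + pod object) into a refresh target.
def target_for(notice)
  pod = notice[:object]
  Target.new(
    association: :container_groups,                 # assumed association name for pods
    manager_ref: { ems_ref: pod.dig(:metadata, :uid) },
    payload:     { type: notice[:type], object: pod }
  )
end

# Drain a watch stream, pushing one target per notice onto the queue.
# `stream` only needs to respond to #each, `queue` only to #<<.
def collect_pod_updates(stream, queue)
  stream.each { |notice| queue << target_for(notice) }
end

# Simulated notices standing in for a real watch stream.
notices = [
  { type: "ADDED",   object: { metadata: { uid: "a1", name: "web-1" } } },
  { type: "DELETED", object: { metadata: { uid: "a1", name: "web-1" } } }
]

queue = []
collect_pod_updates(notices, queue)
queue.each { |t| puts "#{t.payload[:type]} #{t.manager_ref[:ems_ref]}" }
```

Because every notice becomes a queued target, a pod that is created and deleted between two full refreshes still leaves two records behind for the refresh worker to process.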

In addition to maintaining a record of all pods which were created and deleted, we can collect metrics on recently disconnected pods, ensuring we have metrics for these short-lived containers.

PRs

[x] ManageIQ/manageiq#16198 - This adds the base worker class for an InventoryCollectorWorker
[x] ManageIQ/manageiq-providers-kubernetes#129 - This contains the worker mixin which actually subscribes to the watches and sends the targets
[x] ManageIQ/manageiq-providers-openshift#52 - Just adds the Openshift worker based on the Kubernetes mixin
[x] ManageIQ/manageiq-providers-kubernetes#135 - Adds support for targeted pod refresh
[x] ManageIQ/manageiq-providers-openshift#54 - Openshift equivalent
[x] https://github.com/ManageIQ/manageiq/pull/16311 - Add worker classes for kubernetes and openshift

cc @Fryguy @Ladas @kbrock @simon3z

Moved from: https://github.com/ManageIQ/manageiq/issues/16240

agrare commented 6 years ago

Issues still to be worked out:

[ ] How does MiqQueue handle this many targets? Can we improve it by sending targets in batches and moving the payload to data instead of args? << @Ladas

Fryguy commented 6 years ago

@agrare As discussed offline, we should treat any realtime watcher the way we do events or the vmware watcher. That means there should probably be 2 threads: one that watches and puts the raw data on an internal in-memory queue, and a second that reads from that queue and writes to the database (probably to MiqQueue). With an internal queue you also have the advantage of batching things up, so you could write to the MiqQueue every 5 seconds instead and send the entirety of what was seen in that 5 second period. This would prevent the harsh one-by-one slamming of the MiqQueue.
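To make the two-thread idea concrete, here is a minimal sketch using only Ruby's built-in `Queue`. `BatchingCollector`, `source`, `sink` and `interval` are hypothetical names, not existing ManageIQ classes: `source` stands in for the watch stream and `sink` for whatever ends up writing to MiqQueue. A real watch stream never ends, so the shutdown handling here exists only to make the example terminate.

```ruby
# Hypothetical sketch: one thread watches, another flushes batches on a timer.
class BatchingCollector
  def initialize(source, sink, interval: 5)
    @source   = source     # responds to #each (stand-in for a watch stream)
    @sink     = sink       # responds to #<< (stand-in for a MiqQueue writer)
    @interval = interval   # seconds between flushes
    @buffer   = Queue.new  # thread-safe in-memory queue
  end

  def run
    watcher = Thread.new do
      @source.each { |notice| @buffer << notice }
      @buffer.close        # signal that no more notices will arrive
    end

    writer = Thread.new do
      until @buffer.closed? && @buffer.empty?
        sleep(@interval)
        batch = drain
        @sink << batch unless batch.empty?   # one write per interval, not per notice
      end
    end

    [watcher, writer].each(&:join)
  end

  private

  # Take everything currently buffered. Only the writer thread pops,
  # so checking empty? before a non-blocking pop is safe here.
  def drain
    batch = []
    batch << @buffer.pop(true) until @buffer.empty?
    batch
  end
end

sink = []
BatchingCollector.new(%w[ADDED MODIFIED DELETED], sink, interval: 0.1).run
puts sink.inspect
```

Each element of `sink` is one batch, so the number of database writes is bounded by the flush interval rather than by the number of pod events.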

agrare commented 6 years ago

MVP for this has been merged.

agrare commented 6 years ago

Still needs to be completed:

[ ] Store ManagerRefresh::Target payload in BinaryBlob table /cc @Ladas

ManageIQ / manageiq-design

[RFC] Use Update Driven Refresh for Pods #33
