Currently, we detect removed machines by comparing two lists: _previously_cachedlabels, which holds the machines seen in the previous collection job, and _currently_cachedlabels, which holds the machines found in the current collection job. Any machine that appears in the previous job but not in the current one is considered removed.
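
The comparison amounts to a set difference. A minimal sketch, assuming the cached labels are plain machine names (the helper name and the machine names here are hypothetical, not taken from the collector):

```python
def find_removed(previous_labels, current_labels):
    """Return labels seen in the previous job but missing from the current one."""
    return set(previous_labels) - set(current_labels)


# Hypothetical machine names, for illustration only.
previously_cached = ["machine-a", "machine-b", "machine-c"]
currently_cached = ["machine-a", "machine-c"]

removed = find_removed(previously_cached, currently_cached)
```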
This approach works for most scenarios, but we observed a subtle error in an edge case: if the removed machines are not successfully deleted from the registry at the time the collector records the count discrepancy, their labelsets remain in the registry and become unknown to the collector once _previously_cachedlabels is overwritten at the beginning of the next collection job.
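
The edge case can be reproduced with a toy model. This is only a sketch: the registry is modeled as a plain set, and the class, its attributes, and the delete_succeeds flag are hypothetical stand-ins for the real collector and for a failed deletion.

```python
class CachingCollector:
    """Toy model of the cache-based collector; the registry is a plain set."""

    def __init__(self):
        self.registry = set()   # labelsets currently exported
        self.previous = set()   # stands in for _previously_cachedlabels
        self.current = set()    # stands in for _currently_cachedlabels

    def collect(self, machines, delete_succeeds=True):
        # The previous cache is overwritten at the start of every job.
        self.previous = self.current
        self.current = set(machines)
        removed = self.previous - self.current
        if delete_succeeds:
            self.registry -= removed
        self.registry |= self.current
        return removed


collector = CachingCollector()
collector.collect(["machine-a", "machine-b"])
collector.collect(["machine-a"], delete_succeeds=False)  # deletion fails here
removed_in_third_job = collector.collect(["machine-a"])  # cache no longer knows machine-b

# Labelsets still exported although the collector has forgotten them.
leaked = collector.registry - collector.current
```

In the third job the difference between the two caches is empty, so the collector never retries the deletion and the stale labelset stays in the registry.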
One way to prevent this error would be to stop caching machines in _previously_cachedlabels and instead always retrieve the labelsets from the registry and compare them to the current machine list. Whether this approach is feasible depends on whether the Prometheus Python client provides a method to get all labelsets in the registry.
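
One possible route, sketched under assumptions: the function below only requires a registry whose collect() yields metric families with a name and a list of samples, each sample carrying a labels dict. As far as we know, the client's CollectorRegistry.collect() has this shape, but that should be verified against the installed version. The metric name, label name, and the stand-in registry are hypothetical and exist only so the sketch runs without the client installed.

```python
from types import SimpleNamespace


def registered_label_values(registry, metric_name, label_name):
    """Gather every value of label_name exported under metric_name."""
    values = set()
    for family in registry.collect():
        if family.name != metric_name:
            continue
        for sample in family.samples:
            if label_name in sample.labels:
                values.add(sample.labels[label_name])
    return values


def stale_machines(registry, metric_name, label_name, current_machines):
    """Machines still present in the registry but absent from the current list."""
    registered = registered_label_values(registry, metric_name, label_name)
    return registered - set(current_machines)


class FakeRegistry:
    """Stand-in for illustration; not part of any real client library."""

    def collect(self):
        samples = [
            SimpleNamespace(labels={"machine": name})
            for name in ("machine-a", "machine-b")
        ]
        yield SimpleNamespace(name="machine_up", samples=samples)


stale = stale_machines(FakeRegistry(), "machine_up", "machine", ["machine-a"])
```

Because the registry itself is the source of truth here, a deletion that fails in one job is simply retried in the next, since the stale labelset still shows up in the comparison. Note that for some metric types the family name may differ from the name the metric was created with (for example, a counter's _total suffix), so the name filter may need adjusting.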