checkpoint-restore / checkpoint-restore-operator

Apache License 2.0

Enhancement Request: Support for Garbage Collection Policies and Notification Mechanism #1

Open Parthiba-Hazra opened 8 months ago

Parthiba-Hazra commented 8 months ago

Description:

As part of the development of the checkpoint-restore-operator for managing container checkpoints, we have identified a need for enhanced garbage collection policies and a notification mechanism. This issue tracks discussions and development efforts regarding these enhancements.

Proposed solution

1. Granular Garbage Collection Policies: Currently, the operator's proof of concept (POC) implements a global policy for garbage collection. However, it's been proposed to extend the garbage collection mechanism to support granular policies at multiple levels, including per-namespace, per-pod, and per-container policies. This enhancement would provide users with finer control over how checkpoints are managed, allowing them to define policies tailored to specific namespaces, pods, or containers based on their unique requirements and constraints.
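To make the precedence idea concrete, here is a minimal Go sketch of how layered policies could be resolved, with the most specific level winning (container over pod over namespace over the global default). The `Policy` type, its `MaxCheckpoints` field, and the map-based lookups are all hypothetical illustrations, not the operator's actual API.

```go
package main

import "fmt"

// Policy is a hypothetical retention policy; the operator's real CRD
// fields may differ.
type Policy struct {
	MaxCheckpoints int
}

// resolvePolicy picks the most specific policy that is set, falling back
// container -> pod -> namespace -> global.
func resolvePolicy(global Policy, byNamespace, byPod, byContainer map[string]Policy,
	namespace, pod, container string) Policy {
	if p, ok := byContainer[container]; ok {
		return p
	}
	if p, ok := byPod[pod]; ok {
		return p
	}
	if p, ok := byNamespace[namespace]; ok {
		return p
	}
	return global
}

func main() {
	global := Policy{MaxCheckpoints: 10}
	ns := map[string]Policy{"team-a": {MaxCheckpoints: 5}}
	pods := map[string]Policy{"web-0": {MaxCheckpoints: 3}}
	containers := map[string]Policy{}

	// Pod-level policy shadows the namespace and global ones.
	fmt.Println(resolvePolicy(global, ns, pods, containers, "team-a", "web-0", "app").MaxCheckpoints)
	// Namespace-level policy applies when no pod/container policy matches.
	fmt.Println(resolvePolicy(global, ns, pods, containers, "team-a", "db-0", "app").MaxCheckpoints)
	// Global default is the final fallback.
	fmt.Println(resolvePolicy(global, ns, pods, containers, "team-b", "x", "y").MaxCheckpoints)
}
```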

2. Custom Parameters for Garbage Collection Policy Adjustment: Additionally, there's a consideration for allowing users to include custom parameters to adjust the garbage collection policy. This enhancement would enable users to further customize the garbage collection behavior according to their specific use cases and preferences, providing flexibility and extensibility in managing container checkpoints.

3. Notification Mechanism for Garbage Collection: Another proposed enhancement is the implementation of a notification mechanism to alert administrators or operators when checkpoints are deleted as part of the garbage collection process. These notifications would include details such as which checkpoints were deleted and the reasons behind their deletion, enhancing visibility and transparency into checkpoint management activities.

This issue will serve as a central point for discussions, planning, and tracking progress related to these enhancements. Contributors are encouraged to share their thoughts, suggestions, and contributions to help shape the implementation of these features in the checkpoint-restore-operator for managing container checkpoints.

Additional context

Please feel free to provide any additional insights or considerations related to these proposed enhancements.

rst0git commented 8 months ago

@Parthiba-Hazra It might be worth looking at how the existing mechanisms for garbage collection are implemented in Kubernetes: https://kubernetes.io/docs/concepts/architecture/garbage-collection/

In particular, it is worth noting that owner references are used to keep track of dependent objects and cross-namespace references are disallowed by design.

Parthiba-Hazra commented 8 months ago

> In particular, it is worth noting that owner references are used to keep track of dependent objects and cross-namespace references are disallowed by design.

@rst0git Whenever an owner object's checkpoint is selected for garbage collection by the policy, we check whether any dependent objects' checkpoints exist; if so, we inspect each dependent checkpoint's ownerReferences.blockOwnerDeletion field, am I right? I was also wondering about dependents in the same namespace: if an owner object's checkpoint has no dependent-object checkpoints, should we, when that checkpoint is selected for garbage collection, also check the dependent objects that are present in the cluster under the same namespace?
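For what it's worth, the blockOwnerDeletion check described above could look roughly like this. The `OwnerReference` and `Dependent` types below are simplified stand-ins for the real `metav1.OwnerReference` and object metadata from client-go, used only to sketch the logic.

```go
package main

import "fmt"

// OwnerReference is a simplified mirror of metav1.OwnerReference;
// only the fields needed for this sketch are included.
type OwnerReference struct {
	UID                string
	BlockOwnerDeletion *bool
}

// Dependent stands in for any object whose metadata carries owner references.
type Dependent struct {
	Name            string
	OwnerReferences []OwnerReference
}

// hasDeletionBlocker reports whether any dependent references ownerUID
// with blockOwnerDeletion set to true, in which case the garbage
// collector should skip (or defer) deleting the owner's checkpoint.
func hasDeletionBlocker(deps []Dependent, ownerUID string) bool {
	for _, d := range deps {
		for _, ref := range d.OwnerReferences {
			if ref.UID == ownerUID && ref.BlockOwnerDeletion != nil && *ref.BlockOwnerDeletion {
				return true
			}
		}
	}
	return false
}

func main() {
	yes := true
	deps := []Dependent{
		{Name: "dep-ckpt", OwnerReferences: []OwnerReference{{UID: "owner-1", BlockOwnerDeletion: &yes}}},
	}
	fmt.Println(hasDeletionBlocker(deps, "owner-1"))
	fmt.Println(hasDeletionBlocker(deps, "owner-2"))
}
```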

Parthiba-Hazra commented 8 months ago

Hey @adrianreber @rst0git @snprajwal I've been thinking about another garbage collection policy, and I'd like to run it by you all for feedback. Recently, I've been working on the image pull-policy feature within the buildpack project.

Here's the gist of the pull policy: we introduce a pulling interval, whereby images are pulled during build or rebase according to this interval. Additionally, we maintain a JSON file that stores the image ID and the last pull time for each image. Users can set a "pruning threshold" to manage this process. The pruning logic iterates through the image entries in the JSON file whenever the threshold is reached; any entry whose last pull time precedes the pruning threshold is deleted, ensuring that the image is force-pulled on the next build or rebase.

Now, I'm considering extending a similar concept to our garbage collection policy. The idea is to store the timestamp of when a checkpoint was last used (I believe it's during checkpoint restore), and then compare it against a user-defined pruning threshold. Based on this comparison, we can decide whether to delete the checkpoint or retain it. This policy wouldn't be the default but rather a customizable option.

adrianreber commented 8 months ago

The operator currently figures out the checkpoint time by looking at the checkpoint archive. We do not really have a last-used time, because the checkpoint archive cannot directly be used for restore. At first glance, the JSON file seems unnecessary. We know when the checkpoint was created; that sounds like enough information to me.

Parthiba-Hazra commented 8 months ago

> The operator currently figures out the checkpoint time by looking at the checkpoint archive. We do not really have a last-used time, because the checkpoint archive cannot directly be used for restore. At first glance, the JSON file seems unnecessary. We know when the checkpoint was created; that sounds like enough information to me.

OK, now it's clear. I was also wondering whether it is really possible to get the last-used time; since it isn't, the JSON file is indeed not needed.

Parthiba-Hazra commented 7 months ago

I've been considering some additional features for our operator, beyond the garbage collection policies:

  1. Automatic Checkpoint Creation: Add a feature that automates the creation of checkpoints based on predefined triggers or schedules.
  2. Automatic Replacement of Checkpoints: Add an option that enables automatic replacement of old checkpoints with new ones.
  3. Storage Limit or Quota Management: Users can set storage limits per container, pod, or namespace to prevent unchecked growth.
  4. Metrics Collection and Alerting: Calculate checkpoint creation rate and monitor storage exhaustion, triggering alerts accordingly.
  5. Event Logging: Logging all checkpoint creation and deletion events for audit trails and troubleshooting.

Do these align with our automation goals? I believe these features could enhance our operator's functionality and usability, but I'd love to hear your thoughts and any additional ideas you might have. Thanks!