flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.17k stars 551 forks source link

[BUG] Plugin Collector does not include group #5429

Closed trevormcguire closed 1 week ago

trevormcguire commented 1 month ago

Describe the bug

Flytepropeller tracks how many objects there for each resource type (such as TfJob, PyTorchjob, etc). However, the plugin collector go client does not include group names in the apiVerison when searching for these resources (such as kubeflow/v1), so a warning is raised failed to list objects for when attempting to list them.

https://github.com/flyteorg/flyte/blob/75b33f80dacb737a098b223ad24c2fee49351e31/flytepropeller/pkg/controller/nodes/task/k8s/plugin_collector.go#L52-L61

This seems to have been introduced when we were testing out 1.12 and can be traced back to #5238

Here is an example for SparkApplication.v1beta2

{"json":{"routine":"k8s-resource-level-monitor"},"level":"warning","msg":"failed to list objects for SparkApplication.v1beta2: failed to get API group resources: unable to retrieve the complete list of server APIs: v1beta2: the server could not find the requested resource","ts":"2024-05-29T19:25:22Z"}

Expected behavior

Resource Manager is able to correctly list custom resources

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

welcome[bot] commented 1 month ago

Thank you for opening your first issue here! 🛠

runllm[bot] commented 1 month ago

To address the issue where Flytepropeller's plugin collector fails to list objects due to missing group names in the apiVersion, you can modify the ResourceLevelMonitor to include the group names when searching for resources. This can be done by updating the collect function in the plugin collector code.

Here's an example of how you can modify the collect function to include the group names:

func (r *ResourceLevelMonitor) collect(ctx context.Context) {
    list := metav1.PartialObjectMetadataList{
        TypeMeta: metav1.TypeMeta{
            Kind:       r.gvk.Kind,
            APIVersion: fmt.Sprintf("%s/%s", r.gvk.Group, r.gvk.Version),
        },
    }
    if err := r.cache.List(ctx, &list); err != nil {
        logger.Warnf(ctx, "failed to list objects for %s.%s: %v", r.gvk.Kind, r.gvk.Version, err)
        return
    }
    // ... rest of the function
}

Would you like more details on how to implement this fix?

Sources:


This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @runllm to ask follow-up questions. See runllm.com for more!

hamersaw commented 1 month ago

@trevormcguire would you be willing to submit a PR to correctly set the API version?

trevormcguire commented 1 month ago

@trevormcguire would you be willing to submit a PR to correctly set the API version?

@hamersaw sure, will work on that