aai-institute / jobq

https://aai-institute.github.io/jobq/latest
Apache License 2.0
2 stars 1 forks source link

Warn in `jobby list` if there are failed pods to a Kueue workload #89

Closed nicholasjng closed 2 months ago

nicholasjng commented 2 months ago

As per title. We only warn if the job hasn't already failed, in which case you can inspect the job directly to debug.

Addresses the final point for #86, at least for Kueue jobs.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 75.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 56.42%. Comparing base (af9a3cd) to head (4accad5). Report is 3 commits behind head on main.

:white_check_mark: All tests successful. No failed tests found.

Files with missing lines Patch % Lines
client/src/cli/commands/list.py 0.00% 2 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #89 +/- ## ========================================== + Coverage 56.06% 56.42% +0.35% ========================================== Files 61 61 Lines 3018 2997 -21 ========================================== - Hits 1692 1691 -1 + Misses 1326 1306 -20 ``` | [Flag](https://app.codecov.io/gh/aai-institute/infrastructure-product/pull/89/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aai-institute) | Coverage Δ | | |---|---|---| | [backend](https://app.codecov.io/gh/aai-institute/infrastructure-product/pull/89/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aai-institute) | `88.27% <100.00%> (+0.09%)` | :arrow_up: | | [client](https://app.codecov.io/gh/aai-institute/infrastructure-product/pull/89/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aai-institute) | `47.92% <33.33%> (+0.27%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aai-institute#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

nicholasjng commented 2 months ago

Summary: I list all pods for a workload, any failed pod associated with it will have status.phase=='Failed' attached to it.

Technically, this is portable across job types in the sense that any failed pod associated with a cluster resource should trigger an alarm (if, for example, the kuberay operator were to fail, we would get a notifier here as well.)

This requires a working connection to the k8s API server, which makes it fail in CI if not mocked away. (I'm not sure why that is, but maybe you can chime in here.)

AdrianoKF commented 2 months ago

This requires a working connection to the k8s API server, which makes it fail in CI if not mocked away. (I'm not sure why that is, but maybe you can chime in here.)

I'm not surprised it needs a mock, since the endpoint accesses the pods property of the workload, which calls out to the k8s API. I guess one way is to mock the has_failed_pods property as you did, or you could reach for mocking the pods property, so that derived attributes can be computed from it.

In the long run, we would benefit from introducing a few test fixtures to produce mock workloads, and not repeat that logic across the individual tests.