flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.17k stars 551 forks source link

Return non-0 exit code from copilot upload failure #5444

Open ddl-ebrown opened 4 weeks ago

ddl-ebrown commented 4 weeks ago

Tracking issue

Why are the changes needed?

What changes were proposed in this pull request?

How was this patch tested?

Setup process

Screenshots

Check all the applicable boxes

Related PRs

Docs link

ddl-ebrown commented 4 weeks ago

Opening this one up for discussion as I think the current behavior is confusing. Not sure if there should be richer exit codes returned / handled elsewhere -- but I don't think returning 0 right now is correct. If a file was supposed to be present and it wasn't OR if the file could not upload, those should be non-0 exit codes.

codecov[bot] commented 4 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 60.99%. Comparing base (bba8c11) to head (69a83fe).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #5444 +/- ## ======================================= Coverage 60.99% 60.99% ======================================= Files 793 793 Lines 51325 51325 ======================================= Hits 31305 31305 Misses 17136 17136 Partials 2884 2884 ``` | [Flag](https://app.codecov.io/gh/flyteorg/flyte/pull/5444/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | Coverage Δ | | |---|---|---| | [unittests-datacatalog](https://app.codecov.io/gh/flyteorg/flyte/pull/5444/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `69.31% <ø> (ø)` | | | [unittests-flyteadmin](https://app.codecov.io/gh/flyteorg/flyte/pull/5444/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `58.71% <ø> (ø)` | | | [unittests-flytecopilot](https://app.codecov.io/gh/flyteorg/flyte/pull/5444/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `17.79% <ø> (ø)` | | | [unittests-flytectl](https://app.codecov.io/gh/flyteorg/flyte/pull/5444/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `67.97% <ø> (ø)` | | | [unittests-flyteidl](https://app.codecov.io/gh/flyteorg/flyte/pull/5444/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `79.04% <ø> (ø)` | | | [unittests-flyteplugins](https://app.codecov.io/gh/flyteorg/flyte/pull/5444/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `61.81% <ø> (ø)` | | | [unittests-flytepropeller](https://app.codecov.io/gh/flyteorg/flyte/pull/5444/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `57.32% <ø> (ø)` | | | [unittests-flytestdlib](https://app.codecov.io/gh/flyteorg/flyte/pull/5444/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `65.82% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

wild-endeavor commented 1 week ago

I like the joining of the errors but personally I think the original behavior of "if the error file write succeeds, return success -- otherwise error" more if only because this is the pattern that flytekit follows for the vast majority of its tasks. I think to change this we'd also have to understand how the plugin would handle the case. An error file implies a clean exit, implying that a user error occurred rather than a system error. The case where copilot is not able to find the expected output file is an okay scenario for making it a non-0 exit code, since that's kind of like a system level error, but in the normal case I think it should still be 0.

If you really need it, maybe you can opt in via an environment variable or something... interpret all errors as exit code 1. However I'm not sure if propeller checks for an errors.pb file if that happens though, so if there's information there then it might get lost. This is kinda of an unspoken API, which I don't like. There's a contract between task and engine that's not formally spelled out. I think this is why i'm averse to changing it.

ddl-ebrown commented 5 days ago

I like the joining of the errors but personally I think the original behavior of "if the error file write succeeds, return success -- otherwise error" more if only because this is the pattern that flytekit follows for the vast majority of its tasks. I think to change this we'd also have to understand how the plugin would handle the case. An error file implies a clean exit, implying that a user error occurred rather than a system error. The case where copilot is not able to find the expected output file is an okay scenario for making it a non-0 exit code, since that's kind of like a system level error, but in the normal case I think it should still be 0.

If you really need it, maybe you can opt in via an environment variable or something... interpret all errors as exit code 1. However I'm not sure if propeller checks for an errors.pb file if that happens though, so if there's information there then it might get lost. This is kinda of an unspoken API, which I don't like. There's a contract between task and engine that's not formally spelled out. I think this is why i'm averse to changing it.

CI is green, but not sure if all your concerns are covered in the tests. Could you help me better understand the informal contract? I'm not sure which plugin you're referring to here / the semantics involved.

IMHO, exit codes are the right way to handle the differentiation between system error and user error. While I was at Puppet, we added a --detailed-exitcode switch for exactly this scenario. The initial error code implementation of 1 / 0 was too naive, and it became evident that there was a need to communicate more granular statuses back (and not swallow errors). To not break backwards compatibility, that switch was born (details at https://www.puppet.com/docs/puppet/8/man/agent#usage-notes). Terraform followed the same model (https://developer.hashicorp.com/terraform/cli/commands/plan#detailed-exitcode) as did other tools .

I think it would be pretty straightforward to add a switch and follow that sort of model if you think what's here breaks a contract somewhere?