flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.81k stars 661 forks source link

[Core feature] Remove sys_ptrace dependency in raw container task #2162

Open wild-endeavor opened 2 years ago

wild-endeavor commented 2 years ago

Motivation: Why do you think this is important?

We've been getting more questions like this in the open-source community.

Is there a way we can improve the operation a bit. Why do we need to have special permissions?

Getting inputs into the system: Run a light-weight image as an init container - it should do nothing except run flytecopilot or something and just download the inputs.pb file and deserialize them. (Existing behavior where primitives are unpacked into {{ .inputs.x }} should also continue.)

Getting outputs out of the system: The user-container should write one of the following:

The sidecar container should continue to be there. But instead of monitoring the process, it should monitor the outputs folder, which should be cross-mounted.

After it detects one of the two output files, it should begin a timer. After some amount of time, if an empty file named SUCCESS isn't present, it should fail the task.

Once the SUCCESS file is present, it should (serialize and) upload the file to the right location.

If the task doesn't have an output it should still write the SUCCESS file to indicate that it's finished and copilot would upload an empty outputs.pb file (if that's what the system needs).

After copilot is done uploading the outputs, the process should exit. The raw container plugin should be changed to monitor the sidecar container (if it doesn't already). Usual task-timeouts apply.

Also improve documentation a bit. Documentation should state precisely what the contract is between the raw container, and the Flyte system. If it's necessary to configure the cluster a certain way in order to run this task type we should state so clearly.

Goal: What should the final outcome look like, ideally?

see above

Describe alternatives you've considered

see above.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

kumare3 commented 2 years ago

@wild-endeavor problem is what if the process crashes? The SUCCESS file is already supported by co-pilot. It is not reliable. SYS_PTRACE allows you to know when the process has exited, potentially without writing the task.

After copilot is done uploading the outputs, the process should exit. The raw container plugin should be changed to monitor the sidecar container (if it doesn't already). Usual task-timeouts apply. this is already the case.

Also improve documentation a bit. Documentation should state precisely what the contract is between the raw container, and the Flyte system. If it's necessary to configure the cluster a certain way in order to run this task type we should state so clearly.

Already an issue for this.

EngHabu commented 2 years ago

We also need to update docs and examples if we are going with the SUCCESS file convention.

Pushing to 1.0.1 for now for further discussion.

eapolinario commented 2 years ago

@pingsutw , can you leavea comment explaining why this was closed?

pingsutw commented 2 years ago

@eapolinario Spotify is still using sys_ptrace in the raw container, so we keep it in flytecopilot. btw, we already have file access mode (Upload the literal once the SUCCESS file is present) in copilot. These two modes (file access and sys_ptrace) are configurable now.

eapolinario commented 1 month ago

Revive https://github.com/flyteorg/flyteplugins/pull/264/ and introduce the ability to configure the watcher type in the flytecopilot config.

eapolinario commented 1 month ago

What do you think of making the registration of raw containers optional?

We register it as a task type here, but we could provide a flag to control the inclusion of raw-container as a valid task type.