[flytepropeller] Watch agent metadata service dynamically

Future-Outlier commented 3 weeks ago

Tracking issue

https://github.com/flyteorg/flyte/issues/3936

Why are the changes needed?

Users don't need to configure supportedTaskTypes for agent service after this PR merged. We will connect to the agent server (or agent pod) to retrieve agent metadata every 10 seconds.

The config map for agent-service will be like:

  agent-service:
    supportedTaskTypes:
      # - chatgpt
      # - sensor  
      # - spark
    #   - default_task
    #   - custom_task
    #   - sensor
    #   - airflow
    # By default, all the request will be sent to the default agent.
    defaultAgent:
      endpoint: "localhost:8000"
      insecure: true
      timeouts:
        CreateTask: 100s
        GetTask: 100s
      defaultTimeout: 100s

Architecture Diagram

For example, the lifecycle in the propeller with agent will be like below

call the watchAgent function in webapi/agent/plugin.go
send request to localhost:8000 and localhost:8001
update agentRegistry // map[taskTypeName][taskTypeVersion] => Agent
update agentService *core.AgentService // Propeller will use it to route to the correct plugin. In our case, it’s agent-service.
wait for 10 seconds p.cfg.PollInterval.Duration, and go back to 1

agent-service:
    # By default, all the request will be sent to the default agent.
    defaultAgent:
      endpoint: "localhost:8000"
      insecure: true
      timeouts:
        CreateTask: 100s
        GetTask: 100s
      defaultTimeout: 100s
    agents:
      custom_agent:
        endpoint: "localhost:8001"
        insecure: true
        timeouts:
          ExecuteTaskSync: 300s
          GetTask: 100s
        defaultTimeout: 300s
    agentForTaskTypes:
      # It will override the default agent for custom_task, which means propeller will send the request to this agent.
      - chatgpt: custom_agent
      - airflow: custom_agent

What changes were proposed in this pull request?

TBD

How was this patch tested?

single binary

Setup process

Screenshots

Check all the applicable boxes

[x] I updated the documentation accordingly.
[x] All new and existing tests passed.
[x] All commits are signed-off.

Docs link

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 56.89655% with 25 lines in your changes missing coverage. Please review.

Project coverage is 60.97%. Comparing base (5cc7f58) to head (297d575).

Files	Patch %	Lines
...yteplugins/go/tasks/plugins/webapi/agent/plugin.go	46.66%	16 Missing :warning:
...yteplugins/go/tasks/plugins/webapi/agent/client.go	64.28%	4 Missing and 1 partial :warning:
...lytepropeller/pkg/controller/nodes/task/handler.go	33.33%	2 Missing and 2 partials :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #5460 +/- ## ========================================== - Coverage 60.99% 60.97% -0.02% ========================================== Files 794 794 Lines 51475 51488 +13 ========================================== - Hits 31398 31397 -1 - Misses 17185 17199 +14 Partials 2892 2892 ``` | [Flag](https://app.codecov.io/gh/flyteorg/flyte/pull/5460/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | Coverage Δ | | |---|---|---| | [unittests-datacatalog](https://app.codecov.io/gh/flyteorg/flyte/pull/5460/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `69.31% <ø> (ø)` | | | [unittests-flyteadmin](https://app.codecov.io/gh/flyteorg/flyte/pull/5460/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `58.69% <ø> (-0.05%)` | :arrow_down: | | [unittests-flytecopilot](https://app.codecov.io/gh/flyteorg/flyte/pull/5460/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `17.79% <ø> (ø)` | | | [unittests-flytectl](https://app.codecov.io/gh/flyteorg/flyte/pull/5460/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `68.03% <ø> (ø)` | | | [unittests-flyteidl](https://app.codecov.io/gh/flyteorg/flyte/pull/5460/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `79.04% <ø> (ø)` | | | [unittests-flyteplugins](https://app.codecov.io/gh/flyteorg/flyte/pull/5460/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `61.85% <59.61%> (-0.01%)` | :arrow_down: | | [unittests-flytepropeller](https://app.codecov.io/gh/flyteorg/flyte/pull/5460/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `57.29% <33.33%> (-0.02%)` | :arrow_down: | | [unittests-flytestdlib](https://app.codecov.io/gh/flyteorg/flyte/pull/5460/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg) | `65.59% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flyteorg#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Future-Outlier commented 3 weeks ago

@honnix sorry for the late reply and thank you for your patient. After this PR is merged, the watcher is done, and you can build the image to update your propeller.

Thank you so much!

honnix commented 3 weeks ago

Thank you for working on the watcher feature! Could you please write some description to help me understand what this achieves? Thank you.

Future-Outlier commented 3 weeks ago

Thank you for working on the watcher feature! Could you please write some description to help me understand what this achieves? Thank you.

No problem, I will do it, thank you!

Future-Outlier commented 3 weeks ago

Update: Facing issue with race condition, will try to solve it then we can merge it!!

Future-Outlier commented 3 weeks ago

Hi @EngHabu,

The agent watcher can now dynamically update the supported task types for agent-service.

Kevin and I discovered that we need to perform two tasks concurrently:

Update the supported task types for agent-service.
Check if the task type from TaskMetadata is in the supported task types.

Therefore, we need to use mutex lock in our code to handle these operations.

Thank you for reading this and have a nice weekend!

Future-Outlier commented 2 weeks ago

Hi, @honnix, can you take a look at this PR? The implementation might change, but could you help review it to check for potential issues? Thank you!

honnix commented 2 weeks ago

@Future-Outlier I didn't look into details of the change, but overall it's a great feature to be able to refresh agent capability at runtime without restarting propeller.

Future-Outlier commented 2 weeks ago

Hello, @EngHabu. Can you take a look? Kevin and I will fix it ASAP if any changes are needed. Thank you!

Future-Outlier commented 2 weeks ago

Hi, @honnix Before we merge this PR, can you help test this one in your scenario? I can help you build the image you need, here are the actionable steps.

checkout this branch
Use this Dockerfile to build a new image, replace the flytepropeller's deployment and restart the flytepropeller https://github.com/flyteorg/flyte/blob/master/Dockerfile.flytepropeller

honnix commented 5 days ago

Hi, @honnix Before we merge this PR, can you help test this one in your scenario? I can help you build the image you need, here are the actionable steps.

checkout this branch

Use this Dockerfile to build a new image, replace the flytepropeller's deployment and restart the flytepropeller master/Dockerfile.flytepropeller

Sorry for replying late. I was on vacation.

I unfortunately don't have capacity to take on this. I can ask around in the team.

Future-Outlier commented 5 days ago

Hi, @honnix Before we merge this PR, can you help test this one in your scenario? I can help you build the image you need, here are the actionable steps.

checkout this branch

Use this Dockerfile to build a new image, replace the flytepropeller's deployment and restart the flytepropeller master/Dockerfile.flytepropeller

Sorry for replying late. I was on vacation.

I unfortunately don't have capacity to take on this. I can ask around in the team.

No problem, I will be on vacation too! Have a nice vacation!

flyteorg / flyte