flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.63k stars 623 forks source link

[Core feature] Add execution id as an available log parameter for log links UI #3696

Closed btang-stripe closed 1 year ago

btang-stripe commented 1 year ago

Motivation: Why do you think this is important?

We want to integrate other products with Flyte and be able to link to the logs. Specifically we'd like to link to wandb which requires an identifier created at runtime. Using the Flyte execution id as a wand group id would allow us to create the log link

Goal: What should the final outcome look like, ideally?

the execution id of the workflow should be available as a log parameter

Describe alternatives you've considered

Providing the ability to pass arbitrary strings as log parameters would be ideal, then we could pull the actual wandb run id instead. This, however, would not be as straight forward as passing the Flyte execution id.

Propose: Link/Inline OR Additional context

Currently the plugin machinery does not have the execution id. Talking with @kumare3 this should be updated here and here

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

welcome[bot] commented 1 year ago

Thank you for opening your first issue here! 🛠

kumare3 commented 1 year ago

Thank you for opening the issue. I really like the idea of adding execution id to the log parameters so that it can be used to power things like weights and biases directly from Flyte UI. So this would be used for groupID, what about the run-id / experiment id itself?

Also we should have a capability of turning it on and off per task. I think tasktemplate.config is a great fit for such a thing

@task(obs_config="{"wandb":"on", "...."}
def foo():
  ...

Something like that?

hamersaw commented 1 year ago

@fg91 if I recall you discussed some deeper integration with wandb - mind summarizing here?

fg91 commented 1 year ago

@fg91 if I recall you discussed some deeper integration with wandb - mind summarizing here?

I wouldn't make it specific to wandb. I described our use case here in the rfc incubator:

We would like to be able to add links to experiment tracking servers like Mlflow, Wandb, ... We typically add tags with e.g. the flyte task project, domain, version, and execution id (retrieved from the flyte context) to the runs in our experiment tracking server. If we had access to these values during link templating, we could provide links to the run/experiment in e.g. Wandb in the Flyte console. Currently we log this link and have to go to the stackdriver logs to retrieve it.

The link template could, in a general case, look for instance like this: https://my-experiment-tracking-server.com/?tag={{.executionId}}

Configuring this link template on the platform side in the helm values would solve our use case. However, the assumption is that tasks then actually create the required tag in the experiment tracking server. Before this happens, the link points nowhere. Also, not all tasks might create a wandb run but would then still have a link to wandb in the Flyte console.

It would, thus, be even nicer if one could add log links to the flyte console from the task itself. If I understood @cosmicBboy correctly in the contributors sync, he proposed something like this:

@task
def train():
    tag = flytekit.current_context().execution_id
    my_experiment_tracking_api.add_tag(tag)
    flytekit.current_context().add_log_link(f"https://my-experiment-tracking-server.com/?tag={tag}")

However, would it be tricky to make this link show up in the flyte console immediately and not only later when the task finished? For long running trainings we would like to monitor the loss curves and are looking for easy access to this link before the training ends.

If adding the link at runtime is complicated, I could also imagine the following (similar to what @kumare3 proposed above but not specific to wandb):

@task(log_links="{"wandb": f"https://my-experiment-tracking-server.com/?tag={{.executionId}}")
def foo():
  ...

Users, in their task, would then still have to make sure that this link points to something meaningful but it would be on a per-task basis and not for the entire platform.

kumare3 commented 1 year ago

@fg91 exactly I do not think it should be specific to wandb

fg91 commented 1 year ago

So this would be used for groupID, what about the run-id / experiment id itself?

I think it should be up to the user which tags/names they want to assign to their runs (aka whether they use execution id or a combination of execution id and version id or whether the project or domain is in there ... Maybe even which node in the workflow or which retry). Basically I think it would be great if all these vars that are already (partially )available in the flyte context could be used to template links.

fellhorn commented 1 year ago

Would love to see this feature! So far we worked around it by using Deck. As the link only shows up after the execution finished, a solution to show it already during execution would be great.

@task(log_links="{"wandb": f"https://my-experiment-tracking-server.com/?tag={{.executionId}}")
def foo():
  ...

This would be an elegant solution for wandb but might bring some limitations for other tracking servers. ClearML for example uses UUIDs for both the project and the experiment:

https://my.clearml.server/projects/b917e1f75f7642ab9c6eb8e1b13e2b0a/experiments/c53a21348bda41aca76f912481bb85e3/execution

One could get around this limitation by searching for a tag in all projects though: https://my.clearml.server/projects/*/experiments/?filter=tags:{.executionId}

Linking to a single run is probably not possible with the decorator approach though.

If specifying the link at runtime was possible and the link then shows up immediately, it would probably be the most versatile solution.

hamersaw commented 1 year ago

I think including additional log links the the @task decorator is far easier than using a flytekit context to add at runtime. The former would require some communication mechanism between flytekit and propeller or admin which currently does not exist, this gets far more complicated with intra-cluster deployments.

Adding logging links to the TaskTemplate would be a pretty easy add, would be happy to help anyone through the changes if this is the route we decide to take.

@btang-stripe thoughts?

vkaiser-mb commented 1 year ago

Not sure if this would be a follow up discussion or part of this: For our usecase I would love to have additionally the user_id which would allow to see which user did start the experiment. Also I wonder how I would automatically create a new experiment for each task. I would image something like:

@task(log_links="{"wandb": f"https://my-experiment-tracking-server.com/?tag={{.executionId}}")
def train():
    fqn= flytekit.current_context().task_fqn
    version = flytekit.current_context().task_id
    run_id = flytekit.current_context().execution_id
    user_id = flytekit.current_context().user_id

    experiment = my_experiment_tracking_api.get_or_create(fqn)
    experiment.version(version)
    experiment.run_id(run_id)
    experiment.user_id(user_id)

The downside is that people would still need to write this. I think it would be even more awesome if this kinda logic could be abstracted:

 @task(plugins=[wand_plugin])
def train():
    ...

def wand_plugin(task_precontext):
    fqn= task_precontext.task_fqn
    version = task_precontext.task_id
    run_id = task_precontext.execution_id
    user_id = task_precontext.user_id

    task_precontext.add_log_link(f"https://my-experiment-tracking-server.com/?tag={run_id}")

    experiment = my_experiment_tracking_api.get_or_create(fqn)
    experiment.version(version)
    experiment.run_id(run_id)
    experiment.user_id(user_id)

What do you think about it or how do you solve those challenges?

kumare3 commented 1 year ago

@vkaiser-mb have you seen the the mlflow plugin - precontext can be done, but sadly today Flyte engine cannot read partial outputs from a task. Something folks have asked about in flytedecks as well

btang-stripe commented 1 year ago

These are all great suggestions! After taking another look at the actual run link on wandb — providing just the execution id would actually still not be enough to properly link back from flyte console. We'd still be missing the wandb project (which may not match the flyte project), and would like to allow users to customize their run groups.

I think @fg91 's suggestion of adding something directly to the flytekit context would work nicely for our case as it would be flexible enough to handle effectively any url schema eg.,:

flytekit.current_context().add_log_link(f"https://my-experiment-tracking-server.com/{team}/{project_name}/groups/{execution_id}")

The url definition at the task level — while convenient — has limitations in just being able to use the parameters that are available in the log links, which is insufficient for W&B

vkaiser-mb commented 1 year ago

@kumare3 thanks for the mlflow pointer! I would totally agree that there should be a way to communicate with the backend and update the status during execution for many usecases.

But I would hope that you can generate a link to an external webapp already before the execution is started (so there is no waiting time before you can see the link). @btang-stripe where or how would you get the other variables (team, project_name) from? Would this also be availalbe before execution?

vkaiser-mb commented 1 year ago

I checked the mlfow plugin example, but if I am not getting it wrong, it just wraps the trainings function and will be executed when the task is started. I would argue that the links should be created in advance, thats why in my example above its part of the task definition. This would also avoid the problem that you would need the task to end before anything can be seen on the backend. And in the task_precontext there should be all the information that are available before executing the task which come from the backend.

Again I am not sure if this still part of this issue, but I also posted some ideas for UI costumization of mine here in the discussion part: https://github.com/flyteorg/flyte/discussions/3569

vkaiser-mb commented 1 year ago

Something I just realized and which would vote against my code proposals is that the webapp (wandb) config needs to be done from inside the task and can not be done in advance. Therefore forget about my suggestion having it in the task decorator. You probably realized that already. My personal feeling is that we would need to go the way @fg91 and @cosmicBboy suggested. Which would need some interactive commincation to the the backend so the link can popup immediately. (Which I think is anyways a very favorable feature and maybe would also allow something like streaming STDOUT)

fg91 commented 1 year ago

I think including additional log links the the @task decorator is far easier than using a flytekit context to add at runtime. The former would require some communication mechanism between flytekit and propeller or admin which currently does not exist, this gets far more complicated with intra-cluster deployments.

Summary of a discussion in the contributor's sync (25/5/2023):

Creating such a communication channel and allowing user code to directly talk to backend services would be a major and undesirable architectural change because of the potential implications on reliability.

A solution that was favoured in the discussion is to allow the user code to flush the deck while the task is still running. Currently, the deck is only uploaded to blob storage after the task has finished. This is typically too late to show links to experiment tracking servers in the Flyte console as one might want to click on them to observe the progress of the training before the training finished. If the user could trigger the upload of the deck to blob storage, one could do the following:

@task
def train():
    run = some_experiment_tracking_api.start_run()
    deck = flytekit.Deck("Summary", ...)
    deck.append(MarkdownRenderer.to_html(f"... {run.link} ..."))
    deck.flush()

This would then not appear in the logs section but would accomplish the same goal.

What do you think about this @vkaiser-mb ?

hamersaw commented 1 year ago

For another idea, if the limitation of adding a static log template to the task decorator is the limitation of log parameters, then we could extend this scheme to take parameters from elsewhere in the task definition, for example environment variables or pod labels / annotations. These can be set at the workflow or task level and then applied immediately on task start. Log links could be like:

https://my-experiment-tracking-server.com/{{ .env.team }}/{{ .label.project_name }}/groups/{{ .execution_id }}

So advantages over the previous (1) much cleaner - no 3rd party dependencies (2) available immediately on task execution, but of course not as flexible. I don't think this will work for the clearML case described above. It looks like these schemes are unable to be templatized, which seems like poor design to me. I have to admit I'm not familiar with most experiment tracking frameworks.

vkaiser-mb commented 1 year ago

@fg91 I like the idea of a flush. From UX perspective I would prefer to see a link similiar to the other loglinks already on task level and no need to press the Flyte Deck button that again opens an overlay. So I wonder if something like this would be possible:

@task
def train():
    fqn= flytekit.current_context().task_fqn
    version = flytekit.current_context().task_id
    run_id = flytekit.current_context().execution_id
    user_id = flytekit.current_context().user_id

    # Here is the interesting part
    flytekit.current_context().log_links.append(f"https://my-experiment-tracking-server.com/?run={run_id }")
    flytekit.current_context().flush
    # Or flytekit.current_context().log_links.flush()

    experiment = my_experiment_tracking_api.get_or_create(fqn)
    experiment.version(version)
    experiment.run_id(run_id)
    experiment.user_id(user_id)

I guess the idea of @hamersaw could also work in the w&bs usecase.

davidmirror-ops commented 1 year ago

2023-06-08 Contrib meeting notes: in terms of usability, option 2 presented at the meeting is preferable but potentially requires more effort. (See https://hackmd.io/gmwGOpD_TWiyw9_dJ_34Dg).

fg91 commented 1 year ago

@btang-stripe are you ok with closing this issue and instead continuing the discussion how to best implement this in the RFC incubator?

btang-stripe commented 1 year ago

Awesome — yep! let's continue the discussion there