ga4gh / workflow-execution-service-schemas

The WES API is a standard way to run and manage portable workflows.
Apache License 2.0
82 stars 38 forks

Added new endpoint to paginate a Workflow Run task_logs #177

Closed patmagee closed 1 year ago

patmagee commented 2 years ago

An increasingly common usability problem with the WES API is that very large (wide) workflows can become next to impossible to handle when they have a large number of tasks. The current behaviour requires all of the workflow's tasks to be returned as part of the body of a RunLog response. Embedding 10,000+ tasks in the response JSON can lead to substantial performance degradation on the side of the client as well as the server.

This issue is not strictly tied to the number of tasks crossing some threshold. Depending on the WES implementation, retrieving a complete list of all the completed and ongoing tasks could be exceptionally expensive regardless of count. Providing an alternative strategy to page through tasks helps mitigate this problem, allowing these operations to be split up into many smaller requests.

This PR addresses the issue by introducing two new features:

  1. A new endpoint, /runs/{run_id}/tasks, which returns a TaskListResponse
  2. A new optional query param for the /runs/{run_id} endpoint

Retrieve Task Logs

For the workflow run 10002, we can retrieve the paginated list of tasks with a page_size of 2. If page_size is not defined, the response should use a default value.

GET /ga4gh/wes/v1/runs/10002/tasks?page_size=2
{
  "task_logs": [
    {
      "name": "hello_world.echo_1",
      "cmd": "echo \"Hello Some sort of String\"\n>&2 echo \"Goodbye Some sort of String\"",
      "start_time": "2020-09-02T15:05:54.571Z[UTC]",
      "end_time": "2020-09-02T15:08:58.954Z[UTC]",
      "stdout": "/logs/task/hello_world.echo/0/stdout",
      "stderr": "/logs/task/hello_world.echo/0/stderr",
      "exit_code": 0
    },
    {
      "name": "hello_world.echo_2",
      "cmd": "echo \"Hello Some sort of String\"\n>&2 echo \"Goodbye Some sort of String\"",
      "start_time": "2020-09-02T15:05:54.571Z[UTC]",
      "end_time": "2020-09-02T15:08:58.954Z[UTC]",
      "stdout": "/logs/task/hello_world.echo/0/stdout",
      "stderr": "/logs/task/hello_world.echo/0/stderr",
      "exit_code": 0
    }
  ],
  "next_page_token":"cGFnZV9zaXplPTQ7cGFnZT0y"
}
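The next_page_token is opaque to the client, but a server needs some way to pack pagination state into it. As a sketch (the "page_size=N;page=M" payload here is an illustrative choice, not part of the spec, which treats the token as opaque):

```python
import base64


def encode_page_token(page_size: int, page: int) -> str:
    """Pack pagination state into an opaque, URL-safe token.

    The internal format ("page_size=N;page=M") is an assumption for
    illustration; a server may encode whatever state it needs.
    """
    raw = f"page_size={page_size};page={page}".encode("ascii")
    return base64.b64encode(raw).decode("ascii")


def decode_page_token(token: str) -> dict:
    """Recover the pagination state from a token produced above."""
    raw = base64.b64decode(token).decode("ascii")
    return dict(part.split("=", 1) for part in raw.split(";"))
```

Under this scheme, the token shown above decodes to `page_size=4;page=2`.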

Now using the next_page_token returned by the previous request, we can fetch the next page of at most 2 tasks from the WES server for this run.

GET /ga4gh/wes/v1/runs/10002/tasks?page_token=cGFnZV9zaXplPTQ7cGFnZT0y
{
  "task_logs": [
    {
      "name": "hello_world.echo_3",
      "cmd": "echo \"Hello Some sort of String\"\n>&2 echo \"Goodbye Some sort of String\"",
      "start_time": "2020-09-02T15:05:54.571Z[UTC]",
      "end_time": "2020-09-02T15:08:58.954Z[UTC]",
      "stdout": "/logs/task/hello_world.echo/0/stdout",
      "stderr": "/logs/task/hello_world.echo/0/stderr",
      "exit_code": 0
    },
    {
      "name": "hello_world.echo_4",
      "cmd": "echo \"Hello Some sort of String\"\n>&2 echo \"Goodbye Some sort of String\"",
      "start_time": "2020-09-02T15:05:54.571Z[UTC]",
      "end_time": "2020-09-02T15:08:58.954Z[UTC]",
      "stdout": "/logs/task/hello_world.echo/0/stdout",
      "stderr": "/logs/task/hello_world.echo/0/stderr",
      "exit_code": 0
    }
  ]
}
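A client consumes this endpoint by looping until no next_page_token is returned. A minimal sketch, where `fetch_page` is a stand-in for whatever HTTP client issues the GET /ga4gh/wes/v1/runs/{run_id}/tasks request and returns the parsed JSON body:

```python
from typing import Callable, Iterator


def iterate_task_logs(
    fetch_page: Callable[[str, dict], dict],
    run_id: str,
    page_size: int = 2,
) -> Iterator[dict]:
    """Yield every task log for a run, one page at a time.

    `fetch_page(run_id, params)` is a hypothetical helper that performs
    the HTTP GET and returns the response body as a dict.
    """
    token = None
    while True:
        params = {"page_size": page_size}
        if token:
            params["page_token"] = token
        body = fetch_page(run_id, params)
        yield from body.get("task_logs", [])
        token = body.get("next_page_token")
        if not token:  # the final page omits next_page_token
            break
```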

Exclude Task Logs from RunLog

An important feature for helping with scalability is the ability to exclude task_logs from the RunLog response body. Adding the separate endpoint only makes sense if this behaviour is toggleable by the end user.

The new changes specify that omitting the exclude_task_logs query param is equivalent to explicitly passing exclude_task_logs=false. In either case, the task_logs MUST be included, preserving the current behaviour of WES: all logs are returned. This avoids breaking the API and requiring a major version bump.

When a user passes in exclude_task_logs=true, the task_logs property in the response should either be omitted entirely or set to null.

GET /ga4gh/wes/v1/runs/10002?exclude_task_logs=true
{
  "run_id": "10002",
  "request": { ... },
  "state": "RUNNING",
  "run_log": { ... },
  "task_logs": null,
  "outputs": { ... }
}
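Server-side, the toggle is a small branch when assembling the response. A sketch of that behaviour (field names mirror the RunLog schema above; `run`, a dict holding the same data, is a hypothetical internal record):

```python
def build_run_log(run: dict, exclude_task_logs: bool = False) -> dict:
    """Assemble a RunLog response body.

    Omitting exclude_task_logs (default False) preserves the current
    WES behaviour of embedding the full task log list.
    """
    body = {
        "run_id": run["run_id"],
        "request": run["request"],
        "state": run["state"],
        "run_log": run["run_log"],
        "outputs": run["outputs"],
    }
    if not exclude_task_logs:
        body["task_logs"] = run["task_logs"]  # full list, as today
    # When excluded, task_logs is simply omitted here; the proposal
    # also allows returning it as null instead.
    return body
```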
wleepang commented 2 years ago

When setting exclude_task_logs=true, rather than having "task_logs": null, could it instead point to the endpoint that lists the tasks? So:

GET /ga4gh/wes/v1/runs/10002?exclude_task_logs=true
{
  "run_id": "10002",
  "request": { ... },
  "state": "RUNNING",
  "run_log": { ... },
  "task_logs": "/ga4gh/wes/v1/runs/10002/tasks",
  "outputs": { ... }
}
patmagee commented 2 years ago

@wleepang that was something I thought of. I was not quite sure how well the polymorphism would sit with client frameworks. I would probably add a key to the run log instead, maybe task_logs_url or something similar.

wleepang commented 2 years ago

@patmagee - that's a valid point. Having the additional task_logs_url key is a good idea.

patmagee commented 2 years ago

@wleepang I added a new key to the request. I am also specifically trying to preserve current behaviour so as not to trigger a version bump

pgrosu commented 2 years ago

This is nice, but I have two questions:

1. If the task_log is collapsed, then how does one know which tasks are problematic (without then paging through a 5,000+ list at 2 per page)?
2. If one would need to paginate through 10,000+ tasks to find the problematic one -- which might be a bit much -- then wouldn't it be easier to have an endpoint that filters by problematic exit_code values and provides just a paginated list of stderr locations?

~p

patmagee commented 2 years ago

@pgrosu you raise a very valid point. Some form of filtering would be desirable, however I wanted to mimic the current pagination API as much as possible as well (which does not support arbitrary slices and filters). Maybe there is a different way to communicate failures.

If an individual task was addressable (currently it is not, and tasks do not have an id), then we could provide a GET endpoint to retrieve a single task. We could also report failed task URLs in the RunLog.

pgrosu commented 2 years ago

In theory tasks could be auto-identifiable, since you can create a hash of the static elements (data and program [maybe resources]) they represent. A map of { hash -> task/task-elements } can then be automatically generated and stored to disk.
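A minimal sketch of this hash-derived identifier idea, assuming a task's static elements are its name and command (the key set here is illustrative, not part of any schema):

```python
import hashlib
import json


def task_id(task: dict) -> str:
    """Derive a stable identifier from a task's static elements.

    Identical static elements map to the same id, regardless of dict
    key order, because the payload is canonicalized before hashing.
    """
    static = {k: task[k] for k in ("name", "cmd") if k in task}
    canonical = json.dumps(static, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```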

What if a workflow of 10,000+ tasks could in theory be considered tiny -- sounds shocking, right? Usually workflow components can run faster over time as more data is added, if things are properly cached at multiple levels of {(input_data, program) -> output_data}, except for inceptive runs. For any task, there should be a task-recovery policy that is auto-triggered, so the task would put itself in a queue requesting a combination of {data, program, resource} for completion given the error patterns. Individual errors should rarely be inspected for large workflows; rather, error-resolution policies { error_pattern(task_id) -> restart_task( update_task_elements(...) ) } should be applied so that workflows progress forward properly. This can have an associated cost hierarchy so that the maximum number of tasks are satisfied given a cost upper limit.

Ask yourself what types of experiments you could run if workflows were almost instantaneous. Now, given that -- and ever larger workflows -- how could human intervention be minimized while ensuring that on average they complete on their own? Ideally workflows would pause-and-continue/restart until the correct {data, program, resource} set is available to progress forward. Pagination is nice, but it can become a bottleneck.

Details are important, but if we limit ourselves, then how ubiquitously will our implementation be adopted? If our worry right now is about not triggering a version bump, maybe parts of our design are not proper for the kind of science we might need to be doing, right?

Hope it helps, ~p

wleepang commented 2 years ago

I think that setting a task caching and reuse policy is something that can be made in the RunWorkflow request and is probably best suited to a separate discussion.

The point of quickly identifying a failed task in a list of 10,000+ is valid and leads me to think about the API pattern of

We have this already with:

Perhaps we should consider:

This provides a relatively small response for /runs/{run_id}/tasks -- back-of-the-napkin math puts it around 1-2 MB for 10,000 tasks if returned all at once -- that can be easily filtered by key properties (i.e. status, start_time, and exit_code). Additional information, like name, cmd, stdout, stderr, etc., is relegated to individual task details retrieved from /runs/{run_id}/tasks/{task_id}.
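The filtering this lightweight listing enables can be sketched as follows, assuming each summary carries only the key properties named above (task_id, status, exit_code), with full details left to the proposed per-task endpoint:

```python
def failed_tasks(summaries: list) -> list:
    """Filter a lightweight task listing down to likely failures.

    A task counts as a failure if the engine marked it FAILED or it
    reported a non-zero exit code. Note the caveat discussed below:
    some engines (e.g. via WDL) may treat a non-zero exit code as
    success, so an engine-assigned status is the more reliable signal
    when present.
    """
    return [
        t for t in summaries
        if t.get("status") == "FAILED" or t.get("exit_code", 0) != 0
    ]
```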

patmagee commented 2 years ago

@wleepang that approach feels much more natural and aligns well with the current behaviour of WES. The only thing I am hesitant about is the necessary introduction of two new properties: task_id and status. I would hope it is a fair assumption that every engine has some sort of identifier it can use for a given task, but I could easily see it not being universal.

Status is something that I would like to have as well. Seeing the return code is helpful, but knowing whether the WES engine treats that return code as a failure is even more helpful IMO. I know in the WDL sphere, "success" can be loosely defined by allowing non-zero status codes to indicate success. Seeing a non-zero status code in this scenario really would not help you identify what is "broken" or needs to be retried. I guess, though, I have the same question: is status universal? Can most engines assign a status to a task, or is that something reserved only for a workflow?

patmagee commented 1 year ago

@wleepang I am wondering if we can take a two-step approach here to avoid introducing any hard breaking changes and get this PR merged ASAP:

  1. 1.2 release
    1. Introduce the /tasks endpoint which allows a user to strictly paginate over a list of tasks, without making the tasks addressable.
    2. Add a new property to the RunLog for task_logs_url but do not get rid of the task_logs object
    3. Issue a deprecation notice for the task_logs property in the RunLog
  2. 2.0 Release
    1. Remove the task_logs property from the RunLog
    2. Make tasks independently addressable
    3. Add an additional endpoint to retrieve a single task (/tasks/{taskId})
wleepang commented 1 year ago

@patmagee I think that is a reasonable plan of action

patmagee commented 1 year ago

@pgrosu what do you think about that approach?

pgrosu commented 1 year ago

@patmagee That sounds reasonable to me as well. I don't think many folks will create heavy dependencies on task_logs, so when we get to 2.0 the breaking change should not be too disruptive.

patmagee commented 1 year ago

@pgrosu @wleepang I have reflected the changes in the OpenAPI spec and not just the swagger. I have also tweaked the wording in a few places AND removed the exclude_task_logs query param from the /runs/{run_id} endpoint. This query param felt strange to introduce if we are marking task_logs as deprecated for the next major release anyway.

vinjana commented 1 year ago

There is only a next-page token field. Generally, when paging, and in particular with 10,000+ tasks, you may want to avoid having to iterate through all pages. Did you consider adding a field for the total task count and a way to address each page, e.g. with a second parameter such as ?page_size=2&page=4999? A use case may be lazy loading (e.g. for a dashboard when the user scrolls).
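The page-addressing scheme suggested here amounts to translating (page_size, page) into an index range. A sketch, assuming 0-indexed pages (that convention is an assumption, not something either spec defines):

```python
def page_slice(total: int, page_size: int, page: int) -> tuple:
    """Translate (page_size, page) into a [start, stop) slice.

    This is what lets a client jump straight to, say, page 4999 for
    lazy loading, instead of walking a chain of next-page tokens.
    """
    start = page * page_size
    if start >= total:
        return total, total  # past the end: empty page
    return start, min(start + page_size, total)
```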

Furthermore, I am just wondering how useful it is to have so much task-level information in the WES API. All the information that you list for the tasks (name, command, etc.) could equally be provided by the TES API. Wouldn't it be more appropriate to reuse the TES API for TES-specific information -- even in the WES API? Just a thought. A WES implementation that does not use a TES in the background could still provide this information via a TES-conformant endpoint, could it not?

uniqueg commented 1 year ago

I missed a lot of this conversation, but I strongly support @vinjana's suggestion of reusing (parts of) the TES API instead of introducing a new /tasks endpoint (very confusing)!

patmagee commented 1 year ago

I like the idea of combining the APIs to provide a richer set of information to the end user, but I feel it introduces a large amount of complexity for both the end user and the implementor. At the moment, as a user of WES, the only tasks I care about for a given workflow are those actually run in that workflow. They are conveniently located in the Run object, but that simply does not scale (hence this PR). Moving them to a separate API entirely (/ga4gh/tes/v*/tasks) introduces a new barrier for the user, as it immediately requires that they 1) understand the TES API and 2) know how to filter the tasks to include only those in the workflow they want. Additionally, embedding the TES API in the WES API feels like it could run into a versioning issue if we are not careful.

As an implementor (especially one that layers on top of other engines), it is simply not possible to have the task-level metadata required to build out a TES response for ALL tasks across all workflows at once. In the engine world, most data is subsetted at the level of the workflow, so you would at least need to slice the task-level metadata there.

I think there are potentially two different approaches we can take to align with TES a bit better.

1) Instead of being really strict about an API for the task-level data, specify that the API should be TES, but keep a task_log_uri in the Run object which points to the filtered (and paginated) set of tasks for that given run. This has the benefit of not necessarily needing to fix the TES version or require a full TES API, but it potentially makes it harder to reliably get the information if you are expecting a specific version and the API returns a different one. The implementor must implement at least a part of the TES API, but they have full control over where it lives. The challenge with this approach is that it does not follow a common convention, so the flow would always need to be: Get Run -> Extract Task Log URI -> Get Tasks, as opposed to just Get Tasks.

2) The /ga4gh/wes/runs/{id}/tasks endpoint returns a TES payload but does not implement the FULL TES API (i.e. it cannot create new TES tasks). We would probably need to fix the TES version being embedded in the API, but the user could immediately Get Tasks.

vinjana commented 1 year ago

Right, I also cannot see how one could ensure that only the tasks of the specific workflow run are displayed, because this would have to be embedded in the URL pointing to the TES-like route/endpoint, and that API probably does not provide a canned way of doing this (I admit, I am not familiar with the TES API :sweat_smile:).

Your second suggestion, to put a route and response into the WES specification that mirrors a fixed TES version's response, makes sense to me.

Giving it a second thought, I would find it important, though, to allow for page selection. Looking at the TES specification, this does not seem to be possible or planned at the moment. Sorry for not checking that before!

So maybe I can tone down my suggestion: It would be nice to find a harmonized approach to pagination in WES, TES, and possibly other GA4GH standards that also allows for page selection. I am not sure how realistic this is.

patmagee commented 1 year ago

@vinjana YES I 100% agree

a harmonized approach to pagination in WES, TES, and possibly other GA4GH standards that also allows for page selection. I am not sure how realistic this is.

This is what I would really love. At the moment this is all over the place, and the specs really feel quite separate -- even in the payloads that they return. For example, TRS returns an array of objects instead of an object with an array in it. This makes it impossible to embed additional information anywhere except headers. IMO headers are problematic to use for pagination, since you may want to pass the response body around, with a different component potentially fetching the next page of results.

But this ticket: https://github.com/ga4gh/workflow-execution-service-schemas/pull/183 also talks about pagination and pagination metadata. So I would really love to discuss that

uniqueg commented 1 year ago

Re-reading this thread, I thought about a potential alternative to the two approaches for better alignment with TES outlined by @patmagee here. Perhaps we could define a TES URI (similar to TRS URI and DRS URI) and include an optional property tes_uri in the task log schema. Of course, given that Log is used for both the workflow and task logs, we might have to define a separate schema just for the tes_uri property and then use allOf to make sure it is returned for the task logs but not the WES logs (or similar). Or we could leave it in there, call it resource_uri instead, and populate it with either a TES URI or the WES URI of the current task.

I would also like to walk back my initial opinion that I would like to see TES schemas reused here. I simply forgot/misunderstood that we already have the Log schema to represent task logs in WES and that this PR merely places them somewhere else. If we (ever) have a way of sharing schema blocks across APIs, we could still revisit the design (as well as other similar issues like pagination, run/task listings, and filtering) for a combined major version upgrade in WES and TES.

patmagee commented 1 year ago

@uniqueg would you be happy with the current approach as-is? Your proposal could easily be added at a later time in a non-breaking way, especially since there is no such thing as a TES URI at the moment.

uniqueg commented 1 year ago

Yeah sure, absolutely :)

patmagee commented 1 year ago

There does not appear to be anyone opposed to merging this into the spec as-is, so I am considering this change accepted. Final acceptance of this feature will be done during review of the next release.