feat: Add extra fields to RunStatus to provide additional information

patmagee commented 2 years ago

The RunStatus object is quite minimal, and does not provide a lot of information to end users, requiring them to make additional requests to the /runs/{id} endpoint multiple times to find basic info about their runs. Additionally, when building tooling to support interaction with the WES API, it is very limiting to only be able to display two fields to a user.

This Pull request adds 4 fields to the RunStatus object which should all be readily available to the implementor. These fields were chosen because of their ability to help a user uniquely identify individual runs, and see how they progress. I attempted to strike a balance between response size and utility.

start_time - Important for sorting, but also serves as a visual reminder to an end user to help provide context of
end_time - Important for sorting. Helps quickly determine how long a run took which when viewed in relation to other runs is generally very helpful
workflow_url - It is often very helpful to know what workflow is being run. it is challenging to correlate a run_id with a specific run without having the prompt of the workflow name
tags - This is potentially something that we can exclude, however it is really really helpful to have additional context when listing runs. This parameter alone helps build powerful UI's and CLI's and will help with adoption

{
    "runs": [
        {
            "run_id": "01feb950-8e5e-405d-8804-43239171a52e",
            "state": "QUEUED",
            "workflow_url": "align.wdl",
            "tags": {
                "sample": "1"
            }
        },
        {
            "run_id": "47ec3102-5ca0-4aa8-bd24-a54cca629319",
            "state": "RUNNING",
            "workflow_url": "hello_world.wdl",
            "start_time": "2021-09-30T12:00:00Z",
            "tags": {
                "sample": "2"
            }
        },
        {
            "run_id": "c42c54fc-4de2-431d-a9b6-9584885f4f40",
            "state": "COMPLETED",
            "workflow_url": "hello_world.wdl",
            "start_time": "2021-09-29T12:00:00Z",
            "end_time": "2021-09-30T12:00:00Z",
            "tags": {
                "sample": "3"
            }
        }
    ]
}

uniqueg commented 2 years ago

In principle, I like this, but I think it would be a missed opportunity to harmonize WES and TES behavior. In the latter, there isn't an equivalent of /runs/{id}/status. Instead you can pass a view query param that can take on the values MINIMAL, BASIC and FULL, which return (more or less) the equivalents of, respectively, what the current /runs/{id}/status, the /runs/{id}/status described here and the /runs/{id} endpoints return.

Not saying that the TES solution is necessarily better, it was just always rubbing me a bit that these are implemented differently across WES and TES, and so before merging this we should perhaps have the discussion whether it wouldn't be better to make WES and TES behave similarly here, whatever the preferred solution may be.

patmagee commented 2 years ago

@uniqueg I like the idea around harmonization, esepcially as a way for making interop between the two API's easier. Another way to phrase this may be: What actually constitutes each of those levels. I suppose the current implementation could be considered MINIMAL, while the above may be BASIC. My one concern there is that if the MINIMAL view does not provide enough information to be useful to end users and API Consumers, is there a reason to keep it? If we went down that path, its possible that the default actually is BASIC?

uniqueg commented 2 years ago

@patmagee Yes, good point. I agree with you that the current implementation does not provide enough information for end users, but it may still be useful for client when they are just polling to get status updates on runs and tasks in WES and TES (especially since we don't have specified a callback mechanism yet, see #132 and there's also a corresponding issue in TES). In our use case, where we are polling TES every couple of seconds to see if the status changed, this happens a lot, and so a smaller response is preferable. In a more sophisticated client, we might query BASIC or FULL once at the beginning to populate a table of tasks/runs, then query just for the status, then query BASIC or FULL once again once a final state (error or success) has been reached.

patmagee commented 2 years ago

@uniqueg I can definitely appreciate the overhead that polling every few seconds would add on a system. However i think in this case the amount of data that we are returning, even with the additional fields is quite small. In the example above each object is approximately 125 bytes long, meaning 10,000 run status objects on a single page would only be 1.2mb. Since most systems return paginated responses, I doubt we ever would really see issues with performance from including multiple fields.

On the other hand, given how large a single workflow can actually be, I am not sure there would be a use case for the FULL query parameter on the List Run endpoint. You can have 1000's of tasks within a given workflow, so you would run into performance issues pretty quickly if you tried to use FULL when listing a large number of runs.

uniqueg commented 2 years ago

Fair points. I have no strong opinion on this except that I would really like to see the same solution for WES and TES. BTW, there is some discussion around filtering responses for listing tasks in TES, not sure if this discussion came up here as well. Generally I think that all of those discussions should be held between WES and TES champions and communities.

patmagee commented 2 years ago

@uniqueg not exactly the same use case, but I just created this pr, to handle large task lists for a specific workflow

patmagee commented 2 years ago

@uniqueg I think perfect consistency here between TES and WES will be a challenge. Implementations maybe have some but not all of the information available to them when listing all of the runs. In order to do the FULL view, you would need need to iterate over every single object in a list and get the expanded information for that object, which could be 1000's or 1,000,000's of tasks each.

IMO for WES, some sort of minimal representation that provides enough information to be useful, but not enough information to take the place of the GET /runs/{id} endpoint and be a hurtle to implementors would be ideal.

Maybe the current solution would be good without the tags since that is potentially an unbounded Map which could also be fairly substantial in size. That would mean we just have the right amount of information on start_time, end_time workflow_url, state and run_id. @wleepang wdyt?

uniqueg commented 2 years ago

@patmagee Fair enough. But I wasn't necessarily proposing that WES should adopt the solution currently implemented in TES (even though I tend to favor the way it's done in TES), but merely that TES people should join the discussion to come up with a common solution that works for both. Because I really don't see any fundamental differences in the problem space here: TES, just as WES, is potentially dealing with thousands or millions of jobs - in fact possibly more so than WES, given that - when TES is actually used - there is typically a one-to-many mapping of WES to TES jobs. Plus TES jobs that do not originate from workflow engines.

But to add, hopefully, something more constructive, here are a few points that I am considering:

I think there is a specific use case for GET /runs/{run_id}/status and we make use of it a lot (see reply above: https://github.com/ga4gh/workflow-execution-service-schemas/pull/172#issuecomment-951971821); for that use case (and also for semantic reasons), I don't see why that endpoint should supply more information than the run_id and the state
This is different for GET /runs, where I totally agree that it doesn't serve many use cases as it is implemented now
So to keep GET /runs/{run_id}/status doing its thing while extending the usefulness of responses to GET /runs, I would propose to create a different schema for GET /runs, one that includes the information as outlined in this PR.
From a user perspective though, I see tags as being one of the more useful - if not the most useful - fields; if I am filtering a list of potentially thousands of entries, workflow_url, state and run_id aren't perhaps overly helpful. I'd rather like to filter for tags that I may have provided just for the purpose of being able to find my workflow run again - even if it is, as you say, potentially unbound.
I would further like to see at least the workflow_type included.
One possibility would be to indeed have different levels (indicated by a query parameter as in TES), but drop the FULL view for performance reasons. So perhaps have a MINIMAL view with the fields you are listing, and a BASIC view that incluces tags, workflow_type and maybe the same fields from the task_logs.

Tagging the TES co-champions here: @kellrott @aniewielska