Feature request: add `started`, ... to job metadata

soxofaan commented 3 months ago

job metadata at GET /jobs/{job_id} currently lists these timestamps:

- created (required): Date and time of creation
- updated (optional): Date and time of the last status change

This is a feature request to add

- started (optional): date and time when the job was started (POST /jobs/{job_id}/results)
- stopped (optional): date and time when the job stopped running (because of reaching status finished/error/canceled)

Context: we are handling some larger openEO use cases where a significant number of jobs have to be managed. We noticed that the "created" timestamp is not always very informative, while a "started" timestamp would be more relevant. For example, jobs may be created in bulk in advance but started over a longer period, possibly hours or days after creation.

soxofaan commented 3 months ago

FYI: I'm willing to create a PR for this (should be pretty straightforward I guess). Unless there are objections to the idea in general

soxofaan commented 3 months ago

cc @HansVRP

HansVRP commented 3 months ago

sounds excellent. Is 'started' then called when running?

soxofaan commented 3 months ago

After discussing this some more, it might be more useful and scalable not to add top-level timestamps, but a "timeline" construct that keeps track of the various lifetime events of a batch job, e.g. (added comments are for illustration):

  "timeline": [
    ["created", "2017-01-01T09:32:12Z"],
    ["started", "2017-01-05T12:34:56Z"],   # user started job 4 days after creation
    ["queued", "2017-01-05T12:35:01Z"],  # reached status "queued" 5s later
    ["running", "2017-01-05T12:39:10Z"],  # reached status "running" after 4 minutes
  ],

Note that I did not define the timeline here as a mapping object, but as an array/list of tuples: it has an explicit order, and it supports repeating an event if that is necessary (e.g. restarting a job).
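To illustrate how a client might consume such a list, here is a minimal Python sketch that computes the time between consecutive events. It assumes the `timeline` field has exactly the shape proposed above (it is only a proposal, not part of the openEO API), and `parse_ts` and `phase_durations` are hypothetical helper names.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Timestamps end in "Z"; swap in an explicit UTC offset so that
    # datetime.fromisoformat also accepts them on Python versions before 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def phase_durations(timeline):
    """Return (from_event, to_event, seconds) for each consecutive event pair.

    `timeline` is a list of [event, timestamp] pairs in chronological order,
    following the shape proposed in this issue (hypothetical field).
    """
    events = [(name, parse_ts(ts)) for name, ts in timeline]
    return [
        (a, b, (tb - ta).total_seconds())
        for (a, ta), (b, tb) in zip(events, events[1:])
    ]


timeline = [
    ["created", "2017-01-01T09:32:12Z"],
    ["started", "2017-01-05T12:34:56Z"],
    ["queued", "2017-01-05T12:35:01Z"],
    ["running", "2017-01-05T12:39:10Z"],
]

durations = phase_durations(timeline)
for a, b, secs in durations:
    print(f"{a} -> {b}: {secs:.0f} s")
```

Because the list is ordered and allows repeats, a restarted job would simply show a second "started" entry, and the same pairwise computation would still apply.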

m-mohr commented 1 month ago

This sounds like a simplified version of the logs to me, so I'm a bit sceptical. You can already express that in a human-readable way in the logs through the log timestamps and corresponding messages.

soxofaan commented 1 month ago

My proposal at https://github.com/Open-EO/openeo-api/issues/542#issuecomment-2247545202 is a lot more primitive than logs. It's just a list of event-timestamp pairs (events could be a predefined enum). It's small data, so it can easily be included directly in the job metadata, with no need for an extra endpoint like logs.

But it doesn't have to be that listing. The initial question is about how to include the actual start and stop time of jobs (in addition to creation time and "last status change" time).

m-mohr commented 1 week ago

What's the use case for having start and stop time? Or is it actually the effective runtime (stop - start) that you want to get? Usually updated should be the stop time (after execution has finished), although that may differ if you make changes to the metadata of the job afterwards.

soxofaan commented 1 week ago

From our end, there are multiple use cases:

- Execution benchmarking and profiling in the context of algorithm hosting (e.g. APEx and related use cases). Here you want to build insights/stats on how long jobs are queued before running, how long they run until failure or success, ...
- Large-scale client-side batch job management. E.g. as a user I want to run hundreds/thousands of jobs, but at most a handful in parallel. To manage my resources/credits, I want to be able to kill runaway jobs.

One could get this info by actively polling the job status and checking status transitions, but if you want decent time resolution you would be forced to spam the back-end with status polling requests. However, the back-end probably has a full, exact view of the lifecycle of a batch job anyway, so it feels like a waste to try to guess all this from the client side.
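As a rough illustration of the second use case, the sketch below flags runaway jobs from a single metadata listing instead of continuous polling. It assumes each job's metadata carries the proposed optional `started` timestamp (hypothetical, not in the current API) next to the existing `id` and `status` fields; `find_runaway_jobs` is a hypothetical helper name and the job data is made up for the example.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    # Replace the trailing "Z" so fromisoformat works before Python 3.11 too.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def find_runaway_jobs(jobs, max_runtime: timedelta, now: datetime):
    """Return ids of jobs with status "running" whose (hypothetical)
    `started` timestamp lies more than `max_runtime` before `now`."""
    runaway = []
    for job in jobs:
        started = job.get("started")  # proposed optional field, may be absent
        if job.get("status") != "running" or started is None:
            continue
        if now - parse_ts(started) > max_runtime:
            runaway.append(job["id"])
    return runaway


# Made-up job metadata, for illustration only.
jobs = [
    {"id": "job-a", "status": "running", "started": "2017-01-05T12:34:56Z"},
    {"id": "job-b", "status": "running", "started": "2017-01-05T17:50:00Z"},
    {"id": "job-c", "status": "created"},
    {"id": "job-d", "status": "finished", "started": "2017-01-05T08:00:00Z"},
]

now = datetime(2017, 1, 5, 18, 0, 0, tzinfo=timezone.utc)
runaway = find_runaway_jobs(jobs, timedelta(hours=2), now)
print(runaway)
```

With a back-end-provided start time, one request per check is enough; the accuracy of the decision no longer depends on how often the client polled.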

> Usually updated should be the stop time

The problem with updated is that it only reflects the time of the last status change, so if you didn't poll in time, you might have missed the info you're after. Put differently, it forces the user to spam the back-end with status requests if they want more precise insights.

m-mohr commented 1 week ago

Just trying to understand things better right now, to get to a good solution...

First use case: Does this need to be exposed publicly though? It seems that this can be done internally.

Second use case: That's what budget was meant for, but it's specified in the currency of the backend, not in time (unless the currency is time). Isn't the actual number of consumed resources (as reported in usage - is that "live"?) more meaningful here? The plain time doesn't necessarily have any relation to the credits.

soxofaan commented 6 days ago

> First use case: Does this need to be exposed publicly though? It seems that this can be done internally.

We'd prefer to decouple the benchmarking system from the particular backends-under-test here, and use standardized metadata/reporting instead of having to invent, implement and maintain some reporting backchannel for each possible backend-under-test.

> Isn't the actual number of consumed resources (as reported in usage - is that "live"?) more meaningful here?

Credit/cost consumption is indeed important to users, but so is time consumption. Both are relevant, and they are relevant in different contexts: credit consumption is for the long-term, big-picture view ("how much will my application cost each month?"), while time consumption is important now ("it feels like my jobs are slow at the moment").

> The plain time doesn't necessarily have any relation to the credits.

Indeed that's the point of this feature request: not to replace credits/budget, but to add insights about the timing of the job
