HubSpot / Singularity

Scheduler (HTTP API and webapp) for running Mesos tasks—long running processes, one-off tasks, and scheduled jobs. #hubspot-open-source
http://getsingularity.com/
Apache License 2.0
823 stars, 188 forks

Query by task or by request? #2193

Closed mikebell90 closed 3 years ago

mikebell90 commented 3 years ago

In the end, my goal is simply to check: "Is a specific task in the running state, and for how long?"

I have traditionally queried by requestId and then filtered down to the specific task (e.g. `/request/{requestId}`).

Is there a gain, particularly with, say, 50-100+ instances, in doing a direct query by task (`/task/{taskId}`)? That endpoint returns a lot more data, so it seemed non-obvious which would be more performant.

ssalinas commented 3 years ago

It depends on whether you are interested in the state of all of the tasks or just a single one. If just one, you could always use the task state endpoint. If you are getting all of them for a request at once, the task-ids-by-status call is the most efficient option there.
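The by-status approach above can be sketched roughly as follows. `TaskIdsByStatus` here is a hypothetical stand-in for Singularity's by-status response (the real client object exposes similar healthy/cleaning/not-yet-healthy getters); the point is just merging the three buckets and checking membership:

```java
import java.util.List;
import java.util.stream.Stream;

public class TaskStateCheck {
  // Hypothetical stand-in for the task-ids-by-status response body;
  // the real class has comparable getters for the three buckets.
  record TaskIdsByStatus(List<String> healthy, List<String> cleaning, List<String> notYetHealthy) {}

  // Is the given task id present in any of the three "active" buckets?
  static boolean isActive(TaskIdsByStatus byStatus, String taskId) {
    return Stream.of(byStatus.healthy(), byStatus.cleaning(), byStatus.notYetHealthy())
        .flatMap(List::stream)
        .anyMatch(taskId::equals);
  }

  public static void main(String[] args) {
    TaskIdsByStatus byStatus = new TaskIdsByStatus(
        List.of("task-1"), List.of("task-2"), List.of());
    System.out.println(isActive(byStatus, "task-2")); // true
    System.out.println(isActive(byStatus, "task-9")); // false
  }
}
```

One request-scoped call returns every active task id, so a single fetch can answer the question for all instances of a request at once.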

mikebell90 commented 3 years ago

So let me be clearer as to the goal:

In plain English: "Is this task in a running or cleaning state, does it belong to a specific request ID, and has it started in the last N minutes?"

```java
try (Stream<SingularityTaskId> singularityTaskIdStream = Stream.of(
        singularityTaskIdsByStatus.getHealthy().stream(),
        singularityTaskIdsByStatus.getCleaning().stream(),
        singularityTaskIdsByStatus.getNotYetHealthy().stream())
    .flatMap(t -> t)) {
  // filter the combined stream down to the task we care about
}
```

We take that stream and filter it. Currently we use `/api/tasks/ids/request/%s`, but we are finding that it shows increasingly bad performance as we've scaled the cluster up.
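The plain-English predicate above (running or cleaning, matching request, started within N minutes) could look roughly like this. `TaskId` is a hypothetical stand-in for `SingularityTaskId`; the real class carries the request id and a started-at timestamp among other fields:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.stream.Stream;

public class RecentTaskFilter {
  // Hypothetical stand-in for SingularityTaskId (request id + start time in epoch millis).
  record TaskId(String requestId, long startedAtMillis) {}

  // True if any task in the stream belongs to requestId and started within the window.
  // The stream is assumed to already be the merged running/cleaning buckets.
  static boolean startedRecently(Stream<TaskId> runningOrCleaning,
                                 String requestId,
                                 Duration window,
                                 Instant now) {
    Instant cutoff = now.minus(window);
    return runningOrCleaning
        .filter(t -> t.requestId().equals(requestId))
        .anyMatch(t -> Instant.ofEpochMilli(t.startedAtMillis()).isAfter(cutoff));
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2024-01-01T01:00:00Z");
    TaskId recent = new TaskId("req-A", Instant.parse("2024-01-01T00:58:00Z").toEpochMilli());
    System.out.println(startedRecently(Stream.of(recent), "req-A", Duration.ofMinutes(5), now)); // true
  }
}
```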

ssalinas commented 3 years ago

Ok, that first task state call (`/api/track/task/{taskId}`) is likely what you want then. It will check the single task's data, not the whole request, falling back to task history if the task isn't in the active data (if you are using MySQL for history). We created the `/track` API for use cases like that, since the regular task API is split into active vs. history, which can make it hard to work with.
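A minimal sketch of building that track-api call with `java.net.http`; the base URL and task id here are placeholders, and the request is constructed but not sent:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class TrackTaskLookup {
  // Build the track-api URI for a single task; baseUrl is your Singularity host.
  static URI trackTaskUri(String baseUrl, String taskId) {
    return URI.create(baseUrl + "/api/track/task/" + taskId);
  }

  public static void main(String[] args) {
    HttpRequest request = HttpRequest.newBuilder(
        trackTaskUri("http://singularity.example.com", "my-task-id")).GET().build();
    System.out.println(request.uri());
    // http://singularity.example.com/api/track/task/my-task-id
  }
}
```

Sending it with `HttpClient.newHttpClient().send(...)` returns the task's state payload directly, without pulling the rest of the request's tasks.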

mikebell90 commented 3 years ago

Hmm. Well, maybe not. I'm consistently getting WORSE performance that way.

Let me explain the scenario. This is a test running approximately 4-5 requests on a one-minute interval (5 instances, each with a fixedRate of 1 minute, on a single-threaded executor).

The query in the end takes a taskId and a requestId as input. The old version called `/request/{requestId}` and then looked for the matching taskId.

The new one queries `/api/track/task/{taskId}` and then checks whether it has a matching requestId.

What I'm finding in two different environments (one with 130 agents, one with about 400) is:

a) The taskId API doesn't have as many outliers, but its outliers are much worse (12-15 seconds!)
b) The requestId API has many more outliers, but they are consistently 1-2.5 s
c) Neither is performing well compared to how it "used" to, but the factors that have changed (number of agents, etc. have increased) are hard to eliminate

ssalinas commented 3 years ago

If you are using SQL, have you checked the metrics for your database and/or ZK cluster? We have ~700 agents, ~22k tasks, and ~13k requests, and that endpoint is consistently sub-second for me.