HubSpot / Singularity

Scheduler (HTTP API and webapp) for running Mesos tasks—long running processes, one-off tasks, and scheduled jobs. #hubspot-open-source
http://getsingularity.com/
Apache License 2.0

Health check not running for ON_DEMAND task #1964

Open bmerry opened 5 years ago

bmerry commented 5 years ago

I've just started experimenting with Singularity, so apologies if I've just misunderstood how it all works.

I've created a deploy for an ON_DEMAND request with the following health check fields:

"deployHealthTimeoutSeconds": 60,
"healthcheckUri": "/health",
"healthcheckPortIndex": 1,
"healthcheckMaxTotalTimeoutSeconds": 60,

After creating a run I can see the task in the UI, where the health check section says

Beginning when Task enters running, wait a max of 45s for app to start responding, then hit /health with a 5 second timeout every 5 second(s) until: HTTP 200 is recieved

followed by a dashed box with the text "No healthchecks". The HTTP access logs for the task don't show any hits on the /health endpoint. When querying /api/tasks/ids/request/REQUEST_NAME the task shows up in notYetHealthy. After 10 minutes it's killed with the message "OVERDUE_NEW_TASK - Task did not become healthy after 10:00.000".
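For anyone reproducing this, the check against the task-ID endpoint could be scripted roughly as follows. This is only a sketch: the `/api/tasks/ids/request/...` path and the `notYetHealthy` key are taken from the description above, while the base URL, the request name, and the shape of the entries inside `notYetHealthy` are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request


def task_ids_url(base_url: str, request_id: str) -> str:
    """Build the task-ID URL for a request (base_url is deployment-specific, assumed here)."""
    quoted = urllib.parse.quote(request_id, safe="")
    return f"{base_url.rstrip('/')}/api/tasks/ids/request/{quoted}"


def not_yet_healthy(payload: dict) -> list:
    """Return the entries listed under 'notYetHealthy', or an empty list if the key is absent."""
    return list(payload.get("notYetHealthy", []))


def fetch_task_ids(base_url: str, request_id: str, timeout: float = 10.0) -> dict:
    """GET the endpoint and parse the JSON body (needs a reachable Singularity instance)."""
    with urllib.request.urlopen(task_ids_url(base_url, request_id), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical payload; the real entries may be objects rather than strings.
    sample = {"healthy": [], "notYetHealthy": ["example-task-id"]}
    print(not_yet_healthy(sample))
```

Calling `fetch_task_ids(...)` and passing the result to `not_yet_healthy(...)` would show whether the task is still waiting on a health check that never runs.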

If I click on the "/health" link in the UI it shows a correct health page, which gives me some confidence that I've got the port mapping set up right.

I'm using a local docker-compose setup for testing, with the following images:

I'm using the Docker containerizer with BRIDGE networking and not using the Singularity executor, in case that makes a difference.

ssalinas commented 5 years ago

For ON_DEMAND tasks we don't actually run health checks, as they don't really have any bearing on one-off tasks. I realize the UI is likely confusing here, and that's something we can fix (the backend currently doesn't stop you from specifying those options even if they aren't being used). Healthchecks are only run for worker/service types, where we need to know whether something is healthy, e.g. to ensure a replacement instance is healthy before shutting an old one down.

bmerry commented 5 years ago

Ok, I can see the argument for not running the health check on ON_DEMAND tasks. I'm using Singularity in a slightly odd way, which is why I was trying to define a health check, but I've got alternative tools I can use to monitor health.

Perhaps the API should prevent the checks being defined in the first place, to stop people like me from shooting themselves in the foot? Or perhaps they should be fully ignored, so that the task doesn't get killed 10 minutes later due to not having become healthy?
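As an illustration of monitoring health outside Singularity, here is a minimal sketch of an external poller. It mirrors the schedule in the UI text quoted earlier (wait up to 45s, 5 second timeout, every 5 seconds, until HTTP 200), but it is not how Singularity implements its checks. The function names, the injectable `clock`/`sleep` parameters, and the example URL are all assumptions.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True only if a GET on `url` answers HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and non-2xx responses (HTTPError is an OSError).
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    startup_timeout: float = 45.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` every `interval` seconds until it succeeds or the deadline passes."""
    deadline = clock() + startup_timeout
    while True:
        if probe():
            return True
        if clock() + interval > deadline:
            # The next attempt would start after the deadline, so give up.
            return False
        sleep(interval)


if __name__ == "__main__":
    # Hypothetical host and port; substitute the task's mapped health port.
    ok = wait_until_healthy(lambda: http_probe("http://localhost:8080/health"))
    print("healthy" if ok else "not healthy within startup window")
```

Passing the probe in as a callable keeps the timing loop separate from the HTTP details, so the same loop can wrap whatever check the higher-level system already uses.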

ssalinas commented 5 years ago

Oh, I read past the fact that it got killed after 10 mins. Will have to take a closer look at that.

ssalinas commented 5 years ago

For the moment, though, I'd recommend what you said about doing health monitoring in a different way. As an aside, what type of use case do you have for an ON_DEMAND request with health checks? It seems to me that anything long-running with health checks should be a worker/service instead anyway.

bmerry commented 5 years ago

It's part of the software for a large radio telescope. Each observation is managed by one of these jobs, which typically last for a few hours to a day. If one fails, it shouldn't be automatically restarted because higher-level systems have to deal with the failure and rescheduling, which is why I didn't use a worker/service.

In theory it could probably persist state and pick up the pieces if it died and was automatically restarted, but it's not been a priority.