NVIDIA / NVFlare

NVIDIA Federated Learning Application Runtime Environment
https://nvidia.github.io/NVFlare/
Apache License 2.0

Health check #240

Open Nintorac opened 2 years ago

Nintorac commented 2 years ago

Not sure what this will look like. Total gRPC noob, sorry.

I am trying to stand up an FL server in AWS ECS behind a load balancer. This requires the service to have a health check that responds as healthy to the load balancer's health probes.

Here is the CDK object I need to configure that defines how the health check is performed.

And here are the gRPC docs on health checking.

Is there already a health check? a) If so, how should I configure the CDK HealthCheck? b) If not, is there a workaround I can use in the meantime?

Thanks!

Nintorac commented 2 years ago

I have got this working using the /fedlearn.FederatedTraining/Heartbeat path and accepting response codes 0-99 as success.

Is there a more specific number or range that indicates a healthy heartbeat, or would it be better to implement a dedicated endpoint for health checks?
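
For context, a rough sketch of what accepting 0-99 amounts to. The standard gRPC status codes only run from 0 to 16, so a 0-99 range matches any gRPC reply at all, including error statuses. The matcher syntax below (comma-separated values and `low-high` ranges) is an assumption about how the load balancer reads the setting, and `parse_matcher` and `is_healthy` are hypothetical helper names, not part of NVFlare or CDK.

```python
from typing import Set

# Standard gRPC status codes (0-16) from the gRPC specification.
GRPC_STATUS = {
    0: "OK", 1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED", 5: "NOT_FOUND", 6: "ALREADY_EXISTS",
    7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED", 9: "FAILED_PRECONDITION",
    10: "ABORTED", 11: "OUT_OF_RANGE", 12: "UNIMPLEMENTED", 13: "INTERNAL",
    14: "UNAVAILABLE", 15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}


def parse_matcher(matcher: str) -> Set[int]:
    """Expand a matcher such as "0", "0,12" or "0-99" into a set of codes.

    The syntax is assumed, not taken from any particular load balancer.
    """
    codes: Set[int] = set()
    for part in matcher.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            codes.update(range(int(low), int(high) + 1))
        else:
            codes.add(int(part))
    return codes


def is_healthy(status_code: int, matcher: str) -> bool:
    """Return True if the gRPC status code falls inside the matcher."""
    return status_code in parse_matcher(matcher)


# Every defined gRPC status passes a "0-99" matcher, so the probe only
# proves that something answered in gRPC, not that the call succeeded.
all_pass = all(is_healthy(code, "0-99") for code in GRPC_STATUS)
```

Under these assumptions a narrower matcher such as "0" would accept only an OK reply, which a probe without valid client credentials may never receive.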

holgerroth commented 2 years ago

@yhwen, @nvidianz any comments on this one?

nvidianz commented 2 years ago

You can simply use TCP as the protocol and just check whether the port is open. This works in all cases, even in TLS pass-through mode. Heartbeat doesn't provide any more information about the server's health.

We plan to add a real health check endpoint in a future release.
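
A TCP check of this kind only verifies that something accepts connections on the port. As a minimal sketch of the same probe done by hand (for example, to test the server before putting it behind the load balancer), the snippet below uses Python's standard `socket` module. The function name `port_is_open` and the host and port in the usage comment are placeholders, not values from NVFlare.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; no data is sent,
        # so this works the same whether or not the server speaks TLS.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Usage (placeholder address; substitute the one your FL server listens on):
#   port_is_open("fl-server.example.internal", 8002)
```

Because the probe never sends application data, it cannot tell a healthy server from one that is listening but stuck, which is the trade-off of a port-only check.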