aws / aws-xray-daemon

The AWS X-Ray daemon listens for traffic on UDP port 2000, gathers raw segment data, and relays it to the AWS X-Ray API.
Apache License 2.0

Add health endpoint #9

Open Mortinke opened 6 years ago

Mortinke commented 6 years ago

Referring to the "pinging the xray daemon" forum post, it would be nice if a health endpoint could be added to the X-Ray daemon.

Currently, we use a crontab bash script in the ECS launch configuration to check the status of the X-Ray daemon. Instead, we would like to use a health endpoint.

yogiraj07 commented 6 years ago

Hi @Mortinke, we appreciate your feature request. I will add this to our backlog. You are also welcome to submit a PR. Thanks for your patience, and stay tuned.

Best, Yogi

Sweathered commented 5 years ago

+1 for this. I've written a Stack Overflow question with a similar need: https://stackoverflow.com/questions/54119916/how-to-create-a-health-check-for-xray-daemon-task

jason-riddle commented 5 years ago

In addition to the great answer on @Sweathered's Stack Overflow post, I've been doing the following instead. I have this in my docker-compose.yml.

docker-compose.yml

version: '3.5'

x-global: &global
  AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE
  AWS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  AWS_REGION: us-east-1

services:
  xray:
    image: amazon/aws-xray-daemon
    command: --config /xray-daemon.yml
    hostname: xray
    environment:
      <<: *global
    ports:
      - 2000:2000/udp
    volumes:
      - ./integration/xray/xray-daemon.yml:/xray-daemon.yml:ro
    healthcheck:
      test: timeout 1 /bin/bash -c '</dev/tcp/localhost/2000 && </dev/udp/localhost/2000'
      interval: 2s
      timeout: 2s
      retries: 1

I'm relying on the timeout command (part of coreutils, not a bash built-in) and on bash's native ability to open tcp/udp sockets through /dev/tcp and /dev/udp to do the health check. Compare this with the solution from Stack Overflow:

CMD-SHELL, netstat -aun | grep 2000 > /dev/null; if [ 0 != $? ]; then exit 1; fi;

I'd rather not depend on netstat and grep being available, have to pipe since stderr is dropped, or write a single-line if statement that is mostly readable but takes longer than a second to understand. However, it would be better if the xray daemon supported this natively, perhaps through a /ready or /healthz endpoint. Being able to connect to the tcp/udp socket does not necessarily mean everything is functioning as expected: there may be an internal issue (running low on space, a high number of goroutines running, a connectivity issue to AWS, etc.), or the daemon may still be starting up even after the socket is ready to accept connections.

ghost commented 4 years ago

I am starting to wonder whether it is possible to run the X-Ray daemon as a separate service in ECS Fargate without paying for a coupled container just to do the health check. Disabling the health check doesn't sound safe.

shengxil commented 4 years ago

Hi IrmantasM,

There is a workaround: scan the X-Ray daemon log. The daemon sends telemetry data once per minute. If it is still running, you will find one of these entries every minute:

2020-03-16T19:10:43Z [Debug] Send 1 telemetry record(s)
2020-03-16T19:11:43Z [Debug] Skipped telemetry data as no segments found

Basically, the daemon has stopped if there is no new log entry within 1 minute.
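A minimal sketch of that log check, assuming the daemon log is written to a file the checker can read and that GNU date and grep are available. The function name log_is_fresh, the example log path, and the 90-second threshold are illustrative assumptions, not something the daemon provides:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the newest timestamped line in the
# log is at most max_age seconds old. Relies on GNU date (-d) and grep.
log_is_fresh() {
  local log_file=$1
  local max_age=${2:-90}              # assumed threshold: one interval plus slack
  local now=${3:-$(date -u +%s)}      # overridable so the check can be tested
  local last_ts last_epoch

  # Take the ISO-8601 UTC timestamp at the start of the last matching line.
  last_ts=$(grep -Eo '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z' "$log_file" 2>/dev/null | tail -n 1)
  [ -n "$last_ts" ] || return 1       # no log file or no timestamped entries

  last_epoch=$(date -u -d "$last_ts" +%s) || return 1
  [ $(( now - last_epoch )) -le "$max_age" ]
}

# Example (the path is an assumption):
#   log_is_fresh /var/log/xray/xray.log 90 || exit 1
```

This only shows that the daemon is still writing log lines; it says nothing about whether segments are actually reaching AWS.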

Tro95 commented 4 years ago

An issue that occurred with AWS yesterday prevented all my Fargate tasks in eu-west-1 from writing logs and metrics to CloudWatch, even though the essential containers within the tasks remained healthy, my services stayed operational, and my xray traces were still being received correctly. Relying on the presence of logs seems like a bad way to perform a health check, and in this scenario it was a false positive. It would be nice if the xray daemon had an actual healthcheck endpoint rather than relying on external dependencies that can break, and if it integrated much better with Fargate.

wangzlei commented 4 years ago

Hi Tro95,

Does the workaround in https://stackoverflow.com/questions/54119916/how-to-create-a-health-check-for-xray-daemon-task work for you? At present the X-Ray daemon does not provide a health check, so the possible way is to ping port 2000 from the ECS HealthCheck. There is also a security concern: the X-Ray daemon does not want to expose a public HTTP endpoint. Even if X-Ray provided a health check that only accepted HTTP requests from localhost, it would still have to be bound to the ECS HealthCheck.

Thanks.

Tro95 commented 4 years ago

The StackOverflow solution involves building and maintaining my own image, which I'd prefer not to do. I would be fine having a healthcheck endpoint only available to localhost, because AWS X-Ray can make use of the HEALTHCHECK Dockerfile command which ECS can use.

nathanpeck commented 3 years ago

As of June 2021, I have discovered that the X-Ray image now ships with just a statically linked binary. It no longer has a full environment, so there is no shell to execute health checks. The only way to health check is externally; you can't run commands inside the container anymore.
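A rough sketch of such an external probe, run from a host or sidecar that does have bash and coreutils. The function name port_open and the host name xray are assumptions for illustration, and the probe only works if the daemon's TCP port 2000 is reachable from wherever it runs:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within 1 second. Uses bash's /dev/tcp redirection in a child
# shell, so the socket is closed again as soon as the child exits.
port_open() {
  local host=$1 port=$2
  timeout 1 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example (host name is an assumption, e.g. a compose service name):
#   port_open xray 2000 || exit 1
```

As noted earlier in the thread, a successful connect only shows that the socket is open, not that the daemon is working end to end.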

cageyv commented 3 years ago

I tried different options. At the moment, we can call the xray utility itself. This example makes the container "forever healthy" and just verifies that the xray binary exists.

If a --status flag were added that displayed the state of the application, we would be able to check the app itself.

docker-compose.yml

version: "3.9"
services:
  xray:
    image: public.ecr.aws/xray/aws-xray-daemon:3.3.2
    command: --local-mode --log-level warn --region eu-central-1
    healthcheck:
      # Exec form runs /xray directly without a shell, so "|| exit 1" is not
      # needed; the exit code of /xray --version is used as the result.
      test: ["CMD", "/xray", "--version"]
      interval: 5s
      timeout: 2s
      retries: 3
      start_period: 5s

ECS task definition

"healthCheck": {
  "retries": 3,
  "command": ["CMD", "/xray", "--version"],
  "timeout": 2,
  "interval": 5,
  "startPeriod": 10
}

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in next 7 days. Thank you for your contributions.

Andrey9kin commented 2 years ago

still relevant

JohnPreston commented 2 years ago

Checking that the port is open is a great idea, but I would side with having a proper way to probe that xray-daemon is running properly. curl etc. might be one way, although an xray command option that probes for an existing running process might be better, similar to the --status idea mentioned above. For example, in Python, if you run apps within supervisord, you can ask it to tell you whether the tasks are running and healthy.

mauritz-lovgren commented 1 year ago

The AWS OpenTelemetry alternative can be used for both x-ray and metrics and it has a health check endpoint. But I have some issues with the x-ray portion of it just now, forcing me back to the 'AWS native' x-ray daemon image.

StefanPrintezis commented 11 months ago

Another workaround: following the xray daemon ECS build guide, I extended the image by adding nc with yum.

Healthcheck on task def looks like this:

command: ['CMD-SHELL', 'nc -z localhost 2000 || exit 1'],