aws / aws-xray-daemon

The AWS X-Ray daemon listens for traffic on UDP port 2000, gathers raw segment data, and relays it to the AWS X-Ray API.

Single XRay Daemon in ECS Cluster not sending traces #53

Closed mauroartizzu closed 1 year ago

mauroartizzu commented 4 years ago

Hello,

Just like in #24, I am trying to deploy a single X-Ray daemon to serve every service in my ECS cluster.

If I deploy the agent as a container inside the service, everything works fine. If I try to deploy it as a separate service (behind a load balancer and tied to Route53), my services are unable to send segments to the daemon.

I can correctly see

[Debug] Send xx telemetry record(s)

when the agent runs inside the service and the service is configured with AWS_XRAY_DAEMON_ADDRESS=xray-agent:2000

But if I try to reach it from outside the service, with the daemon deployed as a separate microservice, I only get

[Debug] Skipped telemetry data as no segments found

using AWS_XRAY_DAEMON_ADDRESS=xray.myenvironment.mydomain:2000

Note that the host is reachable from my local machine, from inside the EC2 host, and from inside the specific container, so it's not a network issue.

The task has the same IAM policy attached, and the security group allows my whole VPC to reach port 2000 via UDP/TCP:

- PolicyName: xray-writeonly
  PolicyDocument:
    Statement:
      - Action:
          - "xray:PutTraceSegments"
          - "xray:PutTelemetryRecords"
        Effect: "Allow"
        Resource:
          - "*"

And this is the relevant container definition from the CloudFormation template:

- Name: xray-agent
  Essential: true
  Image: amazon/aws-xray-daemon
  Cpu: 32
  Command:
    - --log-level=dev
  Memory: 64
  PortMappings:
    - ContainerPort: 2000
      HostPort: 0
      Protocol: udp
  Environment:
    - Name: AWS_DEFAULT_REGION
      Value: !Ref "AWS::Region"
    - Name: AWS_REGION
      Value: eu-west-1
    - Name: AWS_SDK_LOAD_CONFIG
      Value: "1"
  LogConfiguration:
    LogDriver: awslogs
    Options:
      awslogs-group: !Ref AWS::StackName
      awslogs-region: !Ref AWS::Region
      awslogs-stream-prefix: "xray-agent"

Since it isn't outputting any error logs, it's difficult for me to debug this.

thanks

willarmiros commented 4 years ago

Hi @mauroartizzu, Thank you for raising this issue. I'll be looking into it with the ECS team to determine if communication via UDP in a setup like yours is possible.

To be clear, this is my understanding of your setup, please correct me if I'm wrong:

ALB (xray.host.com) -> ECS (daemon)
  ^          
   \____________________ (send segments via UDP to xray.host.com:2000)
                        \
ALB (host.com)  -----> ECS (your service)

willarmiros commented 4 years ago

One potential problem is that I do not see you setting the UDP or TCP address that the daemon should be listening on, so it will be listening on localhost:2000 by default. See the docs for how to configure this binding.
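For example, overriding both bindings in a container definition like yours could look something like this (a minimal sketch; binding to 0.0.0.0 is an assumption so the daemon listens on all interfaces rather than only on loopback):

```yaml
Command:
  - --log-level=dev
  # Assumed values: listen on all interfaces instead of the default
  # 127.0.0.1:2000 bindings for UDP and TCP respectively.
  - --bind=0.0.0.0:2000
  - --bind-tcp=0.0.0.0:2000
```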

Also, you are setting AWS_XRAY_DAEMON_ADDRESS in your application containers where the X-Ray SDK is running, right?

mauroartizzu commented 4 years ago

I just discovered that over TCP the daemon only acts as a proxy for SDK API calls; it does not accept the same payload as it does over UDP.

The problem is that, as far as I know, an ALB only accepts TCP and HTTP; correct me if I am wrong.

I would have to move to an NLB, which is not possible at the moment.

The binding was set up correctly with both --bind and --bind-tcp (I tried all the combinations), and the daemon address variable was set correctly as well. I also modified my socket creation code to send over TCP instead of UDP; it connects fine, but as I said, the daemon does not accept the same payload.

I think the problem is just this: the ALB rejects UDP calls, and over TCP the daemon only acts as a proxy to the standard client. So for the moment I am stuck making calls through the SDK like I was doing before, instead of using a dedicated agent.

mauroartizzu commented 4 years ago

The setup was right; the same ALB serves the microservices via TCP on port 80 and the daemon via UDP on port 2000.

I could open a dedicated NLB just for X-Ray, which might solve the issue, but it's cost-prohibitive for us to run a dedicated load balancer just for the agent.

willarmiros commented 4 years ago

Ah, I see. Yes, according to the FAQs it appears that only NLBs accept UDP connections; ALBs do not. As you noticed, the daemon uses TCP and UDP connections for different purposes: it accepts (sub)segments from the SDKs over UDP, but it sends them to the X-Ray backend over TCP. You would not be able to change the emitter in the SDK to use TCP, since the daemon only accepts (sub)segments over UDP.

It would appear the only solutions are to either have the daemon within the same service or change the daemon "front-end" to NLB.
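For anyone who does go the NLB route, a rough CloudFormation sketch of the UDP wiring might look like the following (resource names, parameters, and the health-check choice are assumptions, not a tested template):

```yaml
  XRayNLB:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Type: network
      Scheme: internal
      Subnets: !Ref PrivateSubnetIds   # assumed List<AWS::EC2::Subnet::Id> parameter

  XRayUdpTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Protocol: UDP
      Port: 2000
      TargetType: ip                   # register task ENIs directly (awsvpc/Fargate)
      VpcId: !Ref VpcId                # assumed parameter
      # UDP target groups need a TCP/HTTP(S) health check; the daemon's TCP
      # listener on 2000 can answer it if --bind-tcp is opened up as well.
      HealthCheckProtocol: TCP
      HealthCheckPort: "2000"

  XRayUdpListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref XRayNLB
      Protocol: UDP
      Port: 2000
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref XRayUdpTargetGroup
```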

willarmiros commented 4 years ago

Hi @mauroartizzu, I discussed this with the ECS team, and they've recommended configuring the X-Ray daemon service with the DAEMON scheduling strategy, as described in these docs: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html

This way, you don't need to bother with the costs & configuration associated with deploying the daemon as a standalone service behind a load balancer. It deploys one daemon per container instance automatically, which means all your tasks will be able to communicate with the daemon on their instance using localhost as intended.
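In CloudFormation this is roughly a one-property change on the service resource (a minimal sketch; the cluster and task definition references are placeholders, and the strategy targets EC2 container instances):

```yaml
  XRayDaemonService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref EcsCluster                  # placeholder reference
      TaskDefinition: !Ref XRayDaemonTaskDef    # placeholder reference
      LaunchType: EC2
      # DAEMON places exactly one copy of the task on every container
      # instance in the cluster, so no DesiredCount is specified.
      SchedulingStrategy: DAEMON
```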

I apologize for not having further documentation on the X-Ray-specific use case; this feature was released after the X-Ray and ECS integration, so we will be working on documentation in the future.

mauroartizzu commented 4 years ago

@willarmiros that was the first path I took, and it worked seamlessly. The thing is, we have hundreds of microservices, and even with the memory settings kept low that results in a hundred daemons. We tried it, but we ended up growing the number of EC2 instances needed per cluster.

Anyway, thank you a lot for the clarifications :) I now have a much better understanding of how the daemon works, and this issue might serve as a solution for anyone else facing the same problem.

willarmiros commented 4 years ago

@mauroartizzu Glad to hear! I'll close this issue then, feel free to recomment if you have further difficulties.

Stexxen commented 4 years ago

From Comment - https://github.com/aws/aws-xray-daemon/issues/53#issuecomment-647820079

> I discussed this with the ECS team, and they've recommended configuring the X-Ray daemon service with the DAEMON scheduling strategy, as described in these docs: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html
>
> This way, you don't need to bother with the costs & configuration associated with deploying the daemon as a standalone service behind a load balancer. It deploys one daemon per container instance automatically, which means all your tasks will be able to communicate with the daemon on their instance using localhost as intended.

That method only appears to be appropriate if you are not using Fargate.

As I am using Fargate, what is the preferred strategy for multi-AZ? My thoughts go down the route of creating another service with the xray-daemon as the container and having at least one instance per zone, but we would still need that behind an ELB in case the node in a specific zone died.

Does the ECS team have other recommendations for X-ray daemons with Fargate containers?

ybron commented 3 years ago

Running X-Ray as a daemon on ECS seems like it would invite trouble: once the underlying EC2 host enters the DRAINING state, ECS will kill off the X-Ray daemon task, leaving any still-running (but exiting/stopping) tasks unable to send trace information. I briefly considered running a couple of other tasks as daemons instead of sidecars after reading this, but if they all die off immediately upon draining, that's a non-starter.

ryanmorseglu commented 3 years ago

Hi @Stexxen any luck finding a solution for Fargate containers? I'm still trying to track this down myself. Thanks!

Stexxen commented 3 years ago

@ryanmorseglu Unfortunately not.

@willarmiros I think the ticket should be reopened, on the basis that AWS needs to revisit the solution they recommended, as it doesn't work with Fargate.

ryanmorseglu commented 3 years ago

To clarify our use case: I'm trying to configure our CloudFormation YAML so that we set up one X-Ray daemon instance/task in the cluster and can have many other server instances/tasks send trace data to that single daemon. I've not found a Fargate solution yet (we're using awsvpc networking mode). I have this working in a local-docker setup on my Windows machine. The breakdown occurring on Fargate/ECS seems to be that the services cannot translate "xray-daemon:2000" into an IP, which works fine within local-docker. (That "xray-daemon:2000" comes from the env var AWS_XRAY_DAEMON_ADDRESS.) I even tried using the private IP of the xray-daemon, as I read in the docs that it should work for task communication on Fargate, but the result did not change. For now, we're running the side-car method where every service task runs its own X-Ray daemon that can be accessed via "localhost:2000". Thanks all!

KeynesYouDigIt commented 3 years ago

I am running the sidecar setup (daemon on localhost, and I double-checked that the emitter is pointing there correctly). I can get my application to run no problem, but I am not seeing any of my segments show up in the X-Ray console. What's the best way to debug this? Can I set the recorder to throw an exception if it doesn't reach the right service? What else should I be looking at?

Here's the recorder config; it's the only setup I do on the recorder, so I wonder if it's missing a piece.

(It's running in ECS Fargate, and it does not send anything to X-Ray when I build and run locally either.)


```python
xray_recorder.configure(
    service='myecstask',
    sampling_rules={
        "version": 1,
        "default": {
            # Record the first 2 requests made in a second.
            "fixed_target": 2,
            # Sampling rate AFTER the first 2 requests per second.
            "rate": 0.1,
        },
    }
)
```

willarmiros commented 3 years ago

> The breakdown occurring on Fargate/ECS seems to be that the services cannot translate "xray-daemon:2000" into an IP, which works fine within local-docker

@ryanmorseglu That's right. This is because Fargate tasks must be run in awsvpc networking mode, which doesn't understand links like bridge networking mode does. In our docs, under "Example VPC task definition" you can find an example of setting up the X-Ray daemon in a task definition that omits the use of links. Please let me know if that satisfies your use case.
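Stripped down, that docs example amounts to something like the sketch below (a CloudFormation approximation rather than the exact snippet; the application container's name and image are placeholders):

```yaml
  AppTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: my-service              # placeholder
      NetworkMode: awsvpc             # required on Fargate; no Links allowed
      RequiresCompatibilities:
        - FARGATE
      Cpu: "256"
      Memory: "512"
      ContainerDefinitions:
        - Name: my-app                # placeholder application container
          Image: my-app-image         # placeholder
          Essential: true
          # In awsvpc mode both containers share the task's ENI, so the
          # SDK's default daemon address of 127.0.0.1:2000 works without links.
        - Name: xray-daemon
          Image: amazon/aws-xray-daemon
          Essential: false
          Cpu: 32
          Memory: 64
          PortMappings:
            - ContainerPort: 2000
              Protocol: udp
```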

willarmiros commented 3 years ago

@KeynesYouDigIt I'd suggest taking a look at the example task definition I referenced above and ensuring you're not using links. Another thing to check is that your task role has write permission for the X-Ray service. If you're still having issues please open a separate GitHub issue and post:

  1. Your task definition
  2. Your X-Ray daemon logs
  3. Your application logs with X-Ray debug mode enabled

ryanmorseglu commented 3 years ago

Hi @willarmiros, thanks for the info, but I had already seen that. That's actually the side-car approach (two containers in one task definition), which is what I'm using for the moment, but it is not the goal / desired usage. We would prefer to run a single, isolated xray daemon task for the cluster and have N servers utilize it, rather than a 1:1 daemon-to-server-instance ratio. Cheers!

Stexxen commented 3 years ago

@willarmiros I second @ryanmorseglu's comments; adding a side-car to every task will help AWS's bottom line, but not ours ;-)

For reference, my original query is below, which aligns with what Ryan is asking:

> As I am using Fargate, what is the preferred strategy for multi-AZ? My thoughts go down the route of creating another service with the xray-daemon as the container and having at least one instance per zone, but we would still need that behind an ELB in case the node in a specific zone died.

KeynesYouDigIt commented 3 years ago

I think permissions were my issue. Is there a log in CloudTrail or something similar where I would see errors? It feels like it's failing very silently, but that might be my own lack of knowledge.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in next 7 days. Thank you for your contributions.