Closed mauroartizzu closed 1 year ago
Hi @mauroartizzu, Thank you for raising this issue. I'll be looking into it with the ECS team to determine if communication via UDP in a setup like yours is possible.
To be clear, this is my understanding of your setup, please correct me if I'm wrong:
ALB (xray.host.com) -> ECS (daemon)
 ^
  \____________________ (send segments via UDP to xray.host.com:2000)
                        \
ALB (host.com) -----> ECS (your service)
One potential problem is that I do not see you setting the UDP or TCP address that the daemon should listen on, so it will be listening on localhost:2000 by default. See the docs for how to configure this binding.
Also, you are setting the AWS_XRAY_DAEMON_ADDRESS in your application containers where the X-Ray SDK is running, right?
I just discovered that over TCP the daemon only acts as a proxy for SDK API calls; it does not accept the same payload as over UDP.
The problem is afaik ALB only accepts TCP and HTTP, correct me if I am wrong.
I should move to NLB, which is not possible at the moment.
The binding was set up correctly with both --bind and --bind-tcp. I tried all the combinations, and the daemon address variable was set up correctly too. I also modified my socket creation code to send over TCP instead of UDP. It connects fine, but as I said, the daemon does not accept the same payload there.
I think the problem is just this: ALB rejects UDP calls, and over TCP the daemon only acts as a proxy for the standard client. So for the moment I am stuck making calls through the SDK like I was doing before, instead of using a dedicated agent.
The setup was right. The same ALB serves me the microservices via TCP port 80 and the daemon via UDP 2000.
I should open up a dedicated NLB just for xray, it might solve the issue, but it's cost prohibitive for us to use a dedicated load balancer just for the agent
Ah I see, yes according to the FAQs it appears only NLBs accept UDP connections and ALB does not. The daemon uses TCP and UDP connections for different purposes as you noticed. The daemon accepts (sub)segments from the SDKs over UDP, but it sends them to the X-Ray backend over TCP. You would not be able to change the emitter on the SDK to use TCP, since the daemon only accepts (sub)segments on UDP.
It would appear the only solutions are to either have the daemon within the same service or change the daemon "front-end" to NLB.
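For anyone who wants to test the UDP path by hand, here is a minimal sketch of how a segment can be sent to the daemon without the SDK. It assumes the documented daemon format of a JSON header line followed by the segment document, and a daemon on 127.0.0.1:2000. The helper names (`build_udp_payload`, `make_test_segment`, `send_segment`) are made up for illustration and are not part of any SDK.

```python
import json
import os
import socket
import time

# Header line the daemon expects before each segment document.
DAEMON_HEADER = '{"format": "json", "version": 1}\n'


def build_udp_payload(segment: dict) -> bytes:
    """Prefix a segment document with the daemon header and encode it."""
    return (DAEMON_HEADER + json.dumps(segment)).encode("utf-8")


def make_test_segment(name: str) -> dict:
    """Build a minimal, already-finished segment (hypothetical helper)."""
    now = time.time()
    return {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits
        "trace_id": "1-{:08x}-{}".format(int(now), os.urandom(12).hex()),
        "start_time": now - 0.05,
        "end_time": now,
    }


def send_segment(segment: dict, host: str = "127.0.0.1", port: int = 2000) -> None:
    """Fire-and-forget send: UDP gives no confirmation of delivery."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(build_udp_payload(segment), (host, port))
```

Calling `send_segment(make_test_segment("udp-test"))` from inside the application container should make a trace appear in the console if the daemon is reachable and has permissions. Because the send is a plain UDP datagram, nothing on the sending side fails when a load balancer in between drops it.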
Hi @mauroartizzu, After discussing with the ECS team, they've recommended you should configure the X-Ray daemon using the daemon service scheduler type, as described in these docs: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html
This way, you don't need to bother with the costs & configuration associated with deploying the daemon as a standalone service behind a load balancer. It deploys one daemon per container instance automatically, which means all your tasks will be able to communicate with the daemon on their instance using localhost as intended.
I apologize for not having further documentation on the X-ray specific use case, this feature was released after the X-Ray and ECS integration so we will be working on documentation in the future.
@willarmiros that was the first path I took and it worked seamlessly. The fact is we have hundreds of microservices, and even with memory settings kept low this results in a hundred daemons. We tried, but it increased the number of EC2 instances needed per cluster.
Anyway thank you a lot for the clarifications :) Now I surely have a better understanding on how the daemon works and this issue might serve as a solution for anyone else facing the same problem.
@mauroartizzu Glad to hear! I'll close this issue then, feel free to comment again if you have further difficulties.
From Comment - https://github.com/aws/aws-xray-daemon/issues/53#issuecomment-647820079
After discussing with the ECS team, they've recommended you should configure the X-Ray daemon using the daemon service scheduler type, as described in these docs: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html
This way, you don't need to bother with the costs & configuration associated with deploying the daemon as a standalone service behind a load balancer. It deploys one daemon per container instance automatically, which means all your tasks will be able to communicate with the daemon on their instance using localhost as intended.
That method only appears to be appropriate if you are not using Fargate.
As I am using Fargate, what is the preferred strategy for multi-AZ? My thoughts go down the route of creating another service with the xray-daemon as the container, with at least 1 instance per zone, but we would still need that behind an ELB in case the node in a specific zone died.
Does the ECS team have other recommendations for X-ray daemons with Fargate containers?
Running X-Ray as a daemon on ECS seems like it would invite trouble: once the underlying EC2 host enters the DRAINING state, it will kill off the X-Ray daemon task, leaving any still-running (but exiting/stopping) tasks failing to send trace information. I briefly considered running a couple of other tasks as daemons instead of sidecars after reading this, but if they all die off immediately upon draining, that's a non-starter.
Hi @Stexxen any luck finding a solution for Fargate containers? I'm still trying to track this down myself. Thanks!
@ryanmorseglu Unfortunately not.
@willarmiros I think the ticket should be reopened on the basis that it requires AWS to revisit the solution they recommended, as it doesn't work with Fargate.
To clarify our use case, I'm trying to configure our CloudFormation YAML such that we set up 1 xray daemon instance/task in the cluster, and can have many other server instances/tasks send trace data to the single daemon. I've not found a Fargate solution yet (we're using awsvpc networking mode). I have that working in a local-docker setup on my Windows machine. The breakdown occurring on Fargate/ECS seems to be that the services cannot translate "xray-daemon:2000" into an IP, which works fine within local-docker. (That "xray-daemon:2000" is from the ENV var AWS_XRAY_DAEMON_ADDRESS.) I even tried using the private IP of the xray-daemon, as I read in the docs that it should work for task communication on Fargate, but the result did not change. For now, we're running the side-car method where every service task runs its own xray daemon that can be accessed via "localhost:2000". Thanks all!
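One way to confirm whether the name in AWS_XRAY_DAEMON_ADDRESS resolves from inside a task is a small check like the sketch below. It only handles the plain "host:port" form of the variable, not the combined "tcp:... udp:..." form, and the function names are hypothetical.

```python
import os
import socket
from typing import List, Tuple


def parse_daemon_address(value: str) -> Tuple[str, int]:
    """Split a plain "host:port" value; other forms are not handled here."""
    host, _, port = value.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError("expected host:port, got %r" % value)
    return host, int(port)


def resolve_udp(host: str, port: int) -> List[str]:
    """Return the IPv4 addresses the name resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_DGRAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    address = os.environ.get("AWS_XRAY_DAEMON_ADDRESS", "127.0.0.1:2000")
    host, port = parse_daemon_address(address)
    print(host, port, resolve_udp(host, port) or "does not resolve")
```

An empty result for "xray-daemon" would match the behaviour described above, where the container name is not resolvable outside a local Docker network. A successful lookup only shows that DNS works; it says nothing about whether UDP datagrams actually reach the daemon.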
I am running the sidecar setup (daemon on localhost, and I double-checked that the emitter is looking there correctly). My application runs with no problem, but I am not seeing any of my segments show up in the X-Ray console. What's the best way to debug this? Can I set the recorder to throw an exception if it doesn't reach the right service? What else should I be looking at?
Here's the recorder config. It's the only setup I do on the recorder, so I wonder if it's missing a piece.
(It's running in ECS Fargate; it does not send anything to X-Ray when I build and run locally either.)
```python
from aws_xray_sdk.core import xray_recorder

xray_recorder.configure(
    service='myecstask',
    sampling_rules={
        "version": 1,
        "default": {
            # Record the first 2 requests made in a second.
            "fixed_target": 2,
            # Sampling rate AFTER the first 2 requests per second.
            "rate": 0.1,
        },
    }
)
```
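A first debugging step that needs no changes to the recorder config is to turn up the SDK's own logging. The sketch below assumes the Python SDK logs under the `aws_xray_sdk` logger name, as its README describes; the wrapper function is hypothetical. Note that segments are emitted over UDP, so the recorder has no way to raise an exception when the daemon is unreachable.

```python
import logging


def enable_xray_debug_logging() -> logging.Logger:
    """Send SDK debug messages to the root handler (hypothetical helper)."""
    # Give the root logger a handler so records are actually printed.
    logging.basicConfig(level=logging.INFO)
    # Lower the threshold only for the X-Ray SDK's logger hierarchy.
    sdk_logger = logging.getLogger("aws_xray_sdk")
    sdk_logger.setLevel(logging.DEBUG)
    return sdk_logger


logger = enable_xray_debug_logging()
```

Calling this before `xray_recorder.configure(...)` should surface messages about sampling decisions and emitted segments, which helps separate "nothing was sampled" from "segments were sent but never arrived".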
The breakdown occurring on Fargate/ECS seems to be that the services cannot translate "xray-daemon:2000" into an IP, which works fine within local-docker
@ryanmorseglu That's right. This is because Fargate tasks must be run in awsvpc networking mode, which doesn't understand links like bridge networking mode does. In our docs, under "Example VPC task definition" you can find an example of setting up the X-Ray daemon in a task definition that omits the use of links. Please let me know if that satisfies your use case.
@KeynesYouDigIt I'd suggest taking a look at the example task definition I referenced above and ensuring you're not using links. Another thing to check is that your task role has write permission for the X-Ray service. If you're still having issues please open a separate GitHub issue and post:
Hi @willarmiros , thanks for the info but I had already seen that. That's actually the side-car approach--two containers in one task definition--which is what I'm using for the moment, but is not the goal / desired usage. We would prefer to run a single, isolated xray daemon task for the cluster, and then N servers utilize it, rather than 1:1 daemon-to-server-instance ratio. Cheers!
@willarmiros I second @ryanmorseglu comments and adding a side-car to every task will help AWS's bottom line, but not ours ;-)
For reference my original query is below, which aligns with what Ryan is asking.
As I am using Fargate, what is the preferred strategy for multi-AZ? My thoughts go down the route of creating another service with the xray-daemon as the container, with at least 1 instance per zone, but we would still need that behind an ELB in case the node in a specific zone died.
I think permissions were my issue. Is there a log in CloudTrail or something similar where I would see errors? It feels like it's failing very silently, but that might be my own lack of knowledge.
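For reference on the permissions angle, the sketch below shows the two write actions the daemon's task role needs in order to upload data, expressed as a policy document built in Python. The variable name is made up, and `"Resource": "*"` is an assumption to narrow as needed; the managed policy AWSXRayDaemonWriteAccess covers these actions along with the sampling-related calls.

```python
import json

# Minimal write permissions for the role the daemon runs under (a sketch).
XRAY_WRITE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "xray:PutTraceSegments",
                "xray:PutTelemetryRecords",
            ],
            "Resource": "*",
        }
    ],
}

print(json.dumps(XRAY_WRITE_POLICY, indent=2))
```

In the sidecar setup the daemon container shares the task role with the application container, so the policy has to be attached to that task role rather than to the task execution role.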
Hello,
Just like #24 I am trying to deploy a single xray daemon to serve every service in my ECS Cluster.
If I deploy the agent as a container in the service everything works fine. If I try to deploy it as a separate service (behind Load Balancer and tied to Route53) my services are unable to send segments to the daemon.
I can correctly see
when the agent is inside the service configured to have AWS_XRAY_DAEMON_ADDRESS=xray-agent:2000
But if I try to reach it outside the service as a separate microservice I only get
using AWS_XRAY_DAEMON_ADDRESS=xray.myenvironment.mydomain:2000
Consider that that host is reachable from my local machine, from inside the ec2 host and from inside the specific container. So it's not a network issue.
The task has the same IAM policy attached and the security group allows all my vpc to reach port 2000 via udp/tcp
And this is the CloudFormation template
Considering it's not outputting any error log it's difficult for me to debug this.
thanks