aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[Fargate] [request]: offer high-performance network options #715

Open lifeofguenter opened 4 years ago

lifeofguenter commented 4 years ago

Tell us about your request It would be great if it were possible to opt in to an explicit high-performance network option.

Which service(s) is this request for? Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Certain services require high network throughput to internal services (S3/ElastiCache/etc.); currently it does not seem to be possible to enforce a high network capacity.

Are you currently working around this issue? No current solution known to us.

Additional context

Several other reported feature requests are in a similar vein, as they also demand more performance out of Fargate.

Additionally, the following blog post goes so far as to actually measure the networking performance: https://stormforger.com/blog/aws-fargate-network-performance/
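The gist of such a measurement can be reproduced without a load-testing vendor. Below is a minimal sketch of a throughput probe one could run inside a Fargate task, assuming a large test object (several GB, so the transfer outlasts any burst window) already exists in S3; the bucket and key names are placeholders, not anything from this thread.

```python
import time
import boto3

def measure_download_gbps(bucket: str, key: str) -> float:
    """Stream one large S3 object and report average throughput in Gbps."""
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    start = time.monotonic()
    while body.read(1024 * 1024):  # read and discard 1 MiB chunks
        pass
    elapsed = time.monotonic() - start
    return size * 8 / elapsed / 1e9  # bytes -> bits -> Gbps

if __name__ == "__main__":
    # placeholder names; point these at your own test object
    print(f"{measure_download_gbps('my-test-bucket', 'large-test-object'):.2f} Gbps")
```

Logging the per-minute rate instead of a single average would also expose the burst-then-baseline drop-off described later in this thread.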

prameshbajra commented 4 years ago

A much-needed feature for Fargate instances.

pawel-przybyla commented 3 years ago

This option is required to run time-critical, high-RPS services using the Fargate launch type. We have an internal .NET Core service. The C# code (including the call to PostgreSQL) executes in 2-3 ms (95th percentile). For almost two years it ran successfully; the average Target Response Time metric of the ALB was ~4-5 ms. This week, after a new deployment, that metric increased to ~40-50 ms, while the service's C# code still executes in ~3 ms. We deployed a few instances in two Regions and the behavior is consistent: the ALB Target Response Time metric is ~40 ms. Based on CloudWatch metrics, we believe CPU and memory are not the issue. Our assumption is that the currently selected host has worse network performance.

ravishtiwari commented 2 years ago

It would be really nice to have this feature, with users able to choose or customize different network options. It would make Fargate suitable for additional workloads that can't consistently run on it at the moment. :+1:

KeenanLawrence commented 2 years ago

Any feedback on this? Would definitely strengthen Fargate's stance vs EC2 if it had a high-performance network option.

zachcasper commented 2 years ago

Hey folks, Zach here from the Fargate PM team. I'm working on Fargate's direction around performance. I'm very interested in knowing a few more details here. A few questions:

  1. What are some use cases for a high-performance networking option? What types of workloads specifically?
  2. Do these workloads need increased bandwidth or decreased latency?
  3. Would a more consistent network bandwidth at the current performance level solve this?
  4. Or no, and you need more than today—what minimum/baseline bandwidth would you need in Gbps?

Thanks!

lifeofguenter commented 2 years ago

Hi @zachcasper - thanks for coming by :)

What are some use cases for a high-performance networking option? What types of workloads specifically?

S3 uploads: improve transfer speeds. By design I don't think you can scale out (within a single request), so the task handling the upload should be able to upload as fast as possible.

DynamoDB, Memcached/Redis: usually this can be scaled out, but with high-performing microservices it can become more difficult to scale out when bandwidth, not CPU, is the limiting factor.

Do these workloads need increased bandwidth or decreased latency?

In our experience latency was not an issue, but available/consistent bandwidth was.

Would a more consistent network bandwidth at the current performance level solve this?

Maybe, but then the baseline should at least be on par with t3/t4g.

brignolij commented 2 years ago

Hi, thanks for the questions; here are some suggestions from my side.

In our case, we have network issues with our .NET 6 tasks running on the arm64 runtime on ECS Fargate.

  1. My team and I are facing issues when a container is hit with a burst of requests that require our application to call another service. In our case, for each of these requests, our application makes a call to Redis (ElastiCache), MSSQL (EC2), or an external API. We have small containers (1024 CPU units, 3 GB RAM). CPU and RAM stay below 30%, but around 5% of our calls to external services (Redis, MSSQL, API) receive a timeout / HTTP 0. From my understanding, the network controller is saturated; a Fargate task option allowing more network connections would be great. Also, we have no idea how network performance scales with the CPU & RAM configuration.

Regards

jnicholls commented 2 years ago

Thanks @zachcasper for giving this request some attention. I would like to parrot the answers from @lifeofguenter, namely that my team's use cases are similarly challenged by a lack of consistent throughput, low throughput capacity, and not really knowing what that capacity is in practice, with a lot of the same services in mind (S3 TX/RX, DynamoDB, Redis, etc.).

I too have leaned on this experiment to understand what to expect out of Fargate. For I/O bound workloads, it would help our Fargate capacity planning to at least know what the current network capacity is at various vCPU/Mem combinations. From there, it would be great to perhaps explore allowing network performance to be a separate task capacity dimension.

To expand upon @lifeofguenter's S3 upload use case: it was mentioned that it was not possible to scale-out a request. However, it is in fact possible to scale out an object upload across available CPUs using S3's multipart upload support. The same can be accomplished for downloads with S3's Range support. One particular use case I have is reading in a lot of S3 objects, and concatenating them together into larger, fewer S3 objects. This process is strictly I/O bound, and I am having to instantiate enough Fargate tasks to accomplish the necessary I/O volume needed, leaving a lot of CPU on the table. The idea here would be to enable high-throughput networking in order to move a Fargate task from being I/O bound to being CPU bound, if I/O is its primary dimension. Right now, we're considering making more use of these tasks to fill the gap in CPU usage, but architecturally I see value in having deterministic network capacity as an option for planning Fargate capacity for various workloads.
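Since the single-request scale-out question came up, here is a minimal sketch of the Range-based fan-out described above, assuming nothing beyond boto3; the bucket, key, and tuning numbers are illustrative placeholders, not anyone's production values.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import boto3

s3 = boto3.client("s3")  # boto3 clients are thread-safe and shareable

def fetch_range(bucket: str, key: str, start: int, end: int):
    # Each worker pulls one byte range over its own HTTPS connection.
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return start, resp["Body"].read()

def parallel_download(bucket: str, key: str,
                      part_size: int = 8 * 1024 * 1024,
                      workers: int = 16) -> bytes:
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    ranges = [(off, min(off + part_size, size) - 1)
              for off in range(0, size, part_size)]
    buf = bytearray(size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_range, bucket, key, s, e)
                   for s, e in ranges]
        for fut in as_completed(futures):
            start, data = fut.result()
            buf[start:start + len(data)] = data
    return bytes(buf)
```

For the upload side, boto3's managed transfers (`upload_file` with a `TransferConfig(max_concurrency=...)`) already do the multipart fan-out automatically. Of course, all of this only shifts the bottleneck to the task's aggregate network cap, which is exactly the limit this issue asks to raise.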

TreeKat71 commented 2 years ago

One of the reasons I would like better network performance relates to another issue: Fargate cannot cache images, so better network performance would definitely save some money.

fbuinosquy1985 commented 2 years ago

Same as @brignolij, we have a lot of .NET services that consume RabbitMQ, Redis, etc.

brignolij commented 2 years ago

@fbuinosquy1985 we solved a lot of issues through application code optimization, specifically of our HTTP usage. Check https://learn.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-7.0 or https://restsharp.dev/v107/#restclient-and-options. In a word: make your network clients (HTTP, Redis) singletons; it avoids having too many network connections open. In our case, this fixed all our issues.
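For readers not on .NET, here is a rough Python analogue of the same connection-reuse idea (a hedged sketch, not anyone's production code); the hostnames and the upstream API are placeholders. One pooled client per process, created once, instead of a new client (and a new TCP handshake) per request.

```python
import redis
import requests

# Created once at import time and reused by every request handler.
REDIS = redis.Redis(host="my-cache.internal", port=6379)  # internal pool
HTTP = requests.Session()  # persistent keep-alive connection pool

def handle_request(user_id: str) -> str:
    key = f"user:{user_id}"
    cached = REDIS.get(key)  # reuses a pooled connection
    if cached is not None:
        return cached.decode()
    # hypothetical upstream API, for illustration only
    resp = HTTP.get(f"https://api.example.com/users/{user_id}")
    resp.raise_for_status()
    REDIS.set(key, resp.text, ex=300)  # cache for five minutes
    return resp.text
```

Instantiating these per request is what leads to the connection explosion (and eventual timeouts) described in the earlier comments.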

alexey-panin commented 1 year ago

Our system is a Charging Station Central System (CSMS). Its intended use is to accommodate thousands of charging station connections (websocket connections) and to serve as a medium between the end-user mobile app and physical charging stations, so that users can start charging, stop charging, etc.

Each charging station, even without end-user interaction, sends data to the CSMS in the form of heartbeats and meter values: small chunks, but quite often. Now add user interaction and requests from the CSMS to other AWS services (RDS, DynamoDB, Timestream, Redis), and we get a strong need for good, predictable network throughput.

The CSMS runs on Fargate tasks, and when running stress tests, while CPU and RAM look fine, the system starts choking upon reaching a certain number of connections, leading to websocket connections timing out. Of course, network throughput is not the only factor affecting system performance, and we have identified other bottlenecks and code optimization needs, but it is clearly one of those factors, and we would like more visibility into it and the ability to choose it for our Fargate tasks, as we can with EC2 instances. Such awareness would also allow us to roughly calculate how many charging station connections one CSMS task can handle, giving us a baseline for scale-out activities.

So yes, this feature is strongly supported and desired on our end.

axrj commented 1 year ago

Hi @zachcasper. It'd be amazing to have this feature as it solves a bunch of issues for us.

What are some use cases for a high-performance networking option? What types of workloads specifically?

We have an application that downloads a large dataset (~120 GB) at task startup to create a local index. Even with the biggest Fargate task size (16 vCPU / 100 GB), it takes around 14 minutes for a single task to start, because network bandwidth drops to baseline once the burst credits are exhausted. The maximum I noticed was ~2.5 Gbps in the first minute, which quickly drops to ~1 Gbps from the next minute onward, slowing everything down. Higher network bandwidth would let the data download, and the task start, much quicker. EFS is not an option, as it negatively affects application latency for real-time requests.

Do these workloads need increased bandwidth or decreased latency?

Sustained higher bandwidth will definitely help this kind of application start up faster. Lower latency would also improve our application response times, as there is a lot of service-to-service communication involved.

Or no, and you need more than today—what minimum/baseline bandwidth would you need in Gbps?

It would be great if Fargate tasks could pick a higher network option. If possible, 5-10 Gbps links would help a great deal.

gshpychka commented 1 year ago

@zachcasper we also use EC2 after evaluating Fargate and finding that networking performance was lower. In our case, we needed the lowest possible latency with tiny bandwidth, and EC2 (specifically, "host" networking) gave us around 20% less latency than Fargate, even at the smallest instance sizes. If I remember correctly, using the "awsvpc" network mode on EC2 gave us similar results to Fargate.

bill-poole commented 1 year ago

@zachcasper, we also use EC2 after investigating EKS on Fargate and finding that the network bandwidth available between pods is much lower than between pods in EKS hosted on EC2.

What are some use cases for a high-performance networking option? What types of workloads specifically?

We need the full network bandwidth that EC2 nodes provide, including the burst network bandwidth that is made available for unpredictable spikes in demand. On EC2, our nodes are usually CPU bound, but during certain load spikes, they can become temporarily network bound (for a few seconds), unless using the high-performance 100 Gbps instance types. We have pods individually exchanging over 500 MB/s with other pods and over 500 MB/s with DynamoDB and S3 for several seconds at a time when demand spikes.

Do these workloads need increased bandwidth or decreased latency?

Increased bandwidth.

Would a more consistent network bandwidth at the current performance level solve this? Or no, and you need more than today—what minimum/baseline bandwidth would you need in Gbps?

We would need the Fargate EKS network bandwidth to match what is available on EC2 (10 Gbps burst network performance), with the option of going to instance types supporting 100 Gbps.

lannyzhanggit commented 1 year ago

  1. What are some use cases for a high-performance networking option? What types of workloads specifically?

Our application uses Hazelcast clustering between container pods/tasks to share cache and files, so we need a high-speed network between the nodes.

  2. Do these workloads need increased bandwidth or decreased latency? We need increased bandwidth.

  3. Would a more consistent network bandwidth at the current performance level solve this? No. Our performance testing shows that, when running in Fargate, we hit a bottleneck once we reach a certain number of tasks; beyond that, no CPU/RAM combination or task count increases performance. We use m5.18xlarge instances in production for this purpose.

  4. Or no, and you need more than today—what minimum/baseline bandwidth would you need in Gbps? We cannot test the baseline, as Fargate currently doesn't expose one, but similar to EC2, 10 Gbps is probably a starting point.

joshuahiggins commented 1 year ago

I found this issue through the same article referenced above. We've been chasing a series of random SIGKILL responses on one of our Fargate instances during periods of high output.

Our code is running on a 1 vCPU / 2 GB memory task that pushes data to various AWS services, and the code is designed to scale a queue, batch sizing, etc. up and down based on current load. We effectively have six outbound pipelines to Kinesis, Firehose, and DynamoDB. While we have dialed in all the individual services to stay within publish limits, made sure shards are sized appropriately, sized our batches appropriately, and confirmed that the code is not hitting CPU/memory limits, we still see occasional periods when a SIGKILL ends the process without any other error. It seems to be shut down at a higher level, outside of our control, and is always accompanied by a high burst load of 100+ active outgoing requests.

We had a theory that we were hitting an undocumented network limit but until finding that blog post and this issue, I wasn't sure if we could concretely point at network limits.


Editing to answer some of the questions above...

Would a more consistent network bandwidth at the current performance level solve this? Or no, and you need more than today—what minimum/baseline bandwidth would you need in Gbps?

In our use case, a documented and trusted bandwidth limit that scales up in a transparent way based on other resources would be a huge benefit. If we need to up our CPU to reach higher network limits, that's better than the current black box of handling this scaling problem.

Agreed with what was said above... matching ECS on EC2 should be a minimum requirement. We wouldn't need 100 Gbps, but we certainly want to know the real limits so that we can trust our bandwidth to be there when we need it.

Samrose-Ahmed commented 1 year ago

Adding our thoughts, we would be very interested in better networking for Fargate.

  1. Our workload is data compaction of S3 files, which is highly network-bound. We see dropped connections from S3 when doing multipart uploads unless we increase CPU.
  2. Increased bandwidth.
  3. Knowing the limits would be appreciated, but the better solution would be more bandwidth.
  4. Similar to EC2 would be good.

fideloper commented 8 months ago

To my memory, one of Firecracker's limitations is actually bandwidth (hinted at here, perhaps: https://www.usenix.org/system/files/nsdi20-paper-agache.pdf). Firecracker powers both Lambda and Fargate.

It's a different technology than what's powering EC2. I wonder what the Fargate team can do there!

billnbell2 commented 7 months ago

+1, we need higher network bandwidth. Caching ECR images would also be a great feature so we can scale up faster; right now it takes 1-3 minutes depending on the size of the image in ECR. With managed nodes we can cache the images there for scaling speed (down to a second or two). See https://aws.amazon.com/blogs/containers/start-pods-faster-by-prefetching-images/

But just to be clear: paying extra to set the pod to 2 vCPU or higher to increase network performance is a huge cost when we only need 0.5 vCPU for each pod, since with Fargate on EKS, 1 pod = 1 node. Plus, we don't think performance really increases with Fargate on EKS even when setting 4 or 8 vCPUs.

Also during a "Super Bowl" event, we need as much network performance as we can get. It also appears the EKS Fargate underlying network system (Firecracker ?) is somehow limiting the performance - where I did not see this network performance limitation in ECS running Fargate.