Closed adityachandla closed 11 months ago
Hi @adityachandla ,
Thanks for opening this issue and for the detailed analysis. Are you seeing any explicit server errors after the 10,000-request mark? I would try enabling the logger to see if any obvious patterns emerge from the raw requests/responses:
cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion("eu-west-1"), config.WithClientLogMode(aws.LogRequestWithBody|aws.LogResponseWithBody))
I assume that 10,000 invocations will produce a lot of logs, so I would be mindful of any financial implications if you are storing logs from the application's context.
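For example, a lower-volume variant (illustrative only, not something required here) logs request/response headers and retry attempts but skips the bodies:

package main

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
)

func main() {
    // Log request/response headers and retry attempts, but not bodies,
    // to keep log output manageable across thousands of invocations.
    cfg, err := config.LoadDefaultConfig(context.TODO(),
        config.WithRegion("eu-west-1"),
        config.WithClientLogMode(aws.LogRequest|aws.LogResponse|aws.LogRetries),
    )
    if err != nil {
        log.Fatal(err)
    }
    _ = cfg // pass cfg to your service client constructor as usual
}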
As it stands, while interesting, the data is not really actionable for now. We need to understand where the bulk of the delay is happening, distinguishing between the time spent in the overall operation and the time spent on the HTTP call itself.
Here is an example of a probe @lucix-aws put together to help us better understand the nature of the delay. Can you integrate this example into your use case and share the result with us?
package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "sync"
    "time"

    awsmiddleware "github.com/aws/aws-sdk-go-v2/aws/middleware"
    awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/sqs"
    "github.com/aws/smithy-go/middleware"
    smithyrand "github.com/aws/smithy-go/rand"
)

// WithOperationTiming instruments an SQS client to dump timing information for
// the following spans:
//   - overall operation time
//   - HTTPClient call time
//
// This instrumentation will also emit the request ID, service name, and
// operation name for each invocation.
//
// Accepts a message "handler" which is invoked with formatted messages to be
// handled externally, you can use the declared PrintfMSGHandler to simply dump
// these values to stdout.
func WithOperationTiming(msgHandler func(string)) func(*sqs.Options) {
    return func(o *sqs.Options) {
        o.APIOptions = append(o.APIOptions, addTimingMiddlewares(msgHandler))
        o.HTTPClient = &timedHTTPClient{
            client:     awshttp.NewBuildableClient(),
            msgHandler: msgHandler,
        }
    }
}

// PrintfMSGHandler writes messages to stdout.
func PrintfMSGHandler(msg string) {
    fmt.Printf("%s\n", msg)
}

type invokeIDKey struct{}

func setInvokeID(ctx context.Context, id string) context.Context {
    return middleware.WithStackValue(ctx, invokeIDKey{}, id)
}

func getInvokeID(ctx context.Context) string {
    id, _ := middleware.GetStackValue(ctx, invokeIDKey{}).(string)
    return id
}

func timeSpan(ctx context.Context, name string, consumer func(string)) func() {
    start := time.Now()
    return func() {
        elapsed := time.Now().Sub(start)
        consumer(fmt.Sprintf("[%s] %s: %s", getInvokeID(ctx), name, elapsed))
    }
}

type timedHTTPClient struct {
    client     *awshttp.BuildableClient
    msgHandler func(string)
}

func (c *timedHTTPClient) Do(r *http.Request) (*http.Response, error) {
    defer timeSpan(r.Context(), "http", c.msgHandler)()

    resp, err := c.client.Do(r)
    if err != nil {
        return nil, fmt.Errorf("inner client do: %v", err)
    }
    return resp, nil
}

type addInvokeIDMiddleware struct {
    msgHandler func(string)
}

func (*addInvokeIDMiddleware) ID() string { return "addInvokeID" }

func (*addInvokeIDMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
    out middleware.InitializeOutput, md middleware.Metadata, err error,
) {
    id, err := smithyrand.NewUUID(smithyrand.Reader).GetUUID()
    if err != nil {
        return out, md, fmt.Errorf("new uuid: %v", err)
    }
    return next.HandleInitialize(setInvokeID(ctx, id), in)
}

type timeOperationMiddleware struct {
    msgHandler func(string)
}

func (*timeOperationMiddleware) ID() string { return "timeOperation" }

func (m *timeOperationMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
    middleware.InitializeOutput, middleware.Metadata, error,
) {
    defer timeSpan(ctx, "operation", m.msgHandler)()
    return next.HandleInitialize(ctx, in)
}

type emitMetadataMiddleware struct {
    msgHandler func(string)
}

func (*emitMetadataMiddleware) ID() string { return "emitMetadata" }

func (m *emitMetadataMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
    middleware.InitializeOutput, middleware.Metadata, error,
) {
    out, md, err := next.HandleInitialize(ctx, in)

    invokeID := getInvokeID(ctx)
    requestID, _ := awsmiddleware.GetRequestIDMetadata(md)
    service := awsmiddleware.GetServiceID(ctx)
    operation := awsmiddleware.GetOperationName(ctx)
    m.msgHandler(fmt.Sprintf(`[%s] requestID = "%s"`, invokeID, requestID))
    m.msgHandler(fmt.Sprintf(`[%s] service = "%s"`, invokeID, service))
    m.msgHandler(fmt.Sprintf(`[%s] operation = "%s"`, invokeID, operation))

    return out, md, err
}

func addTimingMiddlewares(mh func(string)) func(*middleware.Stack) error {
    return func(s *middleware.Stack) error {
        if err := s.Initialize.Add(&timeOperationMiddleware{msgHandler: mh}, middleware.Before); err != nil {
            return fmt.Errorf("add time operation middleware: %v", err)
        }
        if err := s.Initialize.Add(&addInvokeIDMiddleware{msgHandler: mh}, middleware.Before); err != nil {
            return fmt.Errorf("add invoke id middleware: %v", err)
        }
        if err := s.Initialize.Insert(&emitMetadataMiddleware{msgHandler: mh}, "RegisterServiceMetadata", middleware.After); err != nil {
            return fmt.Errorf("add emit metadata middleware: %v", err)
        }
        return nil
    }
}
func main() {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
        log.Fatal(fmt.Errorf("load default config: %v", err))
    }

    svc := sqs.NewFromConfig(cfg, WithOperationTiming(PrintfMSGHandler))

    var wg sync.WaitGroup
    for i := 0; i < 6; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Use a locally scoped err to avoid a data race on the outer err.
            if _, err := svc.ListQueues(context.Background(), nil); err != nil {
                fmt.Println(fmt.Errorf("list queues: %v", err))
            }
        }()
    }
    wg.Wait()
}
Thank you very much, Ran~
First of all, thanks for your reply @RanVaknin.
Regarding your first point about enabling request logging: there are S3 metrics that count the number of 4xx and 5xx responses, and I had already checked them. There were no 4xx or 5xx responses returned by S3, so I don't think there is much point in printing and analyzing the raw request/response logs.
I integrated the second part, and this is the file generated from those logs (I removed the log lines for service name and request type, as those are always S3 and GetObjectRequest): c7gn.2xlarge_512Kb_1000_x20.txt. I wrote the following Python script to analyze the data; the difference between the operation time and the HTTP time is always less than 2 milliseconds.
import sys
import re
from dataclasses import dataclass


@dataclass
class RequestInfo:
    operation: float
    http: float


def main():
    with open(sys.argv[1], "r") as f:
        lines = f.read().strip().split("\n")
    lines = [l for l in lines if "=" not in l]  # Filter request id lines
    matcher = re.compile(r"\[(.*)\] ([a-z]*): ([0-9.]*)(m?)s")
    values = []
    for l in lines:
        values.append(matcher.findall(l)[0])
    request_info = dict()
    for v in values:
        if v[0] in request_info:
            curr = request_info[v[0]]
        else:
            curr = RequestInfo(0, 0)
        val = float(v[2])
        if v[3] != 'm':
            val *= 1000  # Value was in seconds; convert to milliseconds
        if v[1] == "http":
            curr.http = val
        else:
            curr.operation = val
        request_info[v[0]] = curr
    for k, v in request_info.items():
        delay = v.operation - v.http
        if delay > 1:
            print(v)


if __name__ == "__main__":
    main()
However, I still see the same behaviour as before, where request latency increases significantly after 10k requests: [latency graph] Also, the total request latency is still around 20-30 milliseconds according to the CloudWatch metrics.
I don't think it should take more than a couple hundred milliseconds for a request to go from S3 to EC2 in the same region. In terms of network capacity: if we assume that all 20 goroutines are fetching 512 KB of data every 20 milliseconds (the S3 latency), that is 50 requests per second per goroutine, so the total data fetched per second is 512 KB x 20 x 50, or around 500 MB. The bandwidth utilized would therefore be around 4 Gbps, which is not even 10% of the bandwidth of a c7gn.2xlarge.
Any suggestions on how I should proceed here?
"the difference between the operation time and http time is always less than 2 milliseconds."
This is ultimately the measurement that matters, and the difference there is what I'd expect - there's no indication of SDK misbehavior.
Network throughput is not the only factor at this level of concurrency. There are a number of additional variables here, including, but not limited to, http.Transport or net.Dialer settings (you can configure these through the HTTPClient field on service client Options).
The most important thing to understand here is that the entire HTTP operation is delegated to what is ultimately an instance of the standard library net/http client, the behavior of which is outside of our control. There's no queueing, throttling, or pooling logic of any kind in front of that delegation - if you call GetObject 10,000 times in a second, we will blindly serialize 10,000 corresponding *http.Requests and pass them to http.Client.Do. So from our perspective, if we're not spending an inordinate amount of time up to that point of delegation (e.g. it's not taking us, say, 1 full second to go from you calling GetObject to us sending the request), we're operating exactly as we should.
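For illustration, here is a minimal sketch (the specific values are placeholders, not recommendations) of how http.Transport and net.Dialer settings can be supplied through the HTTPClient field on the service client Options, using the SDK's BuildableClient:

package main

import (
    "context"
    "log"
    "net"
    "net/http"
    "time"

    awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
        log.Fatal(err)
    }

    // Tune the underlying http.Transport and net.Dialer for many concurrent
    // requests to a single host. The values below are placeholders.
    httpClient := awshttp.NewBuildableClient().
        WithTransportOptions(func(t *http.Transport) {
            t.MaxIdleConns = 100
            t.MaxIdleConnsPerHost = 100 // allow idle connections to be reused instead of redialed
        }).
        WithDialerOptions(func(d *net.Dialer) {
            d.KeepAlive = 30 * time.Second
        })

    svc := s3.NewFromConfig(cfg, func(o *s3.Options) {
        o.HTTPClient = httpClient
    })
    _ = svc
}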
At this time I'm inclined to close this issue, since the data more or less exonerates the SDK. If you wish to dive deeper into the networking internals, your best course of action is going to be to wire up an httptrace to get a more detailed timing breakdown.
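As a starting point, here is a minimal sketch (illustrative only) of wiring up an httptrace.ClientTrace by wrapping the HTTPClient, following the same pattern as the timedHTTPClient in the probe above:

package main

import (
    "context"
    "crypto/tls"
    "log"
    "net/http"
    "net/http/httptrace"
    "time"

    awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

// tracedHTTPClient attaches an httptrace.ClientTrace to every outgoing request
// so that DNS, connection reuse, TLS, and time-to-first-byte timings can be
// observed underneath the SDK.
type tracedHTTPClient struct {
    inner *awshttp.BuildableClient
}

func (c *tracedHTTPClient) Do(r *http.Request) (*http.Response, error) {
    start := time.Now()
    trace := &httptrace.ClientTrace{
        GotConn: func(info httptrace.GotConnInfo) {
            log.Printf("got conn (reused=%v, wasIdle=%v) after %s", info.Reused, info.WasIdle, time.Since(start))
        },
        DNSDone: func(httptrace.DNSDoneInfo) {
            log.Printf("dns done after %s", time.Since(start))
        },
        TLSHandshakeDone: func(tls.ConnectionState, error) {
            log.Printf("tls handshake done after %s", time.Since(start))
        },
        GotFirstResponseByte: func() {
            log.Printf("first response byte after %s", time.Since(start))
        },
    }
    return c.inner.Do(r.WithContext(httptrace.WithClientTrace(r.Context(), trace)))
}

func main() {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
        log.Fatal(err)
    }

    // Install the traced client the same way the timing probe installs its
    // timedHTTPClient.
    svc := s3.NewFromConfig(cfg, func(o *s3.Options) {
        o.HTTPClient = &tracedHTTPClient{inner: awshttp.NewBuildableClient()}
    })

    // Illustrative call to exercise the traced client.
    if _, err := svc.ListBuckets(context.Background(), &s3.ListBucketsInput{}); err != nil {
        log.Println(err)
    }
}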
Thanks for your detailed response @lucix-aws. I will try to dig deeper into this issue using httptrace. I agree with you that if the SDK simply passes requests through to the HTTPClient, there is likely no issue here. Thanks a lot for helping out :bowing_man:
Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.
Describe the bug
I am performing a benchmark in which I fetch a byte range of an S3 object. The benchmark takes the number of goroutines that will concurrently fetch bytes, the size of the byte range, and the number of runs that each goroutine will perform.
This benchmark was performed on a c7gn.2xlarge instance in Ireland and the S3 bucket is also present in the same region.
Expected Behavior
When I run the benchmark, I expect to see latency similar to the latency reported by CloudWatch's S3 request metrics. This is indeed what I get when I fetch 512 bytes with 20 goroutines, 1000 times each. This is the graph for the observed latency: [latency graph] And this is the total request latency in CloudWatch: [CloudWatch graph]
The request latency that I observe is higher, but not by much, and this is fine. However, something weird happens when I change the fetch range from 512 bytes to 512 kilobytes.
Current Behavior
When I run the same benchmark with 512 kilobytes and 20 goroutines, each fetching the byte range 1000 times, I see the following graph: [latency graph] As you can see, after around 10k requests the latency increases drastically for some requests. However, according to the CloudWatch metrics, it was more or less constant: [CloudWatch graph] I have repeated this benchmark multiple times, and at exactly 10k requests I start to see this clustering of requests into higher-latency buckets, as can be seen in the graph.
Only two things can happen between the start and end times: garbage collection and execution of SDK code. The former does not appear to be the problem, as I have run this benchmark with GOGC=500 and GOMEMLIMIT=4096MiB and got almost the same result. This leads me to believe that there is something happening in the SDK that is causing this behaviour.
Reproduction Steps
I am running the code on a c7gn.2xlarge instance, against a 128MB file in an S3 bucket in the same region as the EC2 instance.
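A minimal sketch of the kind of benchmark described here (illustrative only; the bucket name, key, and fixed byte range are placeholders, not the actual benchmark code):

package main

import (
    "context"
    "fmt"
    "io"
    "log"
    "sync"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
    const (
        goroutines = 20
        runs       = 1000
        rangeSize  = 512 * 1024 // 512 KB per request
    )

    cfg, err := config.LoadDefaultConfig(context.Background(), config.WithRegion("eu-west-1"))
    if err != nil {
        log.Fatal(err)
    }
    svc := s3.NewFromConfig(cfg)

    var wg sync.WaitGroup
    for g := 0; g < goroutines; g++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < runs; i++ {
                start := time.Now()
                out, err := svc.GetObject(context.Background(), &s3.GetObjectInput{
                    Bucket: aws.String("my-benchmark-bucket"), // placeholder
                    Key:    aws.String("128MB-test-object"),   // placeholder
                    Range:  aws.String(fmt.Sprintf("bytes=0-%d", rangeSize-1)),
                })
                if err != nil {
                    log.Println(err)
                    continue
                }
                // Drain and close the body so the connection can be reused.
                _, _ = io.Copy(io.Discard, out.Body)
                out.Body.Close()
                fmt.Println(time.Since(start))
            }
        }()
    }
    wg.Wait()
}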
Possible Solution
No response
Additional Information/Context
I have already checked the following things:
AWS Go SDK V2 Module Versions Used
Compiler and Version used
go version go1.20.4 linux/amd64
Operating System and version
ubuntu 23.04