aws / aws-sdk-go-v2

AWS SDK for the Go programming language.
https://aws.github.io/aws-sdk-go-v2/docs/
Apache License 2.0

Support SDK metrics for go v2 AWS SDK #1744

Open DanielBauman88 opened 2 years ago

DanielBauman88 commented 2 years ago

Describe the feature

The Java feature is documented here. The functionality is described in this section.

The request is to support the same in the Go SDK, so that it is trivial to collect latency, error, and retry metrics for the calls a customer application makes to its AWS dependencies.

Use Case

I want operational metrics (latency, errors, number of calls) for all my dependencies so that I can monitor the performance of my service, dig into problems, and investigate the impact of outages.

Proposed Solution

Implement this functionality as a simple option on SDK client creation in the Go SDK v2.

Other Information

No response

AWS Go SDK V2 Module Versions Used

This is applicable to all SDKs

Go version used

This should be applicable to all Go versions

jeichenhofer commented 1 year ago

I'm also looking to integrate some metrics with the aws-sdk-go-v2 libraries, but I don't want to reinvent the wheel. Hopefully this will become an officially supported feature, but I need a solution in the meantime. Specifically, I want to record a tuple of service name, operation name, AWS region, latency, retry count, and response code for every request sent to AWS. I can envision doing this with the "middleware" API, but the only docs I can find are at https://aws.github.io/aws-sdk-go-v2/docs/middleware/, and they don't do a great job of explaining what information about the request is available. For example, would we need to record the "sent time" in the Initialize step and then check it in the Deserialize step, or is latency already a populated metadata value?

While we wait for a response from the development team about incorporating this as an SDK feature, is there any guidance on implementing something ourselves?
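One option that does not depend on what the middleware metadata exposes is to time each request at the HTTP transport level. The following is a minimal sketch using only the standard library, not an official SDK feature: `metricsTransport`, `attemptMetric`, and `runDemo` are hypothetical names, and the local test server stands in for AWS. It assumes the SDK can be given a custom HTTP client (for example through `config.WithHTTPClient`). Because it sits below the retry logic, it records one entry per attempt, and the latency covers the time until response headers arrive, not the time spent reading the body.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// attemptMetric is a hypothetical record of one HTTP round trip.
type attemptMetric struct {
	Host       string
	StatusCode int // -1 when the transport returned an error
	Latency    time.Duration
}

// metricsTransport wraps another RoundTripper and records one
// attemptMetric per round trip, that is, per retry attempt.
type metricsTransport struct {
	next http.RoundTripper

	mu       sync.Mutex
	attempts []attemptMetric
}

func (t *metricsTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.next.RoundTrip(req)

	m := attemptMetric{Host: req.URL.Host, StatusCode: -1, Latency: time.Since(start)}
	if err == nil {
		m.StatusCode = resp.StatusCode
	}

	t.mu.Lock()
	t.attempts = append(t.attempts, m)
	t.mu.Unlock()
	return resp, err
}

// snapshot returns a copy of the metrics recorded so far.
func (t *metricsTransport) snapshot() []attemptMetric {
	t.mu.Lock()
	defer t.mu.Unlock()
	return append([]attemptMetric(nil), t.attempts...)
}

// runDemo sends n requests to a local test server that always answers
// 403 and returns the metrics recorded for them.
func runDemo(n int) []attemptMetric {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusForbidden)
	}))
	defer srv.Close()

	mt := &metricsTransport{next: http.DefaultTransport}
	client := &http.Client{Transport: mt}
	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
	}
	return mt.snapshot()
}

func main() {
	for _, m := range runDemo(3) {
		fmt.Printf("host=%s status=%d latency=%s\n", m.Host, m.StatusCode, m.Latency)
	}
}
```

The trade-off is that the transport only sees the raw HTTP request, so service name, operation name, and retry count are not directly available at this layer; those would still have to come from the middleware context.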

jeichenhofer commented 1 year ago

Here's what I could come up with by stepping through the middleware stack code. It seems to work as intended, but I'd be curious to hear from people more familiar with the API.

Of course, this would need to be incorporated with some existing metrics system, replacing the ReportMetrics function with something that feeds into monitoring systems or log files. If there's a chance that the function might return an error, then I'd have to think a bit more about how to handle that.
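As an illustration of what a replacement for ReportMetrics could feed into, here is a minimal sketch of an in-memory sink, assuming metrics are aggregated per service, operation, and region. `Sample`, `Reporter`, `Aggregator`, and `Summary` are hypothetical names, not part of the SDK. The `Report` method deliberately returns no error, which sidesteps the question of how a failing reporter should affect the request.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Sample mirrors the fields of the per-request metric tuple.
type Sample struct {
	Service, Operation, Region string
	Latency                    time.Duration
	ResponseCode               int // -1 when no HTTP status is known
}

// Reporter is a hypothetical sink interface; Report cannot fail, so a
// middleware that calls it never has to decide what to do with an error.
type Reporter interface {
	Report(s Sample)
}

// key identifies one aggregation bucket.
type key struct{ Service, Operation, Region string }

// Summary holds running totals for one bucket.
type Summary struct {
	Calls, Errors int
	TotalLatency  time.Duration
}

// Aggregator is an in-memory Reporter that is safe for concurrent use.
type Aggregator struct {
	mu   sync.Mutex
	data map[key]*Summary
}

func NewAggregator() *Aggregator {
	return &Aggregator{data: make(map[key]*Summary)}
}

// Report adds one sample to its bucket. Unknown (negative) codes and
// codes of 400 and above are counted as errors.
func (a *Aggregator) Report(s Sample) {
	a.mu.Lock()
	defer a.mu.Unlock()

	k := key{s.Service, s.Operation, s.Region}
	sum, ok := a.data[k]
	if !ok {
		sum = &Summary{}
		a.data[k] = sum
	}
	sum.Calls++
	if s.ResponseCode < 0 || s.ResponseCode >= 400 {
		sum.Errors++
	}
	sum.TotalLatency += s.Latency
}

// Get returns a copy of the totals for one bucket (zero value if absent).
func (a *Aggregator) Get(service, operation, region string) Summary {
	a.mu.Lock()
	defer a.mu.Unlock()
	if sum, ok := a.data[key{service, operation, region}]; ok {
		return *sum
	}
	return Summary{}
}

func main() {
	var r Reporter = NewAggregator()
	for _, ms := range []int{10, 20, 30} {
		r.Report(Sample{
			Service: "S3", Operation: "ListBuckets", Region: "us-east-1",
			Latency: time.Duration(ms) * time.Millisecond, ResponseCode: 403,
		})
	}
	s := r.(*Aggregator).Get("S3", "ListBuckets", "us-east-1")
	fmt.Printf("calls=%d errors=%d total_latency=%s\n", s.Calls, s.Errors, s.TotalLatency)
}
```

A periodic flush of these totals to a log file or monitoring agent could be layered on top of this without changing the middleware itself.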

Also, because this is placed "after" all of the other deserializers, it will be executed per retry. That's why I left the "retry on access denied" code in there, to test out what happens when a retried operation fails. The output measures the latency of each individual retry request (by default that's three requests total). I thought replacing the smithymiddleware.After with smithymiddleware.Before would measure latency of the combined three round-trips, but that was not the case. Since I want the behavior to be per-retry, I didn't investigate further.
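A likely explanation for the Before/After observation, as far as I can tell from the SDK source, is that the retry middleware lives in the Finalize step, which wraps the entire Deserialize step. Every Deserialize middleware therefore runs once per attempt regardless of its relative position, and only a middleware registered outside the retry loop (for example in the Initialize step) would see the whole operation. The toy model below illustrates just that nesting; `handler`, `counted`, `retrying`, and `run` are hypothetical stand-ins and not smithy-go APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// handler is a stand-in for one layer of a middleware stack.
// This is a toy model, not the smithy-go middleware interface.
type handler func() error

// counted wraps next and counts how many times it is invoked.
func counted(calls *int, next handler) handler {
	return func() error {
		*calls++
		return next()
	}
}

// retrying wraps next and invokes it up to maxAttempts times,
// stopping at the first success.
func retrying(maxAttempts int, next handler) handler {
	return func() error {
		var err error
		for i := 0; i < maxAttempts; i++ {
			if err = next(); err == nil {
				return nil
			}
		}
		return err
	}
}

var errDenied = errors.New("access denied")

// run builds the chain outer -> retry -> inner -> transport, where the
// transport always fails, and returns how often each wrapper ran.
func run(maxAttempts int) (outerCalls, innerCalls int, err error) {
	transport := func() error { return errDenied }
	inner := counted(&innerCalls, transport)
	outer := counted(&outerCalls, retrying(maxAttempts, inner))
	err = outer()
	return outerCalls, innerCalls, err
}

func main() {
	outer, inner, err := run(3)
	fmt.Printf("outer=%d inner=%d err=%v\n", outer, inner, err)
}
```

In this model the wrapper outside the retry loop runs once while the wrapper inside it runs once per attempt, which matches the per-retry metrics described above.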

Here is the working code to test this out. Just replace the AKID and SKEY constants with IAM User credentials with no access, and you'll see the metrics spit out from the three requests with a 403 response code.

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    sdkmiddleware "github.com/aws/aws-sdk-go-v2/aws/middleware"
    "github.com/aws/aws-sdk-go-v2/aws/retry"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/credentials"
    "github.com/aws/aws-sdk-go-v2/service/s3"
    smithymiddleware "github.com/aws/smithy-go/middleware"
    "github.com/aws/smithy-go/transport/http"
)

const (
    AKID = "akid_here"
    SKEY = "secret_access_key_here"
    SESH = ""
)

type RequestMetricTuple struct {
    ServiceName   string
    OperationName string
    Region        string
    LatencyMS     int64
    ResponseCode  int
}

func ReportMetrics(metrics *RequestMetricTuple) {
    fmt.Printf("metrics: %+v\n", metrics)
}

// reportMetricsMiddleware returns a Deserialize-step middleware that times
// the handlers below it and reports one metric tuple per attempt.
func reportMetricsMiddleware() smithymiddleware.DeserializeMiddleware {
    return smithymiddleware.DeserializeMiddlewareFunc("ReportRequestMetrics", func(
        ctx context.Context, in smithymiddleware.DeserializeInput, next smithymiddleware.DeserializeHandler,
    ) (
        out smithymiddleware.DeserializeOutput, metadata smithymiddleware.Metadata, err error,
    ) {
        requestMadeTime := time.Now()
        out, metadata, err = next.HandleDeserialize(ctx, in)
        // Measure immediately so the bookkeeping below is not counted.
        latency := time.Since(requestMadeTime)
        if err != nil {
            // No metric is reported when the handlers below return an error.
            return out, metadata, err
        }

        // -1 marks a raw response that is not a smithy-go HTTP response.
        responseStatusCode := -1
        if resp, ok := out.RawResponse.(*http.Response); ok {
            responseStatusCode = resp.StatusCode
        }

        ReportMetrics(&RequestMetricTuple{
            ServiceName:   sdkmiddleware.GetServiceID(ctx),
            OperationName: sdkmiddleware.GetOperationName(ctx),
            Region:        sdkmiddleware.GetRegion(ctx),
            LatencyMS:     latency.Milliseconds(),
            ResponseCode:  responseStatusCode,
        })

        return out, metadata, nil
    })
}

func getDefaultConfig(ctx context.Context) (*aws.Config, error) {
    cfg, err := config.LoadDefaultConfig(
        ctx,
        config.WithCredentialsProvider(credentials.NewStaticCredentialsProvider(AKID, SKEY, SESH)),
        config.WithRetryer(
            func() aws.Retryer {
                return retry.AddWithErrorCodes(retry.NewStandard(), "AccessDenied")
            },
        ),
    )
    if err != nil {
        return nil, err
    }

    cfg.APIOptions = append(cfg.APIOptions, func(stack *smithymiddleware.Stack) error {
        return stack.Deserialize.Add(reportMetricsMiddleware(), smithymiddleware.After)
    })

    return &cfg, nil
}

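func doStuff(ctx context.Context, client *s3.Client) {
    listBucketResults, err := client.ListBuckets(ctx, &s3.ListBucketsInput{})
    if err != nil {
        // With no-access credentials this is the expected path; print the
        // error instead of panicking so the loop in main keeps running.
        fmt.Printf("list buckets failed: %v\n", err)
        return
    }
    fmt.Printf("num_buckets: %d\n", len(listBucketResults.Buckets))
}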

func main() {
    ctx := context.Background()
    cfg, err := getDefaultConfig(ctx)
    if err != nil {
        panic(err)
    }

    client := s3.NewFromConfig(*cfg)

    for {
        doStuff(ctx, client)
        time.Sleep(time.Second * 2)
    }
}

lucix-aws commented 7 months ago

related: #1142

We intend to implement this in terms of https://github.com/aws/smithy-go/issues/470. The internal spec for this component of the Smithy client reference architecture is being finalized.

Please upvote this issue if this functionality is important to you as an SDK user.