AWS SDK CurlHttp requests

sglnd commented 6 months ago

Describe the bug

We are using CurveFS to mount object storage as local files, and CurveFS relies on AWS SDK for C++ implementation. When we were making a change to an object storage bucket, we first disabled the bucket mounted by CurveFS, and then deleted it. After completing the bucket deletion, AWS SDK for C++ started generating a large number of CurlHttp requests. Upon reviewing the logs and code printed by the AWS SDK for C++ client, we found that after disabling the bucket, all returned httpreturncodes were 403. However, we are puzzled about the increase in the number of CurlHttp requests. As a result of this increase in CurlHttp requests, and due to our DNS configuration issue, DNS issues were also triggered. Therefore, we would like to identify the root cause of this problem. please help ～

Expected Behavior

Help explain the reason, and it's not certain if this is a bug, but it feels like it could be optimized. The highlighted area in the graph represents the situation after the bucket has been deleted.

Current Behavior

WechatIMG181 WechatIMG182

The highlighted area in the graph represents the situation after the bucket has been deleted.

Reproduction Steps

I reviewed some of the code and it seems that after the bucket is deleted, although the HTTP status code still returns 403, there are some differences in the internal returns, causing a change in the logic of the AWS client. However, I cannot confirm this.

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

not sure

Compiler and Version used

not sure

Operating System and version

suse sp15

jmklix commented 6 months ago

This looks like it might be a problem with CurveFS and trying to make sdk calls after the bucket gets deleted. Please open an issue with that product if you would like help with that. If you can reproduce this error and provide sdk logs then we could try to debug it, but currently there isn't enough information to do that. Please let me know if you have any aws-sdk-cpp specific questions that I can help you with.

sglnd commented 6 months ago

This looks like it might be a problem with CurveFS and trying to make sdk calls after the bucket gets deleted. Please open an issue with that product if you would like help with that. If you can reproduce this error and provide sdk logs then we could try to debug it, but currently there isn't enough information to do that. Please let me know if you have any aws-sdk-cpp specific questions that I can help you with.

aws_del_bef.zip aws_del_after.zip

I provided logs both before and after the bucket was deleted. Let me describe the background again: during the process, the bucket was Forbidden, and there was no change in the access to CurveFs from the upper layer. However, it was only when the bucket was finally deleted that a large number of CurlHttp accesses occurred. This timing coincided, leading us to believe that it was the different information returned after the bucket was deleted that caused some issues with the SDK's client judgment.

jmklix commented 6 months ago

Can you please also provide a code sample that reproduces this error with only using the aws-sdk-cpp and s3? We don't guarantee that this sdk will work with curveFS

sglnd commented 6 months ago

Can you please also provide a code sample that reproduces this error with only using the aws-sdk-cpp and s3? We don't guarantee that this sdk will work with curveFS

We've also conducted ongoing analysis. Prior to the bucket deletion, all accesses resulted in HTTP 403 errors. However, after the deletion, there was a significant increase in HTTP 500 errors, accompanied by a large number of Curl HTTP requests. We've also reviewed the code for CurveFS and aws-sdk-cpp.

CurveFS code：

void S3Adapter::PutObjectAsync(std::shared_ptr<PutObjectAsyncContext> context) {
    Aws::S3::Model::PutObjectRequest request;
    request.SetBucket(bucketName_);
    request.SetKey(Aws::String{context->key.c_str(), context->key.size()});

    request.SetBody(Aws::MakeShared<PreallocatedIOStream>(
        AWS_ALLOCATE_TAG, context->buffer, context->bufferSize));

    auto originCallback = context->cb;
    auto wrapperCallback =
        [this,
         originCallback](const std::shared_ptr<PutObjectAsyncContext>& ctx) {
            inflightBytesThrottle_->OnComplete(ctx->bufferSize);
            ctx->cb = originCallback;
            ctx->cb(ctx);
        };

    Aws::S3::PutObjectResponseReceivedHandler handler =
        [context](
            const Aws::S3::S3Client * /*client*/,
            const Aws::S3::Model::PutObjectRequest & /*request*/,
            const Aws::S3::Model::PutObjectOutcome &response,
            const std::shared_ptr<const Aws::Client::AsyncCallerContext>
                &awsCtx) {
            std::shared_ptr<PutObjectAsyncContext> ctx =
                std::const_pointer_cast<PutObjectAsyncContext>(
                    std::dynamic_pointer_cast<const PutObjectAsyncContext>(
                        awsCtx));

            LOG_IF(ERROR, !response.IsSuccess())
                << "PutObjectAsync error: "
                << response.GetError().GetExceptionName()
                << "message: " << response.GetError().GetMessage()
                << "resend: " << ctx->key;

            ctx->retCode = (response.IsSuccess() ? 0 : -1);
            ctx->timer.stop();
            ctx->cb(ctx);
        };

    if (throttle_) {
        throttle_->Add(false, context->bufferSize);
    }

    inflightBytesThrottle_->OnStart(context->bufferSize);
    context->cb = std::move(wrapperCallback);
    s3Client_->PutObjectAsync(request, handler, context);
}

aws-sdk-cpp code：

HttpResponseOutcome AWSClient::AttemptExhaustively(const Aws::Http::URI& uri,
    Aws::Http::HttpMethod method,
    const char* signerName,
    const char* requestName,
    const char* signerRegionOverride,
    const char* signerServiceNameOverride) const
{
    if (!Aws::Utils::IsValidHost(uri.GetAuthority()))
    {
        return HttpResponseOutcome(AWSError<CoreErrors>(CoreErrors::VALIDATION, "", "Invalid DNS Label found in URI host", false/*retryable*/));
    }

    std::shared_ptr<HttpRequest> httpRequest(CreateHttpRequest(uri, method, Aws::Utils::Stream::DefaultResponseStreamFactoryMethod));
    HttpResponseOutcome outcome;
    AWSError<CoreErrors> lastError;
    Aws::Monitoring::CoreMetricsCollection coreMetrics;
    auto contexts = Aws::Monitoring::OnRequestStarted(this->GetServiceClientName(), requestName, httpRequest);
    const char* signerRegion = signerRegionOverride;
    Aws::String regionFromResponse;

    Aws::String invocationId = Aws::Utils::UUID::PseudoRandomUUID();
    RequestInfo requestInfo;
    requestInfo.attempt = 1;
    requestInfo.maxAttempts = 0;
    httpRequest->SetHeaderValue(Http::SDK_INVOCATION_ID_HEADER, invocationId);
    httpRequest->SetHeaderValue(Http::SDK_REQUEST_HEADER, requestInfo);
    AppendRecursionDetectionHeader(httpRequest);

    for (long retries = 0;; retries++)
    {
        if(!m_retryStrategy->HasSendToken())
        {
            return HttpResponseOutcome(AWSError<CoreErrors>(CoreErrors::SLOW_DOWN,
                                                            "",
                                                            "Unable to acquire enough send tokens to execute request.",
                                                            false/*retryable*/));

        };
        outcome = AttemptOneRequest(httpRequest, signerName, requestName, signerRegion, signerServiceNameOverride);
        outcome.SetRetryCount(retries);
        if (retries == 0)
        {
            m_retryStrategy->RequestBookkeeping(outcome);
        }
        else
        {
            m_retryStrategy->RequestBookkeeping(outcome, lastError);
        }
        coreMetrics.httpClientMetrics = httpRequest->GetRequestMetrics();
        TracingUtils::EmitCoreHttpMetrics(httpRequest->GetRequestMetrics(),
            *m_telemetryProvider->getMeter(this->GetServiceClientName(), {}),
            {{TracingUtils::SMITHY_METHOD_DIMENSION, requestName},{TracingUtils::SMITHY_SERVICE_DIMENSION, this->GetServiceClientName()}});
        if (outcome.IsSuccess())
        {
            Aws::Monitoring::OnRequestSucceeded(this->GetServiceClientName(), requestName, httpRequest, outcome, coreMetrics, contexts);
            AWS_LOGSTREAM_TRACE(AWS_CLIENT_LOG_TAG, "Request successful returning.");
            break;
        }
        lastError = outcome.GetError();

        DateTime serverTime = GetServerTimeFromError(outcome.GetError());
        auto clockSkew = DateTime::Diff(serverTime, DateTime::Now());

        Aws::Monitoring::OnRequestFailed(this->GetServiceClientName(), requestName, httpRequest, outcome, coreMetrics, contexts);

        if (!m_httpClient->IsRequestProcessingEnabled())
        {
            AWS_LOGSTREAM_TRACE(AWS_CLIENT_LOG_TAG, "Request was cancelled externally.");
            break;
        }

        // Adjust region
        bool retryWithCorrectRegion = false;
        HttpResponseCode httpResponseCode = outcome.GetError().GetResponseCode();
        if (httpResponseCode == HttpResponseCode::MOVED_PERMANENTLY ||  // 301
            httpResponseCode == HttpResponseCode::TEMPORARY_REDIRECT || // 307
            httpResponseCode == HttpResponseCode::BAD_REQUEST ||        // 400
            httpResponseCode == HttpResponseCode::FORBIDDEN)            // 403
        {
            regionFromResponse = GetErrorMarshaller()->ExtractRegion(outcome.GetError());
            if (m_region == Aws::Region::AWS_GLOBAL && !regionFromResponse.empty() && regionFromResponse != signerRegion)
            {
                signerRegion = regionFromResponse.c_str();
                retryWithCorrectRegion = true;
            }
        }

        long sleepMillis = TracingUtils::MakeCallWithTiming<long>(
            [&]() -> long {
                return m_retryStrategy->CalculateDelayBeforeNextRetry(outcome.GetError(), retries);
            },
            TracingUtils::SMITHY_CLIENT_SERVICE_BACKOFF_DELAY_METRIC,
            *m_telemetryProvider->getMeter(this->GetServiceClientName(), {}),
            {{TracingUtils::SMITHY_METHOD_DIMENSION, requestName},{TracingUtils::SMITHY_SERVICE_DIMENSION, this->GetServiceClientName()}});
        //AdjustClockSkew returns true means clock skew was the problem and skew was adjusted, false otherwise.
        //sleep if clock skew and region was NOT the problem. AdjustClockSkew may update error inside outcome.
        bool shouldSleep = !AdjustClockSkew(outcome, signerName) && !retryWithCorrectRegion;

        if (!retryWithCorrectRegion && !m_retryStrategy->ShouldRetry(outcome.GetError(), retries))
        {
            break;
        }

        AWS_LOGSTREAM_WARN(AWS_CLIENT_LOG_TAG, "Request failed, now waiting " << sleepMillis << " ms before attempting again.");

        if (shouldSleep)
        {
            m_httpClient->RetryRequestSleep(std::chrono::milliseconds(sleepMillis));
        }

        Aws::Http::URI newUri = uri;
        Aws::String newEndpoint = GetErrorMarshaller()->ExtractEndpoint(outcome.GetError());
        if (!newEndpoint.empty())
        {
            newUri.SetAuthority(newEndpoint);
        }
        httpRequest = CreateHttpRequest(newUri, method, Aws::Utils::Stream::DefaultResponseStreamFactoryMethod);

        httpRequest->SetHeaderValue(Http::SDK_INVOCATION_ID_HEADER, invocationId);
        if (serverTime.WasParseSuccessful() && serverTime != DateTime())
        {
            requestInfo.ttl = DateTime::Now() + clockSkew + std::chrono::milliseconds(m_requestTimeoutMs);
        }
        requestInfo.attempt ++;
        requestInfo.maxAttempts = m_retryStrategy->GetMaxAttempts();
        httpRequest->SetHeaderValue(Http::SDK_REQUEST_HEADER, requestInfo);
        Aws::Monitoring::OnRequestRetry(this->GetServiceClientName(), requestName, httpRequest, contexts);
    }
    auto meter = m_telemetryProvider->getMeter(this->GetServiceClientName(), {});
    auto counter = meter->CreateCounter(TracingUtils::SMITHY_CLIENT_SERVICE_ATTEMPTS_METRIC, TracingUtils::COUNT_METRIC_TYPE, "");
    counter->add(requestInfo.attempt, {{TracingUtils::SMITHY_METHOD_DIMENSION, requestName},{TracingUtils::SMITHY_SERVICE_DIMENSION, this->GetServiceClientName()}});
    Aws::Monitoring::OnFinish(this->GetServiceClientName(), requestName, httpRequest, contexts);
    return outcome;
}

Summary: In the logic of AWS SDK for C++, the absence of a sleep buffer mechanism when encountering HTTP 500 errors may lead to the re-creation of HTTP requests, potentially triggering DNS requests. Therefore, there are still several questions we haven't fully understood:

After the bucket is disabled, which requests from object storage will return HTTP 500 error responses before and after deletion? Why do a sudden surge of Curl requests occur instead of requests related to object storage operations?

DmitriyMusatkin commented 4 months ago

couple observations:

you seem to be running against a 3p S3 implementation, guessing from the "server : openresty/1.19.3.1" header in the responses and from 403 UserSuspended, which I believe AWS version of S3 never sends?
SDK is seeing 500s returned when trying to make data plane calls against the bucket that no longer exists. SDK cannot parse error response(body does not seem to be formatted correctly?). this also seems like a non-standard behavior. I believe regular s3 returns 404 in this case
CPP SDK retries 5xx errors. Default is up to 10 retries with fairly small increment between each
curvefs seems to override default cpp sdk retry policy to have a tighter timeouts
CPP SDK might retire ips that return 500s from the pool of resolved ips to avoid hotspotting bad ips across retries.

overall sdk will not do requests on its own. Likely whats happening is that s3 implementation you are hitting is returning 500s , which causes a bit of retry storm, which is further amplified by non-ideal retry config either in your app or curveFS. From the above i dont see anything particularly wrong happening on CPP SDK side, its working as designed.

github-actions[bot] commented 4 months ago

Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.

aws / aws-sdk-cpp