Closed sglnd closed 4 months ago
This looks like it might be a problem with CurveFS and trying to make sdk calls after the bucket gets deleted. Please open an issue with that product if you would like help with that. If you can reproduce this error and provide sdk logs then we could try to debug it, but currently there isn't enough information to do that. Please let me know if you have any aws-sdk-cpp specific questions that I can help you with.
This looks like it might be a problem with CurveFS and trying to make sdk calls after the bucket gets deleted. Please open an issue with that product if you would like help with that. If you can reproduce this error and provide sdk logs then we could try to debug it, but currently there isn't enough information to do that. Please let me know if you have any aws-sdk-cpp specific questions that I can help you with.
aws_del_bef.zip aws_del_after.zip
I provided logs both before and after the bucket was deleted. Let me describe the background again: during the process, the bucket was Forbidden, and there was no change in the access to CurveFs from the upper layer. However, it was only when the bucket was finally deleted that a large number of CurlHttp accesses occurred. This timing coincided, leading us to believe that it was the different information returned after the bucket was deleted that caused some issues with the SDK's client judgment.
Can you please also provide a code sample that reproduces this error with only using the aws-sdk-cpp and s3? We don't guarantee that this sdk will work with curveFS
Can you please also provide a code sample that reproduces this error with only using the aws-sdk-cpp and s3? We don't guarantee that this sdk will work with curveFS
We've also conducted ongoing analysis. Prior to the bucket deletion, all accesses resulted in HTTP 403 errors. However, after the deletion, there was a significant increase in HTTP 500 errors, accompanied by a large number of Curl HTTP requests. We've also reviewed the code for CurveFS and aws-sdk-cpp.
CurveFS code:
void S3Adapter::PutObjectAsync(std::shared_ptr<PutObjectAsyncContext> context) {
Aws::S3::Model::PutObjectRequest request;
request.SetBucket(bucketName_);
request.SetKey(Aws::String{context->key.c_str(), context->key.size()});
request.SetBody(Aws::MakeShared<PreallocatedIOStream>(
AWS_ALLOCATE_TAG, context->buffer, context->bufferSize));
auto originCallback = context->cb;
auto wrapperCallback =
[this,
originCallback](const std::shared_ptr<PutObjectAsyncContext>& ctx) {
inflightBytesThrottle_->OnComplete(ctx->bufferSize);
ctx->cb = originCallback;
ctx->cb(ctx);
};
Aws::S3::PutObjectResponseReceivedHandler handler =
[context](
const Aws::S3::S3Client * /*client*/,
const Aws::S3::Model::PutObjectRequest & /*request*/,
const Aws::S3::Model::PutObjectOutcome &response,
const std::shared_ptr<const Aws::Client::AsyncCallerContext>
&awsCtx) {
std::shared_ptr<PutObjectAsyncContext> ctx =
std::const_pointer_cast<PutObjectAsyncContext>(
std::dynamic_pointer_cast<const PutObjectAsyncContext>(
awsCtx));
LOG_IF(ERROR, !response.IsSuccess())
<< "PutObjectAsync error: "
<< response.GetError().GetExceptionName()
<< "message: " << response.GetError().GetMessage()
<< "resend: " << ctx->key;
ctx->retCode = (response.IsSuccess() ? 0 : -1);
ctx->timer.stop();
ctx->cb(ctx);
};
if (throttle_) {
throttle_->Add(false, context->bufferSize);
}
inflightBytesThrottle_->OnStart(context->bufferSize);
context->cb = std::move(wrapperCallback);
s3Client_->PutObjectAsync(request, handler, context);
}
aws-sdk-cpp code:
HttpResponseOutcome AWSClient::AttemptExhaustively(const Aws::Http::URI& uri,
Aws::Http::HttpMethod method,
const char* signerName,
const char* requestName,
const char* signerRegionOverride,
const char* signerServiceNameOverride) const
{
if (!Aws::Utils::IsValidHost(uri.GetAuthority()))
{
return HttpResponseOutcome(AWSError<CoreErrors>(CoreErrors::VALIDATION, "", "Invalid DNS Label found in URI host", false/*retryable*/));
}
std::shared_ptr<HttpRequest> httpRequest(CreateHttpRequest(uri, method, Aws::Utils::Stream::DefaultResponseStreamFactoryMethod));
HttpResponseOutcome outcome;
AWSError<CoreErrors> lastError;
Aws::Monitoring::CoreMetricsCollection coreMetrics;
auto contexts = Aws::Monitoring::OnRequestStarted(this->GetServiceClientName(), requestName, httpRequest);
const char* signerRegion = signerRegionOverride;
Aws::String regionFromResponse;
Aws::String invocationId = Aws::Utils::UUID::PseudoRandomUUID();
RequestInfo requestInfo;
requestInfo.attempt = 1;
requestInfo.maxAttempts = 0;
httpRequest->SetHeaderValue(Http::SDK_INVOCATION_ID_HEADER, invocationId);
httpRequest->SetHeaderValue(Http::SDK_REQUEST_HEADER, requestInfo);
AppendRecursionDetectionHeader(httpRequest);
for (long retries = 0;; retries++)
{
if(!m_retryStrategy->HasSendToken())
{
return HttpResponseOutcome(AWSError<CoreErrors>(CoreErrors::SLOW_DOWN,
"",
"Unable to acquire enough send tokens to execute request.",
false/*retryable*/));
};
outcome = AttemptOneRequest(httpRequest, signerName, requestName, signerRegion, signerServiceNameOverride);
outcome.SetRetryCount(retries);
if (retries == 0)
{
m_retryStrategy->RequestBookkeeping(outcome);
}
else
{
m_retryStrategy->RequestBookkeeping(outcome, lastError);
}
coreMetrics.httpClientMetrics = httpRequest->GetRequestMetrics();
TracingUtils::EmitCoreHttpMetrics(httpRequest->GetRequestMetrics(),
*m_telemetryProvider->getMeter(this->GetServiceClientName(), {}),
{{TracingUtils::SMITHY_METHOD_DIMENSION, requestName},{TracingUtils::SMITHY_SERVICE_DIMENSION, this->GetServiceClientName()}});
if (outcome.IsSuccess())
{
Aws::Monitoring::OnRequestSucceeded(this->GetServiceClientName(), requestName, httpRequest, outcome, coreMetrics, contexts);
AWS_LOGSTREAM_TRACE(AWS_CLIENT_LOG_TAG, "Request successful returning.");
break;
}
lastError = outcome.GetError();
DateTime serverTime = GetServerTimeFromError(outcome.GetError());
auto clockSkew = DateTime::Diff(serverTime, DateTime::Now());
Aws::Monitoring::OnRequestFailed(this->GetServiceClientName(), requestName, httpRequest, outcome, coreMetrics, contexts);
if (!m_httpClient->IsRequestProcessingEnabled())
{
AWS_LOGSTREAM_TRACE(AWS_CLIENT_LOG_TAG, "Request was cancelled externally.");
break;
}
// Adjust region
bool retryWithCorrectRegion = false;
HttpResponseCode httpResponseCode = outcome.GetError().GetResponseCode();
if (httpResponseCode == HttpResponseCode::MOVED_PERMANENTLY || // 301
httpResponseCode == HttpResponseCode::TEMPORARY_REDIRECT || // 307
httpResponseCode == HttpResponseCode::BAD_REQUEST || // 400
httpResponseCode == HttpResponseCode::FORBIDDEN) // 403
{
regionFromResponse = GetErrorMarshaller()->ExtractRegion(outcome.GetError());
if (m_region == Aws::Region::AWS_GLOBAL && !regionFromResponse.empty() && regionFromResponse != signerRegion)
{
signerRegion = regionFromResponse.c_str();
retryWithCorrectRegion = true;
}
}
long sleepMillis = TracingUtils::MakeCallWithTiming<long>(
[&]() -> long {
return m_retryStrategy->CalculateDelayBeforeNextRetry(outcome.GetError(), retries);
},
TracingUtils::SMITHY_CLIENT_SERVICE_BACKOFF_DELAY_METRIC,
*m_telemetryProvider->getMeter(this->GetServiceClientName(), {}),
{{TracingUtils::SMITHY_METHOD_DIMENSION, requestName},{TracingUtils::SMITHY_SERVICE_DIMENSION, this->GetServiceClientName()}});
//AdjustClockSkew returns true means clock skew was the problem and skew was adjusted, false otherwise.
//sleep if clock skew and region was NOT the problem. AdjustClockSkew may update error inside outcome.
bool shouldSleep = !AdjustClockSkew(outcome, signerName) && !retryWithCorrectRegion;
if (!retryWithCorrectRegion && !m_retryStrategy->ShouldRetry(outcome.GetError(), retries))
{
break;
}
AWS_LOGSTREAM_WARN(AWS_CLIENT_LOG_TAG, "Request failed, now waiting " << sleepMillis << " ms before attempting again.");
if (shouldSleep)
{
m_httpClient->RetryRequestSleep(std::chrono::milliseconds(sleepMillis));
}
Aws::Http::URI newUri = uri;
Aws::String newEndpoint = GetErrorMarshaller()->ExtractEndpoint(outcome.GetError());
if (!newEndpoint.empty())
{
newUri.SetAuthority(newEndpoint);
}
httpRequest = CreateHttpRequest(newUri, method, Aws::Utils::Stream::DefaultResponseStreamFactoryMethod);
httpRequest->SetHeaderValue(Http::SDK_INVOCATION_ID_HEADER, invocationId);
if (serverTime.WasParseSuccessful() && serverTime != DateTime())
{
requestInfo.ttl = DateTime::Now() + clockSkew + std::chrono::milliseconds(m_requestTimeoutMs);
}
requestInfo.attempt ++;
requestInfo.maxAttempts = m_retryStrategy->GetMaxAttempts();
httpRequest->SetHeaderValue(Http::SDK_REQUEST_HEADER, requestInfo);
Aws::Monitoring::OnRequestRetry(this->GetServiceClientName(), requestName, httpRequest, contexts);
}
auto meter = m_telemetryProvider->getMeter(this->GetServiceClientName(), {});
auto counter = meter->CreateCounter(TracingUtils::SMITHY_CLIENT_SERVICE_ATTEMPTS_METRIC, TracingUtils::COUNT_METRIC_TYPE, "");
counter->add(requestInfo.attempt, {{TracingUtils::SMITHY_METHOD_DIMENSION, requestName},{TracingUtils::SMITHY_SERVICE_DIMENSION, this->GetServiceClientName()}});
Aws::Monitoring::OnFinish(this->GetServiceClientName(), requestName, httpRequest, contexts);
return outcome;
}
Summary: In the logic of AWS SDK for C++, the absence of a sleep buffer mechanism when encountering HTTP 500 errors may lead to the re-creation of HTTP requests, potentially triggering DNS requests. Therefore, there are still several questions we haven't fully understood:
After the bucket is disabled, which requests from object storage will return HTTP 500 error responses before and after deletion? Why do a sudden surge of Curl requests occur instead of requests related to object storage operations?
couple observations:
overall sdk will not do requests on its own. Likely whats happening is that s3 implementation you are hitting is returning 500s , which causes a bit of retry storm, which is further amplified by non-ideal retry config either in your app or curveFS. From the above i dont see anything particularly wrong happening on CPP SDK side, its working as designed.
Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.
Describe the bug
We are using CurveFS to mount object storage as local files, and CurveFS relies on AWS SDK for C++ implementation. When we were making a change to an object storage bucket, we first disabled the bucket mounted by CurveFS, and then deleted it. After completing the bucket deletion, AWS SDK for C++ started generating a large number of CurlHttp requests. Upon reviewing the logs and code printed by the AWS SDK for C++ client, we found that after disabling the bucket, all returned httpreturncodes were 403. However, we are puzzled about the increase in the number of CurlHttp requests. As a result of this increase in CurlHttp requests, and due to our DNS configuration issue, DNS issues were also triggered. Therefore, we would like to identify the root cause of this problem. please help ~
Expected Behavior
Help explain the reason, and it's not certain if this is a bug, but it feels like it could be optimized. The highlighted area in the graph represents the situation after the bucket has been deleted.
Current Behavior
The highlighted area in the graph represents the situation after the bucket has been deleted.
Reproduction Steps
I reviewed some of the code and it seems that after the bucket is deleted, although the HTTP status code still returns 403, there are some differences in the internal returns, causing a change in the logic of the AWS client. However, I cannot confirm this.
Possible Solution
No response
Additional Information/Context
No response
AWS CPP SDK version used
not sure
Compiler and Version used
not sure
Operating System and version
suse sp15