aws / aws-lambda-dotnet

Libraries, samples and tools to help .NET Core developers develop AWS Lambda functions.
Apache License 2.0

AWS Lambda .NET 6 ARM Exceptions #1244

Closed: kubalunacek closed this issue 10 months ago

kubalunacek commented 2 years ago

Describe the bug

We have an application that runs on AWS Lambda. Until recently we were running it on .NET Core 3.1 and the x86 architecture. We have now ported it to .NET 6 on ARM64, and this has brought unexpected problems: we are randomly getting errors that we believe point to a problem in the AWS Lambda runtime for .NET 6. On the original application (.NET Core 3.1 + x86) these errors were not present. Does anyone have experience with the migration to ARM and .NET 6? Is it more memory intensive, or could this be a bug in the Lambda runtime? Any ideas on how to solve the problem will be gratefully received. Thanks.

Exception:

START RequestId: 79052b48-798c-4626-8cbb-7f9240196df6 Version: $LATEST  
Unhandled exception. System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values. (Parameter 'state')
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
END RequestId: 79052b48-798c-4626-8cbb-7f9240196df6
REPORT RequestId: 79052b48-798c-4626-8cbb-7f9240196df6 Duration: 1051.21 ms Billed Duration: 1052 ms Memory Size: 512 MB Max Memory Used: 368 MB
RequestId: 79052b48-798c-4626-8cbb-7f9240196df6 Error: Runtime exited with error: signal: aborted Runtime.ExitError

Expected Behavior

Lambda works without errors.

Current Behavior

Lambda fails unexpectedly.

START RequestId: 79052b48-798c-4626-8cbb-7f9240196df6 Version: $LATEST  
Unhandled exception. System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values. (Parameter 'state')
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
END RequestId: 79052b48-798c-4626-8cbb-7f9240196df6
REPORT RequestId: 79052b48-798c-4626-8cbb-7f9240196df6 Duration: 1051.21 ms Billed Duration: 1052 ms Memory Size: 512 MB Max Memory Used: 368 MB
RequestId: 79052b48-798c-4626-8cbb-7f9240196df6 Error: Runtime exited with error: signal: aborted Runtime.ExitError

Reproduction Steps

We have no trivial reproduction code for this, but we have a global try/catch block in our Lambda handler and it does not catch anything.

public class PingServer : AController
{
    public async Task<APIGatewayProxyResponse> HandleRequest(APIGatewayProxyRequest req, ILambdaContext context)
    {
        try
        {
            using (var scope = CreateScope())
            {
                return await HandleRequest(scope, scope.ServiceProvider.GetService<CoreController.PingServer>(), req, context);
            }
        }
        catch (Exception ex)
        {
            context.Logger.LogCritical($"Error from {nameof(PingServer)}_{nameof(HandleRequest)}\n{ex.ToString()}");
            throw;
        }
    }
}
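
For illustration only (hypothetical code, not our production handler), the sketch below shows why a handler-level try/catch cannot observe this kind of failure: the exception is raised on a thread-pool thread that the handler never awaits, so it surfaces as an unhandled exception and aborts the process instead of propagating into the catch block.

using System;
using System.Threading;
using System.Threading.Tasks;

public static class UnhandledThreadPoolExceptionDemo
{
    public static async Task Main()
    {
        try
        {
            // Work queued on the thread pool is never awaited by this method,
            // similar to how socket completion callbacks run inside the runtime.
            ThreadPool.QueueUserWorkItem(_ => throw new ArgumentOutOfRangeException("state"));

            await Task.Delay(1000); // give the work item time to run
        }
        catch (Exception ex)
        {
            // Never reached: the exception escapes on another thread-pool thread,
            // outside this try/catch, and terminates the process.
            Console.WriteLine($"Caught: {ex}");
        }
    }
}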

Possible Solution

No response

Additional Information/Context

No response

AWS .NET SDK and/or Package version used

Amazon.Lambda.APIGatewayEvents 2.4.1
Amazon.Lambda.Core 2.1.0
Amazon.Lambda.Serialization.SystemTextJson 2.3.0
AWSXRayRecorder.Handlers.AwsSdk 2.8.3

Targeted .NET Platform

.NET 6

Operating System and version

AWS Lambda

normj commented 2 years ago

Does this happen during a cold start or after some number of invocations? One way to check whether it happens during cold starts is to see whether the failing invoke is the first one in the CloudWatch Log Stream.
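
For anyone who wants to confirm this from the function itself, here is a minimal cold-start logging sketch (illustrative only, assuming a plain API Gateway handler):

using Amazon.Lambda.APIGatewayEvents;
using Amazon.Lambda.Core;

public class Function
{
    // Static fields are initialized once per execution environment, so only the
    // first invocation in a container (the cold start) observes the value true.
    private static bool _coldStart = true;

    public APIGatewayProxyResponse Handler(APIGatewayProxyRequest request, ILambdaContext context)
    {
        context.Logger.LogInformation($"ColdStart={_coldStart} RequestId={context.AwsRequestId}");
        _coldStart = false;

        // ... real handler logic here ...
        return new APIGatewayProxyResponse { StatusCode = 200 };
    }
}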

kubalunacek commented 2 years ago

Hey, it is definitely not a cold start; see the screenshots. What is interesting is that we have two Lambdas with the same code base: the first has 256 MB of memory, the second 512 MB. The first always crashes with Size: 256 MB, Max Memory Used: 256 MB; the second with Size: 512 MB, Max Memory Used: 368 MB.

Lambda log: [screenshot attached]

bjorg commented 2 years ago

I ran into the same issue as well. Also ARM64, but with 128 MB configured. It's pretty rare: I've been benchmarking various configuration and deployment options and have only seen it once.

[screenshot attached]

bjorg commented 2 years ago

One thing I noticed is that all invocations were short until the one that failed, and it's about the same duration as in the original report.

[screenshot attached]

kubalunacek commented 2 years ago

Hi @ashishdhingra, what do you need for reproduction? We have our production code deployed on AWS Lambda; I am unable to provide the production code. But we get this error one to four times a day, and from our observations it's not so much the code as the number of calls at the moment: the other 10+ Lambdas with the same code base and less frequent calls don't suffer from anything like this.

dglozano commented 2 years ago

We are experiencing the exact same issue. Our try/catch in the handler doesn't catch anything, and it crashes with the same Runtime error.

It happens randomly, on average once a day I would say.

.NET 6, ARM64 and 1024MB of configured memory.

And we observe the same thing as @kubalunacek 👇🏻

> But we get this error one to four times a day, and from our observations it's not so much the code as the number of calls at the moment: the other 10+ Lambdas with the same code base and less frequent calls don't suffer from anything like this.

JirHan commented 2 years ago

Hey, we are having exactly the same issue. A few Lambda calls per day fail with the error above. It is always the endpoints with the higher invocation counts. We tried to debug and analyze this, but without any success.

LarsAndreasEk commented 2 years ago

We have the exact same issue on ARM / .NET 6 Lambdas.

kubalunacek commented 2 years ago

We switched the same solution (same code base, .NET 6) to the x86_64 architecture and all these errors disappeared, so the problem is in the .NET 6 ARM64 Lambda runtime. Do you have any progress, @normj?

antonfirsov commented 1 year ago

This definitely seems to be related to https://github.com/dotnet/runtime/issues/72365.

@bjorg @kubalunacek @LarsAndreasEk or others, any chance you can produce & share a dump?

antonfirsov commented 1 year ago

Actually, this could be caused by ValueTask misuse in user code; see https://github.com/dotnet/runtime/issues/70486#issuecomment-1151175694.
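
For context, the misuse described there is consuming the same ValueTask more than once. A rough sketch of what that looks like (hypothetical method, not taken from any code in this thread):

using System;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class ValueTaskMisuseExample
{
    // A ValueTask may only be consumed once. Consuming it a second time is undefined
    // behavior: it can throw, return stale data, or corrupt the pooled IValueTaskSource
    // behind it (for sockets, the reusable AwaitableSocketAsyncEventArgs), which can
    // later crash in completely unrelated code.
    public static async Task ReadTwiceAsync(NetworkStream stream, byte[] buffer)
    {
        ValueTask<int> pending = stream.ReadAsync(buffer.AsMemory());

        int first = await pending;   // OK: the one allowed consumption
        int second = await pending;  // BUG: second consumption of the same ValueTask

        Console.WriteLine($"{first} / {second}");
    }
}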

holthe-tveit commented 1 year ago

Hi @antonfirsov, I'm working on the same team as @LarsAndreasEk. I do not think we captured any dump related to this issue, but it did go away for us when we switched to the AWS-supported dotnet6 runtime on x86_64. Our code does not use any ValueTasks directly as far as I could find, but under the hood they are of course in use.

kknd4eva commented 1 year ago

Reporting the same issue, running an AWS Lambda function using ARM64 and .NET 6. The crash has only occurred a single time and appears to be (though it's difficult to say for sure) in the code that requests information from DynamoDB.

        /// <summary>
        /// Get the token from the cache. If it's expired, generate a new one first
        /// </summary>
        /// <param name="env">The targeted environment, e.g. prod or preprod</param>
        /// <returns></returns>
        public async Task<DbCacheItem> GetTokenAsync(string env)
        {
            DbCacheItem dbCacheItem;
            DateTime expiryDateTime;

            try
            {
                dbCacheItem = await _dynamoDbContext.LoadAsync<DbCacheItem>(env, _dBOperationConfig);

                if (dbCacheItem == null)
                {
                    expiryDateTime = DateTime.UtcNow.AddMinutes(-5);
                }
                else
                {
                    expiryDateTime = DateTime.Parse(dbCacheItem.DateTimeOfExpiry);
                }

                if (expiryDateTime < DateTime.UtcNow)
                {
                    Logger.LogInformation($"Existing token is null or has expired. Generating a new one");
                    var sapCloudService = new SapCloudPlatformIntegrationService().WithEnvironment(env);
                    var refreshedToken = await sapCloudService.GenerateTokenAsync();
                    dbCacheItem = ConvertToCacheItem(refreshedToken, env);
                    await SaveTokenAsync(dbCacheItem);
                }

                return dbCacheItem;
            }
            catch (Exception ex)
            {
                throw;
            }

        }

The exception appears to be unhandled:

2023-02-14T05:11:38.681Z c33d7e47-4d0b-40f2-9f28-804408778387 info
{ "cold_start": false, "xray_trace_id": "1-63eb180a-79315a2f75da76e74d6273e2", "client": "S2CProcessMessage", "request_id": "No Request Id", "function_name": "Service-SCPI-Authenticator", "function_version": "$LATEST", "function_memory_size": 2048, "function_arn": "arn:aws:lambda:ap-southeast-2:xxxxxxx:function:Service-SCPI-Authenticator", "function_request_id": "c33d7e47-4d0b-40f2-9f28-804408778387", "timestamp": "2023-02-14T05:11:38.6812320Z", "level": "Information", "service": "service_undefined", "name": "AWS.Lambda.Powertools.Logging.Logger", "message": "Attempting to fetch access token" }

2023-02-14T16:11:38.694+11:00   Unhandled exception. System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values. (Parameter 'state')
2023-02-14T16:11:38.694+11:00   at System.Threading.ThreadPool.<>c.<.cctor>b__86_0(Object state)
2023-02-14T16:11:38.694+11:00   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
2023-02-14T16:11:38.694+11:00   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
2023-02-14T16:11:38.694+11:00   at System.Net.Sockets.SocketAsyncEventArgs.OnCompletedInternal()
2023-02-14T16:11:38.694+11:00   at System.Net.Sockets.SocketAsyncEventArgs.FinishOperationAsyncSuccess(Int32 bytesTransferred, SocketFlags flags)
2023-02-14T16:11:38.694+11:00   at System.Net.Sockets.SocketAsyncEventArgs.CompletionCallback(Int32 bytesTransferred, SocketFlags flags, SocketError socketError)
2023-02-14T16:11:38.694+11:00   at System.Net.Sockets.SocketAsyncEventArgs.TransferCompletionCallbackCore(Int32 bytesTransferred, Byte[] socketAddress, Int32 socketAddressSize, SocketFlags receivedFlags, SocketError socketError)
2023-02-14T16:11:38.694+11:00   at System.Net.Sockets.SocketAsyncContext.OperationQueue`1.ProcessAsyncOperation(TOperation op)
2023-02-14T16:11:38.694+11:00   at System.Net.Sockets.SocketAsyncContext.ReadOperation.System.Threading.IThreadPoolWorkItem.Execute()
2023-02-14T16:11:38.694+11:00   at System.Net.Sockets.SocketAsyncContext.HandleEvents(SocketEvents events)
2023-02-14T16:11:38.694+11:00   at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
2023-02-14T16:11:38.694+11:00   at System.Threading.ThreadPoolWorkQueue.Dispatch()
2023-02-14T16:11:38.694+11:00   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
2023-02-14T16:11:38.694+11:00   at System.Threading.Thread.StartCallback()
RequestId: c33d7e47-4d0b-40f2-9f28-804408778387 Error: Runtime exited with error: signal: aborted Runtime.ExitError

We have seen this error only once, but it is concerning. Should we look to change back to .NET 6 on x86 until this is understood?

kknd4eva commented 1 year ago

Just an FYI, we have had more occurrences of this issue. Very infrequent, but still concerning.

antonfirsov commented 1 year ago

Note that a bug has been identified in .NET that may be causing this. It will be fixed in the upcoming 6.0.18 and 7.0.7 releases (in June, if I'm not mistaken): https://github.com/dotnet/runtime/pull/84432 https://github.com/dotnet/runtime/pull/84641

@kknd4eva and others, please stay tuned for the update and let us know if it helped!

kknd4eva commented 1 year ago

This is still occurring. Should we have expected this to be resolved by now? (Once again, only impacting .NET 6 + ARM64.)

CloudWatch Logs Insights
region: ap-southeast-2
log-group-names: /aws/lambda/SOE-Order-File-Service
start-time: -43200s
end-time: 0s
query-string:

fields @timestamp, @logStream, @log, @message
| filter @message like 'Specified argument was out of the range of valid values. (Parameter'

Result:

@timestamp: 2023-07-30 06:49:39.317
@logStream: 2023/07/30/[$LATEST]2f985c3216d7442981641381a86663f8
@log: 298552055425:/aws/lambda/SOE-Order-File-Service
@message: Unhandled exception. System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values. (Parameter 'state')

antonfirsov commented 1 year ago

Are you 100% sure your code is running on the updated (6.0.18+) runtime? What is your exact dotnet runtime version?
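
One way to check is to log it from inside the function itself; a minimal sketch that can be dropped into any handler or a top-level program:

using System;
using System.Runtime.InteropServices;

// Logs the exact runtime the process is executing on, e.g. ".NET 6.0.18" on Arm64.
Console.WriteLine(
    $"Runtime: {RuntimeInformation.FrameworkDescription}, " +
    $"Version: {Environment.Version}, " +
    $"Arch: {RuntimeInformation.ProcessArchitecture}");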

kknd4eva commented 1 year ago

> Are you 100% sure your code is running on the updated (6.0.18+) runtime? What is your exact dotnet runtime version?

I believe so. The Lambda runtime update is set to 'AUTO' and the function runtime is currently:

INIT_START Runtime Version: dotnet.6v17 Runtime Version ARN: arn:aws:lambda:ap-southeast-2::runtime:4b286ddbfcfeaf773c960ecafbd89d987b45ebf710c123baf0640c1fd8798259 --

kknd4eva commented 1 year ago

Ah, I see - the version says v17. I would have thought an AUTO configuration would have automatically updated. I'll investigate how I can force this update.

philasmar commented 1 year ago

@kknd4eva could you please check if the issue has been resolved in the latest Lambda .NET patches? I believe they are on dotnet.6v19 now. I am trying to reproduce the issue locally, but I haven't been able to so far.
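
Since the reports above tie the crashes to higher invocation volume, one option is to drive sustained concurrent load against a deployed test function rather than a single local invoke. A rough sketch (the endpoint URL below is a placeholder):

using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

// Rough load driver: fire batches of concurrent requests at the function's
// API Gateway endpoint to try to trigger the intermittent crash.
var client = new HttpClient();
var endpoint = "https://example.execute-api.ap-southeast-2.amazonaws.com/ping"; // placeholder

for (int round = 0; round < 100; round++)
{
    var batch = Enumerable.Range(0, 50).Select(_ => client.GetAsync(endpoint));
    HttpResponseMessage[] responses = await Task.WhenAll(batch);
    Console.WriteLine($"Round {round}: {responses.Count(r => r.IsSuccessStatusCode)}/50 succeeded");
}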

github-actions[bot] commented 10 months ago

This issue has not received a response in 5 days. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled.