aws / aws-lambda-dotnet

Libraries, samples and tools to help .NET Core developers develop AWS Lambda functions.
Apache License 2.0

Debugging Invocation Errors that don't appear in the logs #245

Closed PaulColeman closed 3 years ago

PaulColeman commented 6 years ago

Does anyone have ideas how to debug invocation errors that don't appear in CloudWatch logs?

I am seeing cases where a lambda seemingly randomly will fail to invoke 1 to 7 times, incrementing the CloudWatch lambda error count, but no invocation (START, END, or REPORT) appears in the CloudWatch logs for the lambda. Nothing appears in the deadletter queue either.

I have 40ish similar lambdas and 8 of them had this same behavior at very similar times. These failures happen very infrequently, but when I do see them it is always in a similar pattern: multiple lambdas, cloudwatch error counters > 0, nothing in the cloudwatch logs.

I don't think this is a permissions issue with the lambda's ability to write to the logs, as at other times it invokes correctly and START, END, REPORT, etc. do appear in the CloudWatch logs.

I assume it must be some issue setting up the environment -- the stuff that happens before invoke. How can I get to the bottom of this?

These lambdas are all .net core 1.0.

normj commented 6 years ago

Would it be possible to narrow down the time window when these happen and provide the account id used?

PaulColeman commented 6 years ago

Thanks Norm. On one occasion it happened between 2018-03-15 19:56:00 and 2018-03-15 20:13:00 UTC. I think the failures all happened clustered within seconds of each other, but I'm not exactly sure where in that range they happened.

688458520130 is the id.

PaulColeman commented 6 years ago

Any update on this? Thanks for investigating.

paul-zah commented 6 years ago

Any updates on this issue? I'm also struggling with exactly the same issue. Invocation error reported, but nothing in the CloudWatch logs.

vellozzi commented 6 years ago

@PaulColeman and @paul-zah Can you provide minimal code repros for your functions that are dying? If you're able to isolate the code that's causing the issue it would help a lot.

mbp commented 6 years ago

I'm experiencing the same issue. While load testing my function, I see quite a high error count, but cannot find any exceptions in my Lambda logs.

wv-jtowers commented 6 years ago

Just a thought - It might be worth trying to move the logging up to the LambdaEntryPoint (i.e. Program.cs) to handle any errors that might be thrown during bootstrapping the WebHost. I do this with serilog in non-lambda services following their guidance here (try / catch logging around program.cs): https://github.com/serilog/serilog-aspnetcore/blob/dev/samples/SimpleWebSample/Program.cs#L13
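For readers on the Python runtime, the same idea can be sketched as a decorator around the handler. This is only a minimal illustration, not the Serilog setup linked above: `log_unhandled` and the example `handler` are hypothetical names, and the wrapper can only report exceptions raised inside the handler call. It will not see failures that kill the process (out of memory, timeouts) or that happen before the handler is invoked.

```python
import functools
import logging

logger = logging.getLogger("lambda_entry")


def log_unhandled(handler):
    """Log any exception raised by handler with its traceback, then re-raise."""

    @functools.wraps(handler)
    def wrapper(event, context):
        try:
            return handler(event, context)
        except Exception:
            # logger.exception records the message plus the full traceback.
            logger.exception("Unhandled exception in %s", handler.__name__)
            # Re-raise so the invocation is still reported as failed.
            raise

    return wrapper


@log_unhandled
def handler(event, context):
    # Hypothetical handler body, used only to show the wrapper in action.
    if "name" not in event:
        raise KeyError("name")
    return {"greeting": f"hello {event['name']}"}
```

If the error metric still rises while this wrapper logs nothing, that suggests the failure is happening outside the handler body.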

Grif-fin commented 6 years ago

Noticing the same issue. All invocations from CloudWatch are failing, which causes a spike in the invocation errors monitoring tab, but the errors do not appear in the logs.

SejalChauhan commented 6 years ago

I am seeing a similar issue where the invocation count is increasing infrequently and the CloudWatch logs don't have any error logs. Has this been root-caused?

VineeC commented 6 years ago

I am facing the same issue. The logs do not show any error and the function seems to be working as expected, but I can see the CloudWatch alarm for errors triggering when the lambda is invoked.

facundovs commented 5 years ago

Hi Guys. I am facing the same issue. Any updates on this?

newbreedofgeek commented 5 years ago

This may be an issue with Lambda itself - I've been noticing this behavior for months now. I don't use dotnet; I use node and aws-sdk instead.

You will see errors in the lambda Monitoring dashboard, and clicking through the logs for that time range you will see no trace of the error. This in my opinion is one of the internal lambda "quirks", similar to idempotency issues in aws lambda (where you can't guarantee your function will run exactly once... it can run multiple times, seconds/mins apart, even when there is no error detected). As with the idempotency issue, you will need to do some defensive coding in your app to account for internal errors: make sure you do proper error handling in your code and catch/throw errors with proper log tracing.

If you then see "internal errors" that seem to happen outside your error handling, you should be able to discount them as anomalies or false positives, as you are confident in your error handling coverage. (Not ideal, but one of the quirks of serverless computing: the issue is on someone else's server :)

matthewdenobrega commented 4 years ago

I am seeing the same thing - on node 10 lambdas. The alarms are triggered on errors crossing a threshold, but there are no errors in the logs. I end up wasting quite a lot of time checking false positives, and I can't see a reason why this could be the desired behaviour, so would be great if the lambda team could improve this area.

aya-givati commented 4 years ago

Happened to me as well... using python. My lambda is triggered every second, so up until now I was sure I could not find the error because I have so many logs and was not using the right filter... I never thought that there are simply no logs... However, I don't think it is random. It usually happens when there are problems in the DB...

nidhi7 commented 4 years ago

This happened to me as well. I have java sdk lambdas in 2 separate regions and both of them generated error metrics from 6:45 - 7:10 AM CDT but there are no ERROR logs in cloudwatch.

adrai commented 4 years ago

Other stuff to look at:

marcioemiranda commented 3 years ago

This was happening to me as well in python 3.8.

Use the following query in Log Insights: fields @timestamp, @message | filter @message like "Process exited before completing request" | sort @timestamp asc | limit 20

It might be a memory problem causing the error. A timeout can also cause an error in lambda, and you have to use a different query to find it.
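For anyone who wants to run that query from a script rather than the console, here is a rough sketch. It assumes `logs_client` is a boto3 CloudWatch Logs client (`boto3.client("logs")`) and that the function writes to the default `/aws/lambda/<function name>` log group. `build_query` and `run_query` are hypothetical helper names. The phrase is a parameter so the same helper can be reused with other messages.

```python
import time


def build_query(phrase, limit=20):
    """Return a Logs Insights query that filters messages containing phrase."""
    return (
        "fields @timestamp, @message"
        f' | filter @message like "{phrase}"'
        " | sort @timestamp asc"
        f" | limit {limit}"
    )


def run_query(logs_client, log_group, phrase, start, end, poll_seconds=1.0):
    """Run the query over [start, end] (epoch seconds) and return rows as dicts."""
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start),
        endTime=int(end),
        queryString=build_query(phrase),
    )["queryId"]
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        # Keep polling only while the query is still queued or running.
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    # Each result row is a list of {"field": ..., "value": ...} cells.
    return [
        {cell["field"]: cell["value"] for cell in row}
        for row in response["results"]
    ]
```

A call might look like `run_query(boto3.client("logs"), "/aws/lambda/my-fn", "Process exited before completing request", start, end)`, where `my-fn` is a placeholder function name. An empty result only means the phrase was not logged in that window.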

ashishdhingra commented 3 years ago

Hi @PaulColeman,

Good morning.

I was going through the issue backlog and came across this guidance question. Please let me know if this is still an issue or else if this could be closed.

Thanks, Ashish

github-actions[bot] commented 3 years ago

This issue has not received a response in 2 weeks. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled.

aya-givati commented 3 years ago

still an issue...

ashishdhingra commented 3 years ago

Hi @PaulColeman @aya-givati,

Please have a look at the article "How do I troubleshoot Lambda function failures?" and let me know if it helps.

Thanks, Ashish

aya-givati commented 3 years ago

Hi @ashishdhingra, Thank you for your response. Unfortunately it did NOT help me. My problem is that my "Error" metric alert is on and I can't find the lines in the log that explain why.

ashishdhingra commented 3 years ago

@aya-givati I'm not sure what to recommend here since the invocation errors occur outside of the .NET SDK. As explained in the documentation link I shared, CloudWatch is the option for any code-related errors. For invocation errors, however, CloudTrail could be the option. I would suggest contacting CloudWatch support for more details on troubleshooting. I will try to see if I can find any guidance, but this doesn't appear to be a .NET SDK issue.

I do see that you are using Python SDK. So this issue appears to be service specific, not a specific SDK issue. Were you able to get guidance from Python SDK team which might be helpful?

github-actions[bot] commented 3 years ago

This issue has not received a response in 2 weeks. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled.

wanpdsantos commented 2 years ago

Getting the same errors! Also spent a lot of time trying to understand.

Nathi360 commented 2 years ago

Experiencing the same. I added try-catch mechanisms with logging in my services. None of that logging shows up in the CloudWatch logs, even though the CloudWatch error alerts on the lambda are firing.

rma6 commented 2 years ago

I'm having the same problem

moltar commented 2 years ago

Having the same issue on Node 14. Nothing in the logs at all, yet alarms get tripped.

mareksamec commented 1 year ago

Same issue with Python. All red in metrics graph, standard logging enabled but absolutely nothing in CW logs.

EzequielGhR commented 1 year ago

Experiencing the same issue with Python, and also wasting lots of time

KumarAbhinav2 commented 1 year ago

Try Querying the log insights with "Task timed out" Phrase

fields @timestamp, @message | filter @message like "Task timed out" | sort @timestamp asc | limit 20

andrekardec commented 1 year ago

For me the problem was the policy attached to the Lambda. I made a custom policy and the log-group ARN was not right. Fixing that, fixed the problem - running a Python lambda as well. To check it using the AWS Console, go to configurations > permissions, and check if the role has appropriate policy.
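A rough way to sanity-check this offline is to compare the policy document against the ARN the function would write to. The sketch below is an approximation, not a real IAM evaluator: it only looks at `Allow` statements, ignores `Deny`, `NotAction`, and conditions, and uses `fnmatch` to imitate IAM's `*` and `?` wildcards (a pattern containing `[` would be treated differently from IAM). `missing_log_actions` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Actions the function's role needs in order to write log events.
REQUIRED_ACTIONS = ("logs:CreateLogStream", "logs:PutLogEvents")


def _as_list(value):
    """Policy fields may hold a single item or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def _matches(pattern, value, ignore_case=False):
    """Approximate IAM wildcard matching with fnmatch-style patterns."""
    if ignore_case:
        pattern, value = pattern.lower(), value.lower()
    return fnmatchcase(value, pattern)


def missing_log_actions(policy, log_stream_arn):
    """Return required actions that no Allow statement grants on log_stream_arn."""
    missing = []
    for action in REQUIRED_ACTIONS:
        allowed = any(
            stmt.get("Effect") == "Allow"
            and any(
                _matches(a, action, ignore_case=True)
                for a in _as_list(stmt.get("Action", []))
            )
            and any(
                _matches(r, log_stream_arn)
                for r in _as_list(stmt.get("Resource", []))
            )
            for stmt in _as_list(policy["Statement"])
        )
        if not allowed:
            missing.append(action)
    return missing
```

Passing the policy JSON (loaded as a dict) and an ARN such as `arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/my-fn:log-stream:demo` (placeholder account, region, and function name) returns the actions that would be denied. A non-empty list points to the kind of log-group ARN mismatch described above.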

haseebr commented 1 year ago

> Try Querying the log insights with "Task timed out" Phrase
>
> fields @timestamp, @message | filter @message like "Task timed out" | sort @timestamp asc | limit 20

thank you! this has saved me a lot of head banging.

bassjompi commented 1 year ago

> Try Querying the log insights with "Task timed out" Phrase
>
> fields @timestamp, @message | filter @message like "Task timed out" | sort @timestamp asc | limit 20

saved my life! thanks so much

7imo commented 6 months ago

I have the same issue when I want to trigger my Lambda with a file upload to S3. I don't see any invocation but the error rate increases. I don't see any task timed out logs either, for me there are no logs at all. It happens only in one environment - the exact same function in another environment works fine.