CloudSnorkel / CloudWatch2S3

Logging infrastructure for exporting all CloudWatch logs from multiple accounts to a single S3 bucket
MIT License

Lambda.MissingRecordId #3

Open rsw31106 opened 4 years ago

rsw31106 commented 4 years ago

Hi, I'm getting the error "errorCode":"Lambda.MissingRecordId","errorMessage":"One or more record Ids were not returned. Ensure that the Lambda function returns all received record Ids." from LogProcessorFunction.

Most of the logs are delivered to S3, but some hit this error. How can I solve this?

kichik commented 4 years ago

Is it possible the result is over the 6mb Lambda limit?

https://github.com/CloudSnorkel/CloudWatch2S3/blob/d1b23d6c4509e89c91f5c0c67126643761fe4846/CloudWatch2S3.template#L513-L521

Do the records ever come back in the next chunk or are they completely lost?
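For context, a Firehose transformation Lambda has to echo back every recordId it receives, and the whole response is capped at 6 MB; omitting a record from the response is what produces Lambda.MissingRecordId. A minimal sketch of the expected handler shape (illustrative only, not the template's actual code):

```python
import base64

def handler(event, context):
    output = []
    for record in event["records"]:
        data = base64.b64decode(record["data"])
        # ... decode/transform the CloudWatch payload here ...
        output.append({
            "recordId": record["recordId"],  # every received recordId must come back
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(data).decode("ascii"),
        })
    return {"records": output}               # total response must stay under 6 MB
```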

rsw31106 commented 4 years ago

Thank you for your reply. The records are not over 6 MB. I'm not sure whether all of the errored records are completely lost, but when I cross-checked against another DB, I could confirm some records are completely lost. If the data in S3 is about 70 MB, the errored data is approximately 1 MB.

kichik commented 4 years ago

Would you be able to add some logging at line 521 so we can confirm it's not this? I can't find any other code path that would cause this.

rsw31106 commented 4 years ago

Yes, you are right, I'm sorry: some records are over 6 MB. How can I handle this so it stays under 6 MB? The processing Lambda buffer size is already 1 MB and the timeout is 60 seconds.

kichik commented 4 years ago

Ok, good. At least there is no mystery bug :) I should add a log line there anyway so it's clearer next time.

The easiest solution is to set the log format to CloudWatch JSON (GZIP); then the Lambda is not used at all.

If that doesn't work, the solution depends on your data. Is a single log line over 6mb or just a chunk of them together?

If it's a bunch of them together, you can reduce ProcessorBufferSizeHint and/or ProcessorBufferIntervalHint.

If a single line item is over 6mb, I'm not sure what can be done. If that's the case, I'll have to think about it.
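If you do change these, a plain stack parameter update is enough. A rough sketch with boto3, assuming the stack is named CloudWatch2S3 and reusing the deployed template (the parameter keys are the ones mentioned in this thread; any parameter you don't list should be passed with UsePreviousValue so it keeps its current value):

```python
import boto3

cfn = boto3.client("cloudformation")
cfn.update_stack(
    StackName="CloudWatch2S3",  # placeholder; use your actual stack name
    UsePreviousTemplate=True,
    Parameters=[
        {"ParameterKey": "ProcessorBufferSizeHint", "ParameterValue": "1"},       # MB
        {"ParameterKey": "ProcessorBufferIntervalHint", "ParameterValue": "60"},  # seconds
        # list every other parameter with UsePreviousValue so it keeps its value, e.g.:
        # {"ParameterKey": "LogFormat", "UsePreviousValue": True},
    ],
    Capabilities=["CAPABILITY_IAM"],
)
```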

rsw31106 commented 4 years ago

Thank you for your reply. I need the decoded logs in S3, so the Lambda has to be called, I guess. ProcessorBufferSizeHint and ProcessorBufferIntervalHint are already at their minimum values (1 MB, 60 seconds). Could increasing the shard count solve this? Or, if I add a Kinesis trigger to LogProcessorFunction and set the trigger's batch size option to 50 (default 100), would that work?

kichik commented 4 years ago

Reducing the batch size might help, assuming there is no single record that's too big on its own. I don't think shard count will make a difference.

I added some more logging around this to the missing_records_3 branch. Can you try it out and let me know what shows up in the processor function logs? If it's the second one, maybe we can requeue records. If it's the first, maybe we can break records down and requeue the rest.
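To make the two cases concrete, the distinction being logged is roughly this (an illustration of the idea, not the branch's actual code):

```python
MAX_OUTPUT = 6 * 1024 * 1024  # Firehose caps the transformation response at 6 MB

def overflow_reason(encoded_size, running_total):
    """Tell apart the two cases mentioned above for a record about to be
    added to the response."""
    if encoded_size > MAX_OUTPUT:
        return "single record over 6MB"        # first case: too big even on its own
    if running_total + encoded_size > MAX_OUTPUT:
        return "combined output over 6MB"      # second case: only the chunk together is too big
    return None  # fits fine
```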

rsw31106 commented 4 years ago

I tried your logging, but a logGroup key error occurred, so I changed the messages to "Skipping single record that's over 6MB" and "Skipping record as output is over 6MB". The result was the second one, "Skipping record as output is over 6MB". How can we requeue records?

kichik commented 4 years ago

Thanks. We need to give the processor function permission to add records to the stream and then resend them if they're too big for the output. I'll give it a shot after the holidays.

kichik commented 4 years ago

On second thought, manually requeueing records might result in out-of-order logs. So instead I tried explicitly setting the retry count and properly reporting records as dropped or failed. Same branch. Can you give it another shot?
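The idea is to stop omitting oversized records from the response (the omission is what causes Lambda.MissingRecordId) and instead echo them back with an explicit status. A hedged sketch of what reporting a failed record looks like:

```python
def mark_failed(record):
    """Return the record marked ProcessingFailed instead of dropping it from
    the response; Firehose then writes the original payload to its error
    output rather than reporting Lambda.MissingRecordId."""
    return {
        "recordId": record["recordId"],
        "result": "ProcessingFailed",
        "data": record["data"],  # pass the original base64 payload through untouched
    }
```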

rsw31106 commented 4 years ago

I changed LogProcessorFunction to match yours, but I could not add NumberOfRetries in the template. I think I would have to change the template itself in CloudFormation, but I'm afraid to do that. Is there any other way? I've applied the change and will let you know tomorrow whether MissingRecordId is solved.

kichik commented 4 years ago

You can update the stack and ask to see the changes it's going to make before applying them. The documentation states that a replacement is not required for a NumberOfRetries change. If you want to manually update the parameter, it should be available under the delivery stream in the UI.
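A change set is the easiest way to preview that update before committing to it. A sketch with boto3 (stack name, change set name, and template path are placeholders):

```python
import boto3

cfn = boto3.client("cloudformation")
stack, change_set = "CloudWatch2S3", "number-of-retries"

cfn.create_change_set(
    StackName=stack,
    ChangeSetName=change_set,
    TemplateBody=open("CloudWatch2S3.template").read(),  # the modified template
    Capabilities=["CAPABILITY_IAM"],
    # carry over existing parameters with UsePreviousValue entries as needed
)
cfn.get_waiter("change_set_create_complete").wait(StackName=stack, ChangeSetName=change_set)

for change in cfn.describe_change_set(StackName=stack, ChangeSetName=change_set)["Changes"]:
    rc = change["ResourceChange"]
    print(rc["LogicalResourceId"], rc["Action"], rc.get("Replacement"))

# only apply once the listed changes look right:
# cfn.execute_change_set(StackName=stack, ChangeSetName=change_set)
```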

rsw31106 commented 4 years ago

Hi, I tried it. There is no more MissingRecordId, but I now get the "Failing record as output is over 6MB" log, and I checked that these failed logs end up in the S3 processing-failed folder as raw data.

kichik commented 4 years ago

Progress! 😄 Were you able to set retries? Or did that not help?

rsw31106 commented 4 years ago

No, I could not set the retries. It needs a manual update, so I tried to find it in the UI but couldn't. I did find an asynchronous invocation retry count in the Lambda UI, which is set to 2, but I don't know whether that is the right value.

kichik commented 4 years ago

It's an option of Firehose, not Lambda. But either way it doesn't seem to be available in the UI. You can try using aws firehose update-destination. You would probably have to dump the old JSON configuration and then pass back the modified one.
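Roughly: dump the current destination with describe-delivery-stream, then pass a modified processing configuration back. A boto3 sketch (the stream name and Lambda ARN are placeholders, and any other processor parameters from the dumped config should be carried over):

```python
import boto3

firehose = boto3.client("firehose")
name = "CloudWatch2S3-DeliveryStream"  # placeholder; use the stream the stack created

desc = firehose.describe_delivery_stream(DeliveryStreamName=name)["DeliveryStreamDescription"]

firehose.update_destination(
    DeliveryStreamName=name,
    CurrentDeliveryStreamVersionId=desc["VersionId"],
    DestinationId=desc["Destinations"][0]["DestinationId"],
    ExtendedS3DestinationUpdate={
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "Lambda",
                "Parameters": [
                    # keep the existing LambdaArn (and anything else) from the dump
                    {"ParameterName": "LambdaArn", "ParameterValue": "<processor lambda ARN>"},
                    {"ParameterName": "NumberOfRetries", "ParameterValue": "3"},
                ],
            }],
        },
    },
)
```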