iDVB opened this issue 5 years ago
So I take it deploying this monitor to all regions that could have edge function executions doesn't work (yet)? Targeting region functions log groups shouldn't be hard.
Why do you want to disable monitoring for those functions? Is it more broken than simply not working?
What issues are you seeing? As I said, this service works great for us, but we are not currently using edge lambda functions.
No, as is, it seems to be fully broken. I think the Dummy Custom Resources that initially call those subscribe functions are failing, causing the deploy to fail and roll back. This may be because it's trying to locate a replica's associated log group and failing?
Deploying to all regions might work but is a bit overkill in my mind, especially since the regions that are “likely to have edge functions” are all of them. That’s how they work: they get replicated to all edge locations and sit there, ready anytime an end user visits from that area. So it would be impossible to predict.
I'm going to dive into the code a bit more and see if I can see exactly what the issue might be. But let me know if you have any ideas. I'm very happy to test them.
I'd love to see what the CloudFormation template / resource lambda function logs look like (you need to be quick to find them before everything is torn down).
Also I'm sure @macmcirish has some input here in what would be interesting to look at!
AFK for 1hr, then will look for ya. Also, I think I can alter these refs to logs and have it check for replica log groups across regions. https://github.com/blackflux/lambda-monitor/blob/04e8d5ec00c5085388f8223d398b7acd5bfe6966/src/logic/util/lambda.js#L28
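The idea above could be sketched roughly as follows. This is a hypothetical illustration, not the project's actual code: the region list and the replica log-group naming rule (replicas log under `/aws/lambda/us-east-1.<functionName>` in whatever region served the request) are assumptions drawn from this thread.

```javascript
// Assumed subset of regions to check; in reality edge logs can land in
// any region that serves traffic.
const EDGE_REGIONS = ['us-east-1', 'us-west-2', 'eu-west-1', 'ap-northeast-1'];

function replicaLogGroupName(functionName, masterRegion = 'us-east-1') {
  // Replica log groups reuse the same region-prefixed name in every region.
  return `/aws/lambda/${masterRegion}.${functionName}`;
}

function replicaLogGroupTargets(functionName) {
  // One (region, logGroupName) pair per region that would need checking.
  return EDGE_REGIONS.map((region) => ({
    region,
    logGroupName: replicaLogGroupName(functionName)
  }));
}

console.log(replicaLogGroupName('demo-stack-proxy-dev-originRequest'));
// → /aws/lambda/us-east-1.demo-stack-proxy-dev-originRequest
```

A real implementation would then call CloudWatch Logs in each of those regions to see which replica log groups actually exist.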
@iDVB Failing on the post deploy hook via a dummy resource is a frustrating issue that I plan to make much easier to understand :) up until now, I was the only one suffering from this, so I'm sorry to put you through my pain. Fortunately this is only an issue on the very first deploy of the stack, as when the initial deploy fails the teardown of the stack also deletes the log group.
To examine the log groups before the teardown happens, watch for
/aws/lambda/lambda-monitor-${YOUR_STAGE_HERE}-subscribe
and /aws/lambda/lambda-monitor-${YOUR_STAGE_HERE}-set-retention
to appear, and open the log groups immediately. Once you have this info, it'll be a lot easier to track down the reason that this is failing to stabilise on deploy. Looking forward to hearing back from you!
Thanks! Actually, it seems to successfully send the error to Rollbar... so I CAN see that it's due to
ResourceNotFoundException: The specified log group does not exist.
which led me to hypothesize that it's not able to find the replica log groups.
Testing your steps to catch the logs though... OUTPUT:
START RequestId: NOTTELLING Version: $LATEST
2018-10-26T17:36:58.132Z NOTTELLING Response body:
{
"Status": "FAILED",
"Reason": "See the details in CloudWatch Log Stream: 2018/10/26/[$LATEST]NOTTELLING",
"PhysicalResourceId": "2018/10/26/[$LATEST]NOTTELLING",
"StackId": "arn:aws:cloudformation:us-east-1:NOTTELLING:stack/lambda-monitor-dev/NOTTELLING",
"RequestId": "NOTTELLING",
"LogicalResourceId": "TriggerLambdaSubscribeResource",
"Data": {}
}
2018-10-26T17:36:58.370Z NOTTELLING Status code: 200
2018-10-26T17:36:58.370Z NOTTELLING Status message: OK
2018-10-26T17:36:58.607Z NOTTELLING
{
"errorMessage": "The specified log group does not exist.",
"errorType": "ResourceNotFoundException",
"stackTrace": [
"Request.extractError (/var/task/node_modules/aws-sdk/lib/protocol/json.js:48:27)",
"Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:106:20)",
"Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
"Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
"Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
"AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
"/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
"Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
"Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
"Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
"Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
"Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
"Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
"AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
"/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
"Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
"Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
"Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
"callNextListener (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:96:12)",
"IncomingMessage.onEnd (/var/task/node_modules/aws-sdk/lib/event_listeners.js:294:13)",
"emitNone (events.js:111:20)",
"IncomingMessage.emit (events.js:208:7)"
]
}
END RequestId: NOTTELLING
REPORT RequestId: NOTTELLING Duration: 2351.20 ms Billed Duration: 2400 ms Memory Size: 256 MB Max Memory Used: 57 MB
START RequestId: NOTTELLING Version: $LATEST
2018-10-26T17:37:13.071Z NOTTELLING Response body:
{
"Status": "FAILED",
"Reason": "See the details in CloudWatch Log Stream: 2018/10/26/[$LATEST]NOTTELLING",
"PhysicalResourceId": "2018/10/26/[$LATEST]NOTTELLING",
"StackId": "arn:aws:cloudformation:us-east-1:NOTTELLING:stack/lambda-monitor-dev/NOTTELLING",
"RequestId": "NOTTELLING",
"LogicalResourceId": "TriggerLambdaSubscribeResource",
"Data": {}
}
2018-10-26T17:37:13.529Z NOTTELLING Status code: 200
2018-10-26T17:37:13.529Z NOTTELLING Status message: OK
2018-10-26T17:37:13.711Z NOTTELLING
{
"errorMessage": "The specified log group does not exist.",
"errorType": "ResourceNotFoundException",
"stackTrace": [
"Request.extractError (/var/task/node_modules/aws-sdk/lib/protocol/json.js:48:27)",
"Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:106:20)",
"Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
"Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
"Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
"AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
"/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
"Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
"Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
"Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
"Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
"Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
"Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
"AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
"/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
"Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
"Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
"Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
"callNextListener (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:96:12)",
"IncomingMessage.onEnd (/var/task/node_modules/aws-sdk/lib/event_listeners.js:294:13)",
"emitNone (events.js:111:20)",
"IncomingMessage.emit (events.js:208:7)"
]
}
Looks like I can see pretty much the same thing in the Rollbar logs.
@iDVB Looks like this is exactly the problem that you anticipated. There are two things to do here:
(1) Fix the deploy, so if the log group doesn't exist, it shouldn't subscribe. It should also look for the us-east-1.*
log group and subscribe to that one if it exists.
(2) Make it so that CloudTrail-triggered subscription happens for log groups that are prefixed with the region.
Note: At that point (2) it might be desirable to also prevent subscription if already subscribed (that's https://github.com/blackflux/lambda-monitor/issues/361), but that could also happen later.
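Fix (1) could look roughly like this sketch. The `logGroupExists` checker is injected here as a stub so the logic is testable; in real code it would wrap a CloudWatch Logs describeLogGroups call. All names are illustrative.

```javascript
// Decide which log group (if any) to subscribe to for a given function.
// Returns null instead of throwing when nothing exists, so a missing log
// group no longer fails the whole deploy.
async function resolveLogGroup(functionName, region, logGroupExists) {
  const plain = `/aws/lambda/${functionName}`;
  if (await logGroupExists(plain)) {
    return plain;
  }
  // Edge replicas log under a region-prefixed name instead.
  const edge = `/aws/lambda/${region}.${functionName}`;
  if (await logGroupExists(edge)) {
    return edge;
  }
  // Neither exists: skip subscribing rather than erroring out.
  return null;
}

// Example with a stubbed existence check (no AWS calls):
const stubExists = async (name) => name === '/aws/lambda/us-east-1.demo-fn';
resolveLogGroup('demo-fn', 'us-east-1', stubExists)
  .then((lg) => console.log(lg)); // → /aws/lambda/us-east-1.demo-fn
```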
@MacMcIrish @simlu Any reason lambda.js
isn't using AWS.Request.promise().then()
rather than creating your own promises? https://aws.amazon.com/blogs/developer/support-for-promises-in-the-sdk/
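For reference, the refactor shape being suggested looks like this. The stub below only mimics the aws-sdk v2 request interface (every request object exposes a `.promise()` method) so the pattern can be shown without credentials; `fakeCloudWatchLogs` is a stand-in, not the real client.

```javascript
// Before (hand-rolled promise around a callback-style SDK call):
// new Promise((resolve, reject) => {
//   cloudwatchlogs.describeLogGroups(params, (err, data) => {
//     if (err) reject(err); else resolve(data);
//   });
// });
//
// After (aws-sdk v2 built-in promise support):
// cloudwatchlogs.describeLogGroups(params).promise().then(...);

// Stub demonstrating the same call shape without AWS:
const fakeCloudWatchLogs = {
  describeLogGroups(params) {
    return {
      promise: () => Promise.resolve({ logGroups: [], params })
    };
  }
};

fakeCloudWatchLogs
  .describeLogGroups({ logGroupNamePrefix: '/aws/lambda/' })
  .promise()
  .then((data) => console.log(data.logGroups.length));
```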
@iDVB Not really. That should probably be changed if the tests just continue working. I've already done that in other places last night.
Working on a PR for this now... https://github.com/blackflux/lambda-monitor/pull/371
Task outline:
(1) CloudTrail needs to be getting events from all regions. This is necessary so we can do actions based on edge lambda function log group creation.
(2) The subscribe and set-retention need to be updated to handle "CloudTrail log group events" and "non CloudTrail log group events" differently. For CloudTrail events, only the log group that triggered the invocation should be handled. Otherwise we need to do an "expensive" check-all call. This is necessary so we do not overload the system with requests; otherwise this could get expensive and we will run into throttling.
(3) Handle edge log group subscription correctly (when an edge log group gets created we need to check the master function for tags etc). This should be relatively easy, we just need to check if this is an edge log group and check the master function for tags.
I would start by figuring out what the lambda event for the CloudTrail log group creation event looks like (i.e. how much meta information is available).
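Until that event dump exists, the routing in (2) could be sketched like this. The event shape assumed here is the CloudWatch Events "AWS API Call via CloudTrail" envelope; the exact field names are assumptions to be verified against a real captured event.

```javascript
// If the invocation came from a CloudTrail CreateLogGroup event, return the
// single log group to handle; otherwise return null, signalling that the
// caller should fall back to the "expensive" check-all scan.
function extractCreatedLogGroup(event) {
  const isCloudTrail = event
    && event['detail-type'] === 'AWS API Call via CloudTrail'
    && event.detail
    && event.detail.eventName === 'CreateLogGroup';
  if (!isCloudTrail) {
    return null;
  }
  return event.detail.requestParameters.logGroupName;
}

console.log(extractCreatedLogGroup({
  'detail-type': 'AWS API Call via CloudTrail',
  detail: {
    eventName: 'CreateLogGroup',
    requestParameters: { logGroupName: '/aws/lambda/us-east-1.demo-fn' }
  }
}));
// → /aws/lambda/us-east-1.demo-fn
```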
Going to test to see what we can see in the CT event by letting it trigger a function with
console.log(JSON.stringify(event, null, 2))
in it.
I'm still in the process of testing out what exactly the issue might be, but I have a good hunch that the service does not know how to deal with edge function logs.
Likely this is because they're located in the region that the end user was closest to (the edge location). This means that when an end user makes a request near Oregon, the logs for that lambda will end up in the Oregon region.
Additionally, the log prefix is different as well, in that it's:
/aws/lambda/us-east-1.demo-stack-proxy-dev-originRequest
vs the traditional /aws/lambda/demo-stack-proxy-dev-originRequest
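That naming difference can be detected mechanically. The sketch below is illustrative only: the regex assumes AWS region naming like us-east-1 or ap-northeast-1, and the function names are from the examples above.

```javascript
// Match "/aws/lambda/<region>.<functionName>" where <region> looks like
// an AWS region identifier (letters separated by dashes, ending in a digit).
const EDGE_LOG_GROUP = /^\/aws\/lambda\/([a-z]{2}(?:-[a-z]+)+-\d)\.(.+)$/;

// Returns { masterRegion, functionName } for an edge-replica log group,
// or null for a regular log group.
function parseEdgeLogGroup(logGroupName) {
  const match = EDGE_LOG_GROUP.exec(logGroupName);
  if (match === null) {
    return null;
  }
  return { masterRegion: match[1], functionName: match[2] };
}

console.log(parseEdgeLogGroup('/aws/lambda/us-east-1.demo-stack-proxy-dev-originRequest'));
// → { masterRegion: 'us-east-1', functionName: 'demo-stack-proxy-dev-originRequest' }
```

Recovering the master function name this way is what would let the subscribe handler look up tags on the master function, as described in (3) above.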
I thought I might be able to shim the issue by running around and putting
MONITORED=0
on all the replica functions... but as it appears (from the console anyway) you can't tag replica functions. I'm not sure how you can even programmatically detect if they are replica functions... looking into that now.
UPDATE: You can't seem to see the function when you run
aws lambda list-functions
but you can see it when you run aws lambda get-function
directly. However, there does not seem to be any replica-identifying trait other than that all replica function names start with us-east-1.
regardless of what region a developer usually uses. Edge function masters always get created in us-east-1.
I'm going to start looking into your source to see if I can just add a check for
us-east-1.
as a prefix in the function name and exclude it for now. Later, maybe there is even a way to have these log by adding subscription filters to the region-related log groups.
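The interim exclusion described above amounts to a one-line filter. A minimal sketch, assuming replicas are identifiable purely by the us-east-1. name prefix (the only trait found so far); function names are illustrative.

```javascript
// A replica's only observable marker is its region-prefixed name.
function isEdgeReplica(functionName) {
  return functionName.startsWith('us-east-1.');
}

// Drop replicas from the set of functions the monitor would subscribe to.
function functionsToMonitor(functionNames) {
  return functionNames.filter((name) => !isEdgeReplica(name));
}

console.log(functionsToMonitor([
  'demo-stack-proxy-dev-originRequest',
  'us-east-1.demo-stack-proxy-dev-originRequest'
]));
// → [ 'demo-stack-proxy-dev-originRequest' ]
```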