blackflux / lambda-monitor

Lambda log monitoring and streaming to external services.
MIT License

Replica Function support (Lambda@Edge) #364

Open · iDVB opened 5 years ago

iDVB commented 5 years ago

I'm still testing to pin down exactly what the issue is, but I have a strong hunch that the service does not know how to deal with edge function logs.

Likely this is because the logs are written in the region closest to the end user (the edge location). This means that when an end user makes a request near Oregon, the logs for that Lambda end up in the Oregon region.

Additionally, the log group prefix is different as well, in that it's /aws/lambda/us-east-1.demo-stack-proxy-dev-originRequest vs. the traditional /aws/lambda/demo-stack-proxy-dev-originRequest.

I thought I might be able to work around the issue by putting MONITORED=0 on all the replica functions... but it appears (from the console, anyway) that you can't tag replica functions.

I'm not sure how you can even programmatically detect whether a function is a replica... looking into that now.

UPDATE: The function does not show up in aws lambda list-functions, but you can see it when you call aws lambda get-function directly. However, there does not seem to be any replica-identifying trait other than that all replica function names start with us-east-1., regardless of which region a developer usually uses. Edge function masters always get created in us-east-1.

I'm going to start looking into your source to see if I can just add a check for us-east-1. as a prefix in the function name and exclude those functions for now. Later there may even be a way to monitor these by adding subscription filters to the region-local log groups.
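Something like this is what I have in mind (a minimal, untested sketch; the helper names are mine, not lambda-monitor's actual internals):

```js
// Sketch only: skip Lambda@Edge replicas when deciding what to monitor.
// Replica names embed the master's region as a prefix, e.g.
// "us-east-1.demo-stack-proxy-dev-originRequest", and edge masters are always
// created in us-east-1, so a prefix check should be enough for now.
const isEdgeReplica = (functionName) => functionName.startsWith('us-east-1.');

const filterMonitorable = (functions) => functions
  .filter((fn) => !isEdgeReplica(fn.FunctionName));

// filterMonitorable([
//   { FunctionName: 'us-east-1.demo-stack-proxy-dev-originRequest' },
//   { FunctionName: 'demo-stack-proxy-dev-originRequest' }
// ]) => [{ FunctionName: 'demo-stack-proxy-dev-originRequest' }]
```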

simlu commented 5 years ago

So I take it that deploying this monitor to all regions that could have edge function executions doesn't work (yet)? Targeting regional function log groups shouldn't be hard.

Why do you want to disable monitoring for those functions? Is it more broken than simply not working?

What issues are you seeing? As I said, this service works great for us, but we are not currently using edge Lambda functions.

iDVB commented 5 years ago

No, as is, it seems to be fully broken: I think the dummy custom resource that initially calls those subscribe functions is failing, causing the deploy to fail and roll back. Maybe this has to do with it trying to locate a replica's associated log group and failing?

Deploying to all regions might work, but it's overkill in my mind, especially since the set of regions "likely to have edge functions" is all of them. That's how they work: they get replicated to all edge locations and sit there, ready whenever an end user visits from that area. So it would be impossible to predict.

I'm going to dive into the code a bit more and see if I can see exactly what the issue might be. But let me know if you have any ideas. I'm very happy to test them.

simlu commented 5 years ago

I'd love to see what the CloudFormation template / resource Lambda function logs look like (you need to be quick to find them before everything is torn down).

Also, I'm sure @macmcirish has some input here on what would be interesting to look at!

iDVB commented 5 years ago

AFK for 1hr, then will look for ya. Also, I think I can alter these log references and have it check for replica log groups across regions (rough sketch below). https://github.com/blackflux/lambda-monitor/blob/04e8d5ec00c5085388f8223d398b7acd5bfe6966/src/logic/util/lambda.js#L28
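Roughly what I'm picturing (untested sketch; the helper name and region list are made up, and it assumes aws-sdk v2 like the rest of the codebase):

```js
// Untested sketch: Lambda@Edge writes logs in whichever region served the
// request, so fan out and look for the replica's log groups in every region.
const AWS = require('aws-sdk');

const findReplicaLogGroups = async (masterFunctionName, regions) => {
  const prefix = `/aws/lambda/us-east-1.${masterFunctionName}`;
  const perRegion = await Promise.all(regions.map(async (region) => {
    const logs = new AWS.CloudWatchLogs({ region });
    const resp = await logs
      .describeLogGroups({ logGroupNamePrefix: prefix })
      .promise();
    return resp.logGroups.map((lg) => ({ region, logGroupName: lg.logGroupName }));
  }));
  return perRegion.reduce((acc, groups) => acc.concat(groups), []);
};

// e.g. findReplicaLogGroups('demo-stack-proxy-dev-originRequest', ['us-west-2', 'eu-west-1'])
```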

MacMcIrish commented 5 years ago

@iDVB Failing on the post-deploy hook via a dummy resource is a frustrating issue that I plan to make much easier to understand :) Up until now I was the only one suffering from this, so I'm sorry to put you through my pain. Fortunately this is only an issue on the very first deploy of the stack, since when the initial deploy fails, the teardown of the stack also deletes the log group.

To examine the log group before the teardown happens:

  1. Open the Cloudwatch Management Console in the targeted region
  2. Deploy the stack
  3. As the stack is deploying, refresh the page until /aws/lambda/lambda-monitor-${YOUR_STAGE_HERE}-subscribe and /aws/lambda/lambda-monitor-${YOUR_STAGE_HERE}-set-retention appear. Open the log groups immediately.
  4. When logs arrive in these log groups, you can find the error output explaining why the custom resource failed to deploy.

Once you have this info, it'll be a lot easier to track down the reason that this is failing to stabilise on deploy. Looking forward to hearing back from you!
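If racing the console is a pain, a small script can also pull the error output directly (a sketch, assuming aws-sdk v2 and the log group names from step 3):

```js
// Sketch: fetch error output from the subscribe function's log group while the
// stack deploys (run it repeatedly until events show up).
const AWS = require('aws-sdk');
const logs = new AWS.CloudWatchLogs({ region: 'us-east-1' }); // the targeted region

const fetchRecentErrors = async (stage) => {
  const resp = await logs.filterLogEvents({
    logGroupName: `/aws/lambda/lambda-monitor-${stage}-subscribe`,
    filterPattern: '?errorMessage ?FAILED' // match either term
  }).promise();
  resp.events.forEach((e) => console.log(e.message));
};

fetchRecentErrors('dev').catch((err) => console.error(err.message));
```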

iDVB commented 5 years ago

Thanks! Actually, it seems to successfully send the error to Rollbar... so I CAN see that it's due to

ResourceNotFoundException: The specified log group does not exist.

which led me to hypothesize that it can't find the replica log groups.

Testing your steps to catch the logs though... OUTPUT:

START RequestId: NOTTELLING Version: $LATEST
2018-10-26T17:36:58.132Z    NOTTELLING  Response body:
{
    "Status": "FAILED",
    "Reason": "See the details in CloudWatch Log Stream: 2018/10/26/[$LATEST]NOTTELLING",
    "PhysicalResourceId": "2018/10/26/[$LATEST]NOTTELLING",
    "StackId": "arn:aws:cloudformation:us-east-1:NOTTELLING:stack/lambda-monitor-dev/NOTTELLING",
    "RequestId": "NOTTELLING",
    "LogicalResourceId": "TriggerLambdaSubscribeResource",
    "Data": {}
}

2018-10-26T17:36:58.370Z    NOTTELLING  Status code: 200
2018-10-26T17:36:58.370Z    NOTTELLING  Status message: OK
2018-10-26T17:36:58.607Z    NOTTELLING
{
    "errorMessage": "The specified log group does not exist.",
    "errorType": "ResourceNotFoundException",
    "stackTrace": [
        "Request.extractError (/var/task/node_modules/aws-sdk/lib/protocol/json.js:48:27)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:106:20)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
        "Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
        "AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
        "/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
        "Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
        "AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
        "/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
        "callNextListener (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:96:12)",
        "IncomingMessage.onEnd (/var/task/node_modules/aws-sdk/lib/event_listeners.js:294:13)",
        "emitNone (events.js:111:20)",
        "IncomingMessage.emit (events.js:208:7)"
    ]
}

END RequestId: NOTTELLING
REPORT RequestId: NOTTELLING    Duration: 2351.20 ms    Billed Duration: 2400 ms Memory Size: 256 MB    Max Memory Used: 57 MB  
START RequestId: NOTTELLING Version: $LATEST
2018-10-26T17:37:13.071Z    NOTTELLING  Response body:
{
    "Status": "FAILED",
    "Reason": "See the details in CloudWatch Log Stream: 2018/10/26/[$LATEST]NOTTELLING",
    "PhysicalResourceId": "2018/10/26/[$LATEST]NOTTELLING",
    "StackId": "arn:aws:cloudformation:us-east-1:NOTTELLING:stack/lambda-monitor-dev/NOTTELLING",
    "RequestId": "NOTTELLING",
    "LogicalResourceId": "TriggerLambdaSubscribeResource",
    "Data": {}
}

2018-10-26T17:37:13.529Z    NOTTELLING  Status code: 200
2018-10-26T17:37:13.529Z    NOTTELLING  Status message: OK
2018-10-26T17:37:13.711Z    NOTTELLING
{
    "errorMessage": "The specified log group does not exist.",
    "errorType": "ResourceNotFoundException",
    "stackTrace": [
        "Request.extractError (/var/task/node_modules/aws-sdk/lib/protocol/json.js:48:27)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:106:20)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
        "Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
        "AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
        "/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
        "Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
        "Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
        "AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
        "/var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
        "Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)",
        "Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:116:18)",
        "callNextListener (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:96:12)",
        "IncomingMessage.onEnd (/var/task/node_modules/aws-sdk/lib/event_listeners.js:294:13)",
        "emitNone (events.js:111:20)",
        "IncomingMessage.emit (events.js:208:7)"
    ]
}

iDVB commented 5 years ago

Looks like I can see pretty much the same thing in the Rollbar logs.

simlu commented 5 years ago

@iDVB Looks like this is exactly the problem you anticipated. There are two things to do here: (1) Fix the deploy so that if the log group doesn't exist, it doesn't subscribe; it should also look for the us-east-1.* log group and subscribe to that one if it exists. (2) Make it so that CloudTrail-triggered subscription happens for log groups that are prefixed with a region.

Note: For point (2) it might be desirable to also prevent subscribing when already subscribed (that's https://github.com/blackflux/lambda-monitor/issues/361), but that could also happen later.
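For (1), roughly this (a sketch only; the helper names are illustrative, not the actual lambda-monitor code):

```js
// Sketch for (1): tolerate a missing log group and fall back to the
// region-prefixed (edge replica) variant instead of failing the deploy.
const AWS = require('aws-sdk');
const logs = new AWS.CloudWatchLogs();

const subscribeIfExists = async (logGroupName, destinationArn) => {
  try {
    await logs.putSubscriptionFilter({
      logGroupName,
      filterName: 'lambda-monitor',
      filterPattern: '', // empty pattern matches all events
      destinationArn
    }).promise();
    return true;
  } catch (err) {
    if (err.code !== 'ResourceNotFoundException') {
      throw err;
    }
    return false; // log group doesn't exist (yet): skip, don't fail
  }
};

const subscribeFunction = async (functionName, destinationArn) => {
  // Try the traditional log group first, then the us-east-1.* edge variant.
  const done = await subscribeIfExists(`/aws/lambda/${functionName}`, destinationArn);
  if (!done) {
    await subscribeIfExists(`/aws/lambda/us-east-1.${functionName}`, destinationArn);
  }
};
```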

iDVB commented 5 years ago

@MacMcIrish @simlu Any reason lambda.js isn't using AWS.Request.promise().then() rather than creating your own promises? https://aws.amazon.com/blogs/developer/support-for-promises-in-the-sdk/
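For illustration, the pattern from that post (the listTags call here is just an example, not an actual lambda-monitor callsite):

```js
// Instead of hand-wrapping the aws-sdk v2 callback API in a promise...
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

const listTagsWrapped = (arn) => new Promise((resolve, reject) => {
  lambda.listTags({ Resource: arn }, (err, data) => {
    if (err) reject(err);
    else resolve(data);
  });
});

// ...the SDK can hand back a promise directly:
const listTags = (arn) => lambda.listTags({ Resource: arn }).promise();
```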

simlu commented 5 years ago

@iDVB Not really. That should probably be changed, provided the tests keep passing. I already did that in other places last night.

iDVB commented 5 years ago

Working on a PR for this now... https://github.com/blackflux/lambda-monitor/pull/371

simlu commented 5 years ago

Task outline:

(1) CloudTrail needs to receive events from all regions. This is necessary so we can act on edge Lambda function log group creation.

(2) The subscribe and set-retention functions need to be updated to handle "CloudTrail log group events" and "non-CloudTrail log group events" differently. For CloudTrail events, only the log group that triggered the invocation should be handled; otherwise we need to do an "expensive" check-all call. This is necessary so we don't overload the system with requests; otherwise this could get expensive and we'd run into throttling.

(3) Handle edge log group subscription correctly (when an edge log group gets created, we need to check the master function for tags etc.). This should be relatively easy: we just need to detect that a log group is an edge log group and check the master function for tags.


I would start by figuring out what the Lambda event for the CloudTrail log group creation looks like (i.e. how much meta information is available); a rough sketch of the branching follows.
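My guess at the handler branching for (2), with the CloudTrail event shape still to be verified (handleSingleLogGroup and handleAllLogGroups are placeholders):

```js
// Sketch: branch on whether the invocation came from a CloudTrail
// "CreateLogGroup" event delivered via a CloudWatch Events rule. The event
// shape below is an expectation that still needs verifying.
const handleSingleLogGroup = async (logGroupName) => { /* subscribe + set retention */ };
const handleAllLogGroups = async () => { /* the "expensive" check-all scan */ };

exports.handler = async (event) => {
  const logGroupName = event.detail
    && event.detail.eventName === 'CreateLogGroup'
    && event.detail.requestParameters
    && event.detail.requestParameters.logGroupName;
  if (logGroupName) {
    // CloudTrail-triggered: handle only the log group that was just created.
    await handleSingleLogGroup(logGroupName);
  } else {
    // Custom-resource / scheduled invocation: do the full scan.
    await handleAllLogGroups();
  }
};
```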

iDVB commented 5 years ago

Going to test what we can see in the CloudTrail event by letting it trigger a function with console.log(JSON.stringify(event, null, 2)) in it.
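i.e. a throwaway probe like:

```js
// Throwaway probe: dump the raw event so we can see what meta information
// the CloudTrail-triggered invocation carries.
exports.handler = async (event) => {
  console.log(JSON.stringify(event, null, 2));
  return {};
};
```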