DataDog / serverless-plugin-datadog

Serverless plugin to automagically instrument your Lambda functions with Datadog
Apache License 2.0
96 stars 49 forks source link

BROKEN INTEGRATION - Task timed out on AWS Lambda Serverless #226

Closed joaomirandas closed 2 years ago

joaomirandas commented 2 years ago

Expected Behavior

Receive logs from my AWS Lambda serverless into datadog, using log forwarder - those lambdas should use persistent-connections with databases like mongoDB or Redis, witch means, lambda returns response without close all event loop actions.

Actual Behavior

All my lambdas where i have persistent connection with database using context.callbackWaitsForEmptyEventLoop = false; was closed with timed-out behavior, processing steps was made but lambda never returns any response, exactly like context sentence above did not exists

Steps to Reproduce the Problem

  1. install log forwarder using this sentence:

    custom:
    controlDD:
      dev: false
      hml: true
      app: true
    datadog:
      enabled: ${self:custom.controlDD.${opt:stage, 'dev'}, false}
      enableDDTracing: true
      addLayers: true
      flushMetricsToLogs: true
      logLevel: "INFO"
      forwarder: arn:aws:lambda:sa-east-1:*************:function:DatadogServicesForwarder
  2. Once installed create some lambda setting some database persistent connection like mongoDB, to do this you will need create an connection with database out of handler and set this variable as your connection to act like an cache-connection. By default lambda not return any response if you have external connections open, you generally need to call close connection to receive any response in callback BUT if you set line below you can do this trick.

    module.exports.handler = async (event, context, callback) => {
       context.callbackWaitsForEmptyEventLoop = false; // <-- This line do that trick
       ///[...YourCode...]
    };

    Learn more about this here

  3. Call this endpoint with some opened external connection and check, you will never receive any response and lambda will close with TIMED-OUT.

Specifications

Stacktrace

   Task timed out after 10.01 seconds
joaomirandas commented 2 years ago

@agocs - please check-it! Your commit made my prodution enviroments not working in a middle of day. I sent you an invite over LinkedIn too, if you need special information to fix just let me know.

agocs commented 2 years ago

Good morning @joaomirandasa , thank you for opening this issue. I'm sorry it took down your production environment. Off the top of my head, Serverless plugin 3.7.0 comes with a new version of dd-trace-js and datadog-lambda-js, but the new datadog-lambda-js shouldn't change any behavior with MongoDB.

I'm glad that pinning the dependency worked for you.

We'll try to reproduce this issue next week. To verify, the steps are:

correct?

Which version and architecture of the Node runtime are you using? How much memory is configured?

joaomirandas commented 2 years ago

@agocs - Yes! Exactly this process!

I find another issue too, when i user sns as event source for lambdas with redrive-policy associated, all events received in sns topic was redirected to deadLetterQueue. In my point of view sounds like datadog-plugin is changing something in .yml file witch causing anomalies in all functions

      - sns:
          arn: !Ref eventTopic
          topicName: eventTopic-${self:provider.stage}
          filterPolicy:
            event:
              - event:named
          redrivePolicy:
            deadLetterTargetRef: dispatchMTMessageQueue
JNussenzveig commented 2 years ago

That's interesting, it also happened to me! Under the same circumstances. Currently @watching this issue.

astuyve commented 2 years ago

@joaomirandasa Let's try to keep issues separate, I've opened another issue for your latest comment here: https://github.com/DataDog/serverless-plugin-datadog/issues/228

@JNussenzveig are you also experiencing this reported timeout issue? Can you add the clarifying terms requested by Chris?

Perhaps it would be helpful to Chris for replication purposes to detail a bit more about your handler code. Where do you call the callback? What else is your handler doing before/after the call? Can you help us understand your usage of callbackWaitsForEmptyEventLoop?

Thank you!

fabiofontesx commented 2 years ago

@astuyve The same happens to me, with exactly same scenario reported by @joaomirandasa .

Serverless Framework version: 3.2.0 Datadog Serverless Plugin version: Version 3.7.0 Lambda function runtime (Python 3.7, Node 10, etc.): runtime: nodejs12.x Expert Details: using external and cached connection with mongoDB over Mongoose Starts to happens in version 3.7.0 - to my services keep working i freezed version at serverless-plugin-datadog@3.5.3

JNussenzveig commented 2 years ago

@astuyve

Serverless Framework version: 3.2.0 Datadog Serverless Plugin version: Version 3.7.0 Lambda function runtime (Python 3.9, Node 12, etc.): runtime: nodejs12.x Expert Details: using external connection with mongoDB over Mongoose

PROBLEMATIC: serverless-plugin-datadog@3.7.0 FROZEN: serverless-plugin-datadog@3.5.3

astuyve commented 2 years ago

Thanks! Since you (@fabiofontesx @JNussenzveig @joaomirandasa) seem to be using the exact same setup, I presume you're working together on the same app? Could someone help clarify the additional questions I've added? It would be helpful for just one person to reply, to prevent comment clutter.

Where do you call the callback? What else is your handler doing before/after the call? Can you help us understand your usage of callbackWaitsForEmptyEventLoop?

Thanks!

joaomirandas commented 2 years ago

Hi @astuyve,

Yes, every information that you need to go futher with this solution - Usage of callbackWaitsForEmptyEventLoop can be very clarified at AWS Documentation but i will try to explain our application scenarios here.

We have a lot of lambdas connecting with database in high request rates, something like 30req/seconds - it's not good open an new MongoDB connection to each of those requests, to handle with this we create a connection when lambda comes from cold-start to warmed. We do this using a code similar to below:

const mongoose = require('/opt/node_modules/mongoose');
var Schema = mongoose.Schema;  
var cachedDBS = {};

function connectToDatabase (_tenant,_mongoDB) {
  return new Promise(async (resolve) =>{
    console.log(JSON.stringify({state: "[CACHED-CONNECTIONS]", cachedConnection: Object.keys(cachedDBS) }));
    if (cachedDBS[_tenant]) {
      console.log(JSON.stringify({state: "[USING_CACHE_MONGO_CONN]"}));
      return resolve(cachedDBS[_tenant]);
    };
    console.log(JSON.stringify({state: "[CREATE_NEW_MONGO_CONN]"}));
    const conn_db = mongoose.createConnection(_mongoDB, {
      bufferCommands: false,
      bufferMaxEntries: 0,
      useNewUrlParser: true,
      useUnifiedTopology: true,
      useFindAndModify: false,
      useCreateIndex: true,
      poolSize: 1, 
      socketTimeoutMS: 2000000, 
      keepAlive: true
    });
    await conn_db;
    conn_db.model('DB_LOGS', new mongoose.Schema({
      eventAction: {type:String, require: true},
      eventID: {type:String, require: true}
    }, {collection: 'logs', minimize: false}));;
    cachedDBS[_tenant] = conn_db;
    return resolve(cachedDBS[_tenant]);
  });
};

module.exports.handler = async function(event, context, callback) {
  context.callbackWaitsForEmptyEventLoop = false;
  let $conn = await connectToDatabase('application_id_1', process.env.MONGODB_URI);

  // [LINES_OF_FUNCTION_LOGIC_HERE]
  // [LINES_OF_FUNCTION_LOGIC_HERE]
  // [LINES_OF_FUNCTION_LOGIC_HERE]

  return callback(null, {
      statusCode: 200,
      headers: { 
          "Access-Control-Allow-Origin" : "*", 
          "Access-Control-Allow-Credentials" : true
      },
      body: JSON.stringify({status: true})
    });
};

How this works?

Check the way we cache mongoDB connection in var cachedDBS = {}; to next request in this same lambda, in that case it's not necessary spend time connecting with database anymore, until this lambda goes cold again. But this behaviour is not default by lambdas, in default lambda never let you return callback until all external connections be closed, tho allow this behaviour is necessary add line context.callbackWaitsForEmptyEventLoop = false; - it's like say to lambda "hey, now you are allowed to return callback function with external connections still open"

About this issue

We ever used lambdas with datadog in that form, but in latest update for some reason we never receive callback, something like line context.callbackWaitsForEmptyEventLoop = false; dosn't exists. My opinion is by some reason function handler of datadog is overwriting that sentence or disabling that behaviour.


If you need more informations or some session to debug please get in touch over joao.miranda@timeprimo.com. We also get in touch of DataDog team in Brazil, maybe this request could be received internally in your side too.

astuyve commented 2 years ago

@joaomirandasa that's perfect, thanks so much!

Yes please open an issue with Datadog support so that we can communicate securely, as we may need to ask further questions. This likely isn't related to the plugin, which doesn't actually run in your handler, however it's possibly related to the datadog tracer. The only change between 3.5 and 3.7 is tracer version(s), which are included in the library layers set by this plugin.

Thanks again for your detailed report!

joaomirandas commented 2 years ago

@astuyve Yes! I already open that support ticket (#689712) - that's make sense, i checked version changes and saw those layers. Probably should be something related. I will keep tracking

agocs commented 2 years ago

What do your cloudwatch logs say when you see timeouts? I'm seeing the following message, but I'm seeing it in about 5% of invocations (both with and without serverless-plugin-datadog, and with serverles-plugin-datadog 3.5.3 and 3.7.0):

2022-02-14T20:47:12.840Z    252cf6a8-9ddd-477c-ab64-15e65731e577    ERROR   [dd.trace_id=4650720379527467306 dd.span_id=8058852911194760148] Unhandled Promise Rejection    {
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "MongooseServerSelectionError: Server selection timed out after 30000 ms",
    "reason": {
        "errorType": "MongooseServerSelectionError",
        "errorMessage": "Server selection timed out after 30000 ms",
        "message": "Server selection timed out after 30000 ms",
        "reason": {
            "type": "ReplicaSetNoPrimary",
            "servers": {},
            "stale": false,
            "compatible": true,
            "heartbeatFrequencyMS": 10000,
            "localThresholdMS": 15,
            "setName": "rs0"
        },
        "stack": [
            "MongooseServerSelectionError: Server selection timed out after 30000 ms",
            "    at NativeConnection.Connection.openUri (/var/task/node_modules/mongoose/lib/connection.js:802:32)",
            "    at Mongoose.createConnection (/var/task/node_modules/mongoose/lib/index.js:285:10)",
            "    at /var/task/handler.js:13:30",
            "    at new Promise (<anonymous>)",
            "    at connectToDatabase (/var/task/handler.js:6:10)",
            "    at module.exports.hello (/var/task/handler.js:34:21)",
            "    at /opt/nodejs/node_modules/datadog-lambda-js/utils/handler.js:156:25",
            "    at /opt/nodejs/node_modules/datadog-lambda-js/index.js:175:62",
            "    at step (/opt/nodejs/node_modules/datadog-lambda-js/index.js:44:23)",
            "    at Object.next (/opt/nodejs/node_modules/datadog-lambda-js/index.js:25:53)"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: MongooseServerSelectionError: Server selection timed out after 30000 ms",
        "    at process.<anonymous> (/var/runtime/index.js:35:15)",
        "    at process.emit (events.js:314:20)",
        "    at processPromiseRejections (internal/process/promises.js:209:33)",
        "    at processTicksAndRejections (internal/process/task_queues.js:98:32)"
    ]
}
agocs commented 2 years ago

What's also interesting is that I was able to comment out //context.callbackWaitsForEmptyEventLoop = false;, and I got

image

It looks like it's using the cached connection regardless of context.callbackWaitsForEmptyEventLoop

agocs commented 2 years ago

I created a repository with my code here: https://github.com/DataDog/lambda-node-mongodb. It is using

@joaomirandasa Could you please run this in your environment, connected to your MongoDB instance, and see if it gives you the timeout issue you've been experiencing? It doesn't in my environment.

Additionally, to dig into this further, could you send the name of the lambda function where you saw failures to the Datadog support person? I'd like to look at it a little further.

joaomirandas commented 2 years ago

Hi @agocs and @astuyve

Sorry, my bad, i was not clear enough to help you understand. Method of invocation to this lambda is over SNS event. This change everything. Check my screen-capture, this is what shows when this happens with library connected.

Captura de Tela 2022-02-15 às 01 35 54

Study Case

To help you understand i create an replica of our real scenarios here, it's exactly how problem shows up.

Follow those instructions to reproduce

Files

plugins:

custom: accountID: "****" mongoDB: "mongodb+srv://DB_USERNAME:DB_PASSWORD@DB_ADDRESS.mongodb.net/DB_NAME?retryWrites=true&w=majority" controlDD: dev: false hml: true app: true datadog: enabled: ${self:custom.controlDD.${opt:stage}, false} enableDDTracing: true addLayers: true flushMetricsToLogs: true logLevel: "INFO" forwarder: arn:aws:lambda:${self:provider.region}:${self:custom.accountID}:function:datadog-forwarder-Forwarder-2HGPPPR84KO4

provider: name: aws runtime: nodejs12.x stackTags: centro: "grupo" tags: centro: "grupo" stage: ${opt:stage} region: sa-east-1 timeout: 10 memorySize: 256 logRetentionInDays: 7 environment: STAGE: ${self:provider.stage} MONGODB_URI: ${self:custom.mongoDB}

We use VPC's here - but to this file example is better to be commented

vpc:

securityGroupIds:

- sg-0a7662106d8612123

subnetIds:

- subnet-031229237ae712123

apiGateway: shouldStartNameWithService: true versionFunctions: false iam: role: statements:

layers: layerTwo: name: ${self:service}-${self:provider.stage}-layerTwo path: ./layers/layerTwo description: "layer with internal mongoose/underscore" package: patterns:

package: patterns:

functions: debug: handler: debug.handler destinations: onFailure: "arn:aws:sqs:${self:provider.region}:${self:custom.accountID}:debugger-DLQ-${self:provider.stage}" events:

resources: Resources:

debuggerTopic:
  Type: "AWS::SNS::Topic"
  Properties:
    TopicName: debugger-${self:provider.stage}

debuggerQueue:
  Type: "AWS::SQS::Queue"
  Properties:
    QueueName: "debugger-DLQ-${self:provider.stage}"
    MessageRetentionPeriod: 1209600

---

- debug.js
```javascript
const mongoose = require("/opt/node_modules/mongoose");
var cachedDBS = {};

function connectToDatabase(_tenant, _mongoDB) {
  return new Promise(async (resolve) => {
    console.log(
      JSON.stringify({
        state: "[CACHED-CONNECTIONS]",
        cachedConnection: Object.keys(cachedDBS),
      })
    );
    if (cachedDBS[_tenant]) {
      console.log(JSON.stringify({ state: "[USING_CACHE_MONGO_CONN]" }));
      return resolve(cachedDBS[_tenant]);
    }
    console.log(JSON.stringify({ state: "[CREATE_NEW_MONGO_CONN]" }));
    const conn_db = mongoose.createConnection(_mongoDB, {
      bufferCommands: false,
      bufferMaxEntries: 0,
      useNewUrlParser: true,
      useUnifiedTopology: true,
      useFindAndModify: false,
      useCreateIndex: true,
      poolSize: 1,
      socketTimeoutMS: 2000000,
      keepAlive: true,
    });
    await conn_db;
    conn_db.model(
      "DB_LOGS",
      new mongoose.Schema(
        {
          eventAction: { type: String, require: true },
          eventID: { type: String, require: true },
        },
        { collection: "logs", minimize: false }
      )
    );
    cachedDBS[_tenant] = conn_db;
    return resolve(cachedDBS[_tenant]);
  });
}

module.exports.handler = async function (event, context, callback) {
  console.log(event, context);
  context.callbackWaitsForEmptyEventLoop = false;

  try {
    //Connecting to database
    let $conn = await connectToDatabase(
      "application_id_1",
      process.env.MONGODB_URI
    );

    // Simulating error to forwarder events to DLQ
    // throw new Error('FORCING_ERROR_TO_DLQ');

    //Example of save 
    let _preSavingTTL = $conn.model('DB_LOGS')({
      eventAction: 'dd_debug_action',
      eventID: 1,
    });
    let _savingEvent = await _preSavingTTL.save();
    console.log(JSON.stringify({state: "[LOG_SAVE_SUCCESS]"}));
    console.log(JSON.stringify(_savingEvent));

    //Calling callback 
    return callback(null, {
      statusCode: 200,
      headers: {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Credentials": true,
      },
      body: JSON.stringify({ status: true }),
    });
  } catch(err){
    console.log(err);
    throw new Error(err);
  }
};


const __SNS = function(_e){ const AWS = require('aws-sdk'); const SNS = new AWS.SNS({apiVersion: '2010-03-31', region: _region}); return SNS.publish({ Message: JSON.stringify(_e.body), MessageAttributes : { "event": { "DataType": "String", "StringValue": _e.event || 'logger:save' }, "businessUnit":{ "DataType": "String", "StringValue": _e.bu || 'DEVELOPER' } }, TopicArn: arn:aws:sns:${_region}:${_accountID}:debugger-hml }).promise() };

async function injectEvent() { console.log(#Notify SNS); let pushEvent = await __SNS({ event: 'debugger:event', bu: 'DEBUGGER', body: { "eventID": "433b86e4-0c5c-4ef0-b155-43c83d15762a", } }) console.log(pushEvent); };

injectEvent();

joaomirandas commented 2 years ago

What do your cloudwatch logs say when you see timeouts? I'm seeing the following message, but I'm seeing it in about 5% of invocations (both with and without serverless-plugin-datadog, and with serverles-plugin-datadog 3.5.3 and 3.7.0):

2022-02-14T20:47:12.840Z  252cf6a8-9ddd-477c-ab64-15e65731e577    ERROR   [dd.trace_id=4650720379527467306 dd.span_id=8058852911194760148] Unhandled Promise Rejection    {
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "MongooseServerSelectionError: Server selection timed out after 30000 ms",
    "reason": {
        "errorType": "MongooseServerSelectionError",
        "errorMessage": "Server selection timed out after 30000 ms",
        "message": "Server selection timed out after 30000 ms",
        "reason": {
            "type": "ReplicaSetNoPrimary",
            "servers": {},
            "stale": false,
            "compatible": true,
            "heartbeatFrequencyMS": 10000,
            "localThresholdMS": 15,
            "setName": "rs0"
        },
        "stack": [
            "MongooseServerSelectionError: Server selection timed out after 30000 ms",
            "    at NativeConnection.Connection.openUri (/var/task/node_modules/mongoose/lib/connection.js:802:32)",
            "    at Mongoose.createConnection (/var/task/node_modules/mongoose/lib/index.js:285:10)",
            "    at /var/task/handler.js:13:30",
            "    at new Promise (<anonymous>)",
            "    at connectToDatabase (/var/task/handler.js:6:10)",
            "    at module.exports.hello (/var/task/handler.js:34:21)",
            "    at /opt/nodejs/node_modules/datadog-lambda-js/utils/handler.js:156:25",
            "    at /opt/nodejs/node_modules/datadog-lambda-js/index.js:175:62",
            "    at step (/opt/nodejs/node_modules/datadog-lambda-js/index.js:44:23)",
            "    at Object.next (/opt/nodejs/node_modules/datadog-lambda-js/index.js:25:53)"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: MongooseServerSelectionError: Server selection timed out after 30000 ms",
        "    at process.<anonymous> (/var/runtime/index.js:35:15)",
        "    at process.emit (events.js:314:20)",
        "    at processPromiseRejections (internal/process/promises.js:209:33)",
        "    at processTicksAndRejections (internal/process/task_queues.js:98:32)"
    ]
}

This is another level of error, this is knowed by us, it's related to previous connected instance with mongodb, it's complex to explain but trying to simplyfi, does not matter if you return callback, connection keeps opened in backyards of lambda, when this connection goes down this error appears. It's verry, verry curious, i never had opportunity to go futher in this. Maybe you can help-me to understand how this arrives in datadog logs. Will be very usefull to improve our visibility in APM.

joaomirandas commented 2 years ago

@agocs and @astuyve any updates on this?

agocs commented 2 years ago

@joaomirandasa I'm making some progress. I'll report back with my findings by EOD

agocs commented 2 years ago

@joaomirandasa I believe I'm seeing timeouts with both serverless-plugin-datadog 3.5.3 and 3.7.0, which is unexpected. I will work to verify/fix that tomorrow.

joaomirandas commented 2 years ago

Thank-you for your efforts until now! I still here if you need more information or tests.

agocs commented 2 years ago

OK, this is interesting. I've officially reproduced this. Using serverless-plugin-datadog 3.7.0 and bufferCommands:false, we fail to write to MongoDB and the lambda function times out.

image

serverless-plugin-datadog 3.7.0 and bufferCommands:true, we successfully write to MongoDB and the lambda function times out. It looks like dd-trace-js(?) is keeping the lambda function running.

image

serverless-plugin-datadog 3.5.3 and bufferedCommands:false, we failed to write to MongoDB, but the lambda function exited within 2 seconds.

image

serverless-plugin-datadog 3.5.3 and bufferedCommands:true, we successfully write to Mongo and the function exits in a timely manner

image

agocs commented 2 years ago

Just to satisfy my own curiosity, callbackWaitsForEmptyEventLoop doesn't have any bearing on whether 3.7.0 leaves the lambda function running.

joaomirandas commented 2 years ago

Yeah! this is very interesting, my settings here is using

serverless-plugin-datadog 3.5.3 and bufferedCommands:true

There's some work around there we can implement in dd-trace-js? If you could share your next steps i will be glad

agocs commented 2 years ago

Hey @joaomirandasa , I'm going through different builds of datadog-lambda-js and dd-trace-js to try to figure out what specifically caused this regression.

astuyve commented 2 years ago

Hey @joaomirandasa - we've released a new version of Datadog Lambda JS which should fix this issue, you can try it by upgrading your version of serverless-plugin-datadog to v3.7.1. Please let us know if this fixes the issue for you! https://github.com/DataDog/serverless-plugin-datadog/releases/tag/v3.7.1

astuyve commented 2 years ago

Based on internal conversations with DataDog Customer Support, this issue is resolved with 3.7.1. Thanks!