Closed: joaomirandas closed this issue 2 years ago.
@agocs - please check it! Your commit made my production environments stop working in the middle of the day. I sent you an invite over LinkedIn too; if you need any specific information to fix this, just let me know.
Good morning @joaomirandasa, thank you for opening this issue. I'm sorry it took down your production environment. Off the top of my head, Serverless plugin 3.7.0 comes with a new version of dd-trace-js and datadog-lambda-js, but the new datadog-lambda-js shouldn't change any behavior with MongoDB.
I'm glad that pinning the dependency worked for you.
We'll try to reproduce this issue next week. To verify, the steps are:
context.callbackWaitsForEmptyEventLoop = false;
correct?
Which version and architecture of the Node runtime are you using? How much memory is configured?
@agocs - Yes! Exactly this process!
I found another issue too: when I use SNS as the event source for Lambdas with a redrive policy attached, all events received on the SNS topic were redirected to the deadLetterQueue. From my point of view it sounds like the Datadog plugin is changing something in the .yml file, which causes anomalies in all functions:
- sns:
    arn: !Ref eventTopic
    topicName: eventTopic-${self:provider.stage}
    filterPolicy:
      event:
        - event:named
    redrivePolicy:
      deadLetterTargetRef: dispatchMTMessageQueue
That's interesting, it also happened to me, under the same circumstances. Currently watching this issue.
@joaomirandasa Let's try to keep issues separate, I've opened another issue for your latest comment here: https://github.com/DataDog/serverless-plugin-datadog/issues/228
@JNussenzveig are you also experiencing this reported timeout issue? Can you add the clarifying details requested by Chris?
Perhaps it would be helpful to Chris, for reproduction purposes, to detail a bit more about your handler code. Where do you call the callback? What else is your handler doing before/after the call? Can you help us understand your usage of callbackWaitsForEmptyEventLoop?
Thank you!
@astuyve The same happens to me, with exactly the same scenario reported by @joaomirandasa.
Serverless Framework version: 3.2.0
Datadog Serverless Plugin version: 3.7.0
Lambda function runtime: nodejs12.x
Additional details: using an external, cached connection to MongoDB via Mongoose
This started happening in version 3.7.0; to keep my services working I pinned the version at serverless-plugin-datadog@3.5.3.
@astuyve
Serverless Framework version: 3.2.0
Datadog Serverless Plugin version: 3.7.0
Lambda function runtime: nodejs12.x
Additional details: using an external connection to MongoDB via Mongoose
Problematic: serverless-plugin-datadog@3.7.0
Pinned (working): serverless-plugin-datadog@3.5.3
Thanks! Since you (@fabiofontesx @JNussenzveig @joaomirandasa) seem to be using the exact same setup, I presume you're working together on the same app? Could someone help clarify the additional questions I've added? It would be helpful for just one person to reply, to prevent comment clutter.
Where do you call the callback? What else is your handler doing before/after the call? Can you help us understand your usage of callbackWaitsForEmptyEventLoop?
Thanks!
Hi @astuyve,
Yes, any information you need to go further with this. The usage of callbackWaitsForEmptyEventLoop is well explained in the AWS documentation, but I'll try to describe our application scenario here.
We have a lot of Lambdas connecting to the database at high request rates, something like 30 req/second. It's not good to open a new MongoDB connection for each of those requests, so to handle this we create a connection when the Lambda goes from cold start to warm. We do this with code similar to the below:
const mongoose = require('/opt/node_modules/mongoose');
var Schema = mongoose.Schema;
var cachedDBS = {};

function connectToDatabase(_tenant, _mongoDB) {
  return new Promise(async (resolve) => {
    console.log(JSON.stringify({ state: "[CACHED-CONNECTIONS]", cachedConnection: Object.keys(cachedDBS) }));
    if (cachedDBS[_tenant]) {
      console.log(JSON.stringify({ state: "[USING_CACHE_MONGO_CONN]" }));
      return resolve(cachedDBS[_tenant]);
    }
    console.log(JSON.stringify({ state: "[CREATE_NEW_MONGO_CONN]" }));
    const conn_db = mongoose.createConnection(_mongoDB, {
      bufferCommands: false,
      bufferMaxEntries: 0,
      useNewUrlParser: true,
      useUnifiedTopology: true,
      useFindAndModify: false,
      useCreateIndex: true,
      poolSize: 1,
      socketTimeoutMS: 2000000,
      keepAlive: true
    });
    await conn_db;
    conn_db.model('DB_LOGS', new mongoose.Schema({
      eventAction: { type: String, require: true },
      eventID: { type: String, require: true }
    }, { collection: 'logs', minimize: false }));
    cachedDBS[_tenant] = conn_db;
    return resolve(cachedDBS[_tenant]);
  });
}

module.exports.handler = async function (event, context, callback) {
  context.callbackWaitsForEmptyEventLoop = false;
  let $conn = await connectToDatabase('application_id_1', process.env.MONGODB_URI);
  // [LINES_OF_FUNCTION_LOGIC_HERE]
  // [LINES_OF_FUNCTION_LOGIC_HERE]
  // [LINES_OF_FUNCTION_LOGIC_HERE]
  return callback(null, {
    statusCode: 200,
    headers: {
      "Access-Control-Allow-Origin": "*",
      "Access-Control-Allow-Credentials": true
    },
    body: JSON.stringify({ status: true })
  });
};
Check the way we cache the MongoDB connection in var cachedDBS = {}; for the next request to this same Lambda. In that case it's not necessary to spend time connecting to the database again until the Lambda goes cold. But this behaviour is not the default for Lambdas: by default, a Lambda never returns the callback's response until the event loop is empty, which an open external connection prevents. To allow this behaviour you need to add the line context.callbackWaitsForEmptyEventLoop = false; - it's like telling the Lambda "hey, you are now allowed to return the callback with external connections still open".
We have always used Lambdas with Datadog in this way, but after the latest update, for some reason, we never receive the callback, as if the line context.callbackWaitsForEmptyEventLoop = false; didn't exist. My guess is that Datadog's handler wrapper is somehow overwriting that statement or disabling that behaviour.
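(For context: without that flag, the alternative would be to close the connection before calling back, which defeats the warm-connection cache. A rough sketch, assuming the connectToDatabase helper and cachedDBS cache from the snippet above:)

```javascript
// Hypothetical variant WITHOUT callbackWaitsForEmptyEventLoop = false:
// the connection is closed so the event loop can drain before the response is returned.
module.exports.handler = async function (event, context, callback) {
  const conn = await connectToDatabase('application_id_1', process.env.MONGODB_URI);
  // [LINES_OF_FUNCTION_LOGIC_HERE]
  await conn.close();                    // tear down the MongoDB socket...
  delete cachedDBS['application_id_1'];  // ...and forget it, so the next invocation reconnects
  return callback(null, { statusCode: 200, body: JSON.stringify({ status: true }) });
};
```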
If you need more information, or a debugging session, please get in touch at joao.miranda@timeprimo.com. We have also contacted the Datadog team in Brazil, so this request may reach you internally as well.
@joaomirandasa that's perfect, thanks so much!
Yes, please open an issue with Datadog support so that we can communicate securely, as we may need to ask further questions. This likely isn't related to the plugin, which doesn't actually run in your handler; however, it's possibly related to the Datadog tracer. The only change between 3.5 and 3.7 is the tracer version(s), which are included in the library layers set by this plugin.
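To make that distinction concrete, here is a conceptual sketch of the shape of the wrapping (not the actual datadog-lambda-js source; the helper names are made up): the layer's wrapper runs around the handler at invoke time, so tracer-side work after the handler returns can keep the event loop busy, while the plugin itself only acts at deploy time.

```javascript
// Conceptual sketch only; the real datadog-lambda-js wrapper is more involved.
const userHandler = async (event, context) => ({ statusCode: 200 }); // stand-in for your handler

// Hypothetical stand-ins for tracer work done around each invocation.
async function startInvocationSpan(event, context) { return { startedAt: Date.now() }; }
async function flushTracesAndMetrics(span) { /* network I/O, timers, open sockets, ... */ }

exports.handler = async (event, context) => {
  const span = await startInvocationSpan(event, context);
  try {
    return await userHandler(event, context);
  } finally {
    await flushTracesAndMetrics(span); // runtime-side work that can delay the function's return
  }
};
```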
Thanks again for your detailed report!
@astuyve Yes! I've already opened that support ticket (#689712). That makes sense; I checked the version changes and saw those layers. It's probably something related. I will keep tracking.
What do your CloudWatch logs say when you see timeouts? I'm seeing the following message, but only in about 5% of invocations (both with and without serverless-plugin-datadog, and with serverless-plugin-datadog 3.5.3 and 3.7.0):
2022-02-14T20:47:12.840Z 252cf6a8-9ddd-477c-ab64-15e65731e577 ERROR [dd.trace_id=4650720379527467306 dd.span_id=8058852911194760148] Unhandled Promise Rejection {
"errorType": "Runtime.UnhandledPromiseRejection",
"errorMessage": "MongooseServerSelectionError: Server selection timed out after 30000 ms",
"reason": {
"errorType": "MongooseServerSelectionError",
"errorMessage": "Server selection timed out after 30000 ms",
"message": "Server selection timed out after 30000 ms",
"reason": {
"type": "ReplicaSetNoPrimary",
"servers": {},
"stale": false,
"compatible": true,
"heartbeatFrequencyMS": 10000,
"localThresholdMS": 15,
"setName": "rs0"
},
"stack": [
"MongooseServerSelectionError: Server selection timed out after 30000 ms",
" at NativeConnection.Connection.openUri (/var/task/node_modules/mongoose/lib/connection.js:802:32)",
" at Mongoose.createConnection (/var/task/node_modules/mongoose/lib/index.js:285:10)",
" at /var/task/handler.js:13:30",
" at new Promise (<anonymous>)",
" at connectToDatabase (/var/task/handler.js:6:10)",
" at module.exports.hello (/var/task/handler.js:34:21)",
" at /opt/nodejs/node_modules/datadog-lambda-js/utils/handler.js:156:25",
" at /opt/nodejs/node_modules/datadog-lambda-js/index.js:175:62",
" at step (/opt/nodejs/node_modules/datadog-lambda-js/index.js:44:23)",
" at Object.next (/opt/nodejs/node_modules/datadog-lambda-js/index.js:25:53)"
]
},
"promise": {},
"stack": [
"Runtime.UnhandledPromiseRejection: MongooseServerSelectionError: Server selection timed out after 30000 ms",
" at process.<anonymous> (/var/runtime/index.js:35:15)",
" at process.emit (events.js:314:20)",
" at processPromiseRejections (internal/process/promises.js:209:33)",
" at processTicksAndRejections (internal/process/task_queues.js:98:32)"
]
}
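Side note on that stack trace: connectToDatabase wraps an async Promise executor that only ever resolves, so when createConnection fails the rejection has nowhere to go and surfaces as Runtime.UnhandledPromiseRejection. A hedged sketch of a variant that propagates the error instead (illustrative only, not necessarily related to the 3.7.0 regression):

```javascript
const mongoose = require('mongoose');

// Same caching pattern as the repro, but written as a plain async function so a
// failed createConnection rejects the returned promise instead of becoming an
// unhandled rejection inside a Promise executor.
const cachedDBS = {};

async function connectToDatabase(tenant, mongoUri) {
  if (cachedDBS[tenant]) {
    return cachedDBS[tenant];
  }
  const conn = mongoose.createConnection(mongoUri, {
    bufferCommands: false,
    useNewUrlParser: true,
    useUnifiedTopology: true,
  });
  try {
    await conn; // Mongoose 5 connections are thenable; this rejects on server selection errors
  } catch (err) {
    console.error(JSON.stringify({ state: '[MONGO_CONN_FAILED]', message: err.message }));
    throw err; // the caller (and the Lambda runtime) now sees an ordinary, attributable error
  }
  cachedDBS[tenant] = conn;
  return conn;
}

module.exports = { connectToDatabase };
```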
What's also interesting is that I was able to comment out //context.callbackWaitsForEmptyEventLoop = false;, and I got:
It looks like it's using the cached connection regardless of context.callbackWaitsForEmptyEventLoop.
I created a repository with my code here: https://github.com/DataDog/lambda-node-mongodb. It is using context.callbackWaitsForEmptyEventLoop = false (but it seems to work the same whether this is set or not).
@joaomirandasa Could you please run this in your environment, connected to your MongoDB instance, and see if it gives you the timeout issue you've been experiencing? It doesn't in my environment.
Additionally, to dig into this further, could you send the name of the Lambda function where you saw failures to the Datadog support person? I'd like to look at it a little further.
Hi @agocs and @astuyve
Sorry, my bad, I was not clear enough to help you understand. The method of invocation for this Lambda is an SNS event, and that changes everything. Check my screen capture: this is what shows up when this happens with the library enabled.
To help you understand, I created a replica of our real scenario here; it is exactly how the problem shows up.
1. Deploy with sls deploy --stage hml.
2. Run node test.js and watch the problem happen.
3. Set controlDD.hml to false, like below:
custom:
  accountID: "************"
  mongoDB: "mongodb+srv://DB_USERNAME:DB_PASSWORD@DB_ADDRESS.mongodb.net/DB_NAME?retryWrites=true&w=majority"
  controlDD:
    dev: false
    hml: false
    app: true
  datadog:
    enabled: ${self:custom.controlDD.${opt:stage}, false}
    enableDDTracing: true
    addLayers: true
    flushMetricsToLogs: true
    logLevel: "INFO"
    forwarder: arn:aws:lambda:${self:provider.region}:${self:custom.accountID}:function:datadog-forwarder-Forwarder-2HGPPPR84KO4
4. Run node test.js again and check how everything works fine.
- serverless.yml
service: dd-debug
plugins:
custom:
  accountID: "****"
  mongoDB: "mongodb+srv://DB_USERNAME:DB_PASSWORD@DB_ADDRESS.mongodb.net/DB_NAME?retryWrites=true&w=majority"
  controlDD:
    dev: false
    hml: true
    app: true
  datadog:
    enabled: ${self:custom.controlDD.${opt:stage}, false}
    enableDDTracing: true
    addLayers: true
    flushMetricsToLogs: true
    logLevel: "INFO"
    forwarder: arn:aws:lambda:${self:provider.region}:${self:custom.accountID}:function:datadog-forwarder-Forwarder-2HGPPPR84KO4
provider:
  name: aws
  runtime: nodejs12.x
  stackTags:
    centro: "grupo"
  tags:
    centro: "grupo"
  stage: ${opt:stage}
  region: sa-east-1
  timeout: 10
  memorySize: 256
  logRetentionInDays: 7
  environment:
    STAGE: ${self:provider.stage}
    MONGODB_URI: ${self:custom.mongoDB}
  apiGateway:
    shouldStartNameWithService: true
  versionFunctions: false
  iam:
    role:
      statements:
layers:
  layerTwo:
    name: ${self:service}-${self:provider.stage}-layerTwo
    path: ./layers/layerTwo
    description: "layer with internal mongoose/underscore"
    package:
      patterns:
package:
  patterns:
functions:
  debug:
    handler: debug.handler
    destinations:
      onFailure: "arn:aws:sqs:${self:provider.region}:${self:custom.accountID}:debugger-DLQ-${self:provider.stage}"
    events:
resources:
  Resources:
    debuggerTopic:
      Type: "AWS::SNS::Topic"
      Properties:
        TopicName: debugger-${self:provider.stage}
    debuggerQueue:
      Type: "AWS::SQS::Queue"
      Properties:
        QueueName: "debugger-DLQ-${self:provider.stage}"
        MessageRetentionPeriod: 1209600
---
- debug.js
```javascript
const mongoose = require("/opt/node_modules/mongoose");
var cachedDBS = {};

function connectToDatabase(_tenant, _mongoDB) {
  return new Promise(async (resolve) => {
    console.log(
      JSON.stringify({
        state: "[CACHED-CONNECTIONS]",
        cachedConnection: Object.keys(cachedDBS),
      })
    );
    if (cachedDBS[_tenant]) {
      console.log(JSON.stringify({ state: "[USING_CACHE_MONGO_CONN]" }));
      return resolve(cachedDBS[_tenant]);
    }
    console.log(JSON.stringify({ state: "[CREATE_NEW_MONGO_CONN]" }));
    const conn_db = mongoose.createConnection(_mongoDB, {
      bufferCommands: false,
      bufferMaxEntries: 0,
      useNewUrlParser: true,
      useUnifiedTopology: true,
      useFindAndModify: false,
      useCreateIndex: true,
      poolSize: 1,
      socketTimeoutMS: 2000000,
      keepAlive: true,
    });
    await conn_db;
    conn_db.model(
      "DB_LOGS",
      new mongoose.Schema(
        {
          eventAction: { type: String, require: true },
          eventID: { type: String, require: true },
        },
        { collection: "logs", minimize: false }
      )
    );
    cachedDBS[_tenant] = conn_db;
    return resolve(cachedDBS[_tenant]);
  });
}

module.exports.handler = async function (event, context, callback) {
  console.log(event, context);
  context.callbackWaitsForEmptyEventLoop = false;
  try {
    // Connecting to the database
    let $conn = await connectToDatabase(
      "application_id_1",
      process.env.MONGODB_URI
    );

    // Simulating an error to forward events to the DLQ
    // throw new Error('FORCING_ERROR_TO_DLQ');

    // Example of a save
    let _preSavingTTL = $conn.model('DB_LOGS')({
      eventAction: 'dd_debug_action',
      eventID: 1,
    });
    let _savingEvent = await _preSavingTTL.save();
    console.log(JSON.stringify({ state: "[LOG_SAVE_SUCCESS]" }));
    console.log(JSON.stringify(_savingEvent));

    // Calling the callback
    return callback(null, {
      statusCode: 200,
      headers: {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Credentials": true,
      },
      body: JSON.stringify({ status: true }),
    });
  } catch (err) {
    console.log(err);
    throw new Error(err);
  }
};
```
- package.json
{
  "dependencies": {
    "aws-sdk": "^2.1001.0",
    "serverless": "^3.2.0",
    "serverless-plugin-datadog": "^3.5.3"
  }
}
- test.js
const _region = 'sa-east-1';
const _accountID = '************';

const __SNS = function (_e) {
  const AWS = require('aws-sdk');
  const SNS = new AWS.SNS({ apiVersion: '2010-03-31', region: _region });
  return SNS.publish({
    Message: JSON.stringify(_e.body),
    MessageAttributes: {
      "event": {
        "DataType": "String",
        "StringValue": _e.event || 'logger:save'
      },
      "businessUnit": {
        "DataType": "String",
        "StringValue": _e.bu || 'DEVELOPER'
      }
    },
    TopicArn: `arn:aws:sns:${_region}:${_accountID}:debugger-hml`
  }).promise();
};

async function injectEvent() {
  console.log('#Notify SNS');
  let pushEvent = await __SNS({
    event: 'debugger:event',
    bu: 'DEBUGGER',
    body: {
      "eventID": "433b86e4-0c5c-4ef0-b155-43c83d15762a",
    }
  });
  console.log(pushEvent);
}

injectEvent();
> What do your CloudWatch logs say when you see timeouts? I'm seeing the following message, but only in about 5% of invocations (both with and without serverless-plugin-datadog, and with serverless-plugin-datadog 3.5.3 and 3.7.0):
>
> 2022-02-14T20:47:12.840Z 252cf6a8-9ddd-477c-ab64-15e65731e577 ERROR [...] Unhandled Promise Rejection: MongooseServerSelectionError: Server selection timed out after 30000 ms (full log quoted above)
This is a different class of error, and it's known to us. It's related to a previously connected MongoDB instance. It's complex to explain, but to simplify: it doesn't matter whether you return the callback, the connection stays open in the background of the Lambda, and when that connection goes down this error appears. It's very, very curious; I never had the opportunity to dig further into this. Maybe you can help me understand how this arrives in the Datadog logs - it would be very useful for improving our visibility in APM.
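If it helps with visibility: Mongoose connections are EventEmitters, so one way to make those background drops show up as ordinary, searchable log lines (rather than only surfacing when the next query fails) is to attach listeners when the cached connection is created. A sketch, reusing the logging style from the repro code:

```javascript
const mongoose = require('mongoose');

// Sketch: monitor the long-lived cached connection so drops that happen between
// invocations are logged (and picked up by the forwarder) instead of only
// surfacing as an error on the next query.
function createMonitoredConnection(mongoUri) {
  const conn = mongoose.createConnection(mongoUri, {
    bufferCommands: false,
    useNewUrlParser: true,
    useUnifiedTopology: true,
  });

  conn.on('connected', () => console.log(JSON.stringify({ state: '[MONGO_CONNECTED]' })));
  conn.on('disconnected', () => console.warn(JSON.stringify({ state: '[MONGO_DISCONNECTED]' })));
  conn.on('error', (err) =>
    console.error(JSON.stringify({ state: '[MONGO_CONN_ERROR]', message: err.message }))
  );

  return conn;
}

module.exports = { createMonitoredConnection };
```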
@agocs and @astuyve any updates on this?
@joaomirandasa I'm making some progress. I'll report back with my findings by EOD
@joaomirandasa I believe I'm seeing timeouts with both serverless-plugin-datadog 3.5.3 and 3.7.0, which is unexpected. I will work to verify/fix that tomorrow.
Thank you for your efforts so far! I'm still here if you need more information or tests.
OK, this is interesting. I've officially reproduced this:
- serverless-plugin-datadog 3.7.0 with bufferCommands: false: we fail to write to MongoDB and the Lambda function times out.
- serverless-plugin-datadog 3.7.0 with bufferCommands: true: we successfully write to MongoDB, but the Lambda function still times out. It looks like dd-trace-js(?) is keeping the Lambda function running.
- serverless-plugin-datadog 3.5.3 with bufferCommands: false: we fail to write to MongoDB, but the Lambda function exits within 2 seconds.
- serverless-plugin-datadog 3.5.3 with bufferCommands: true: we successfully write to MongoDB and the function exits in a timely manner.
Just to satisfy my own curiosity: callbackWaitsForEmptyEventLoop doesn't have any bearing on whether 3.7.0 leaves the Lambda function running.
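For anyone replaying that matrix: the only Mongoose-side knob being flipped is the connection option from the repro code; the plugin version is what changes in package.json/serverless.yml. A minimal sketch of the toggle (hypothetical BUFFER_COMMANDS env var, not part of the repro):

```javascript
const mongoose = require('mongoose');

// Hypothetical toggle used when comparing runs; all other options as in debug.js.
const bufferCommands = process.env.BUFFER_COMMANDS === 'true';

const conn = mongoose.createConnection(process.env.MONGODB_URI, {
  bufferCommands, // false: operations fail fast when not connected; true: they queue until connected
  useNewUrlParser: true,
  useUnifiedTopology: true,
});
```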
Yeah! This is very interesting. My setup here uses serverless-plugin-datadog 3.5.3 and bufferCommands: true.
Is there some workaround we could implement in dd-trace-js? If you could share your next steps I would be glad.
Hey @joaomirandasa , I'm going through different builds of datadog-lambda-js and dd-trace-js to try to figure out what specifically caused this regression.
Hey @joaomirandasa - we've released a new version of Datadog Lambda JS which should fix this issue; you can try it by upgrading your version of serverless-plugin-datadog to v3.7.1. Please let us know if this fixes the issue for you!
https://github.com/DataDog/serverless-plugin-datadog/releases/tag/v3.7.1
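For anyone following along, picking up the fix is just a version bump; e.g., if you install the plugin via npm as in the repro's package.json, something like:

```json
{
  "dependencies": {
    "serverless": "^3.2.0",
    "serverless-plugin-datadog": "^3.7.1"
  }
}
```

followed by a redeploy so the functions pick up the updated library layers.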
Based on internal conversations with DataDog Customer Support, this issue is resolved with 3.7.1. Thanks!
Expected Behavior
Receive logs from my AWS Lambda functions in Datadog, using the log forwarder. Those Lambdas use persistent connections to databases like MongoDB or Redis, which means a Lambda returns its response without waiting for all event loop activity to finish.
Actual Behavior
All my Lambdas that keep a persistent database connection using
context.callbackWaitsForEmptyEventLoop = false;
were closed with timed-out behaviour: the processing steps ran, but the Lambda never returned any response, exactly as if the context line above did not exist.
Steps to Reproduce the Problem
Install the log forwarder using this command:
Once installed, create a Lambda with a persistent database connection (e.g. MongoDB). To do this you need to create the connection outside the handler and keep it in a variable so it acts as a cached connection. By default, a Lambda won't return a response while external connections are still open; you generally need to close the connection to receive anything in the callback, BUT if you set the line below you can do this trick (see the sketch after these steps).
Learn more about this here
Call this endpoint with an open external connection and check: you will never receive any response, and the Lambda will end with a TIMED-OUT status.
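A minimal sketch of that setup (hypothetical handler; a fuller version appears earlier in this thread):

```javascript
const mongoose = require('mongoose');

// Connection created outside the handler so it survives while the container is warm.
let cachedConn = null;

async function getConnection() {
  if (cachedConn) return cachedConn;
  cachedConn = mongoose.createConnection(process.env.MONGODB_URI, { bufferCommands: false });
  await cachedConn; // thenable in Mongoose 5; resolves once the connection is open
  return cachedConn;
}

module.exports.handler = async (event, context, callback) => {
  // Allow the response to be returned even though the MongoDB socket stays open.
  context.callbackWaitsForEmptyEventLoop = false;
  await getConnection();
  return callback(null, { statusCode: 200, body: JSON.stringify({ status: true }) });
};
```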
Specifications
Stacktrace