DataDog / datadog-lambda-js

The Datadog AWS Lambda Library for Node
Apache License 2.0
113 stars 35 forks source link

We are experiencing a weird issue where datadog's lambda layer is loading a module which is an optional dependency of its parent package #397

Closed rubythulhu closed 1 year ago

rubythulhu commented 1 year ago

Our error comes from these three lines: https://github.com/DataDog/datadog-lambda-js/blob/main/src/runtime/user-function.ts#L215-L217

Expected Behavior

As outlined in mscdex/ssh2#1310, to the ssh2 module, in normal circumstances where datadog-lambda-js is not involved, the bundled sshcrypto dependency is NOT required. If this file is missing, ssh2 package itself just moves along. However, datadog pukes on its own shoes trying to load this optional module.

Actual Behavior

When trying to bundle a lambda with optional runtime dependencies, in a try {require(something)} catch { /* do nothing */ } block, datadog lambda layer will attempt to require this file anyway, and cause a runtime failure when it doesn't exist.

Steps to Reproduce the Problem

  1. Create a Lambda function which uses ssh2-sftp-client to upload a file to any sftp server
  2. DO NOT bundle the (OPTIONAL) sshcrypto.node binary module. if necessary, delete it beforehand
  3. add datadog-lambda-js
  4. Run the lambda. It will fail trying to load this dependency, despite the source explicitly try/catch'ing the require

Specifications

Stacktrace

{
    "errorType": "Runtime.ImportModuleError",
    "errorMessage": "Error: Cannot find module 'ssh2'\nRequire stack:\n- [REDACTED FILENAME]\n- /opt/nodejs/node_modules/datadog-lambda-js/runtime/user-function.js\n- /opt/nodejs/node_modules/datadog-lambda-js/runtime/index.js",
    "stack": [
        "Runtime.ImportModuleError: Error: Cannot find module 'ssh2'",
        "Require stack:",
        "- [REDACTED FILENAME]",
        "- /opt/nodejs/node_modules/datadog-lambda-js/runtime/user-function.js",
        "- /opt/nodejs/node_modules/datadog-lambda-js/runtime/index.js",
        "    at ImportModuleError.ExtendedError [as constructor] (/opt/nodejs/node_modules/datadog-lambda-js/runtime/errors.js:113:28)",
        "    at new ImportModuleError (/opt/nodejs/node_modules/datadog-lambda-js/runtime/errors.js:123:42)",
        "    at /opt/nodejs/node_modules/datadog-lambda-js/runtime/user-function.js:273:31",
        "    at step (/opt/nodejs/node_modules/datadog-lambda-js/runtime/user-function.js:43:23)",
        "    at Object.throw (/opt/nodejs/node_modules/datadog-lambda-js/runtime/user-function.js:24:53)",
        "    at rejected (/opt/nodejs/node_modules/datadog-lambda-js/runtime/user-function.js:16:65)"
    ]
}

There is a chance that this comes from esbuild, as esbuild is part of our lambda compilation step. I will also be submitting a bug report to them.

astuyve commented 1 year ago

Hey @rubythulhu, thanks for reaching out!

The reason Datadog shows up in this stack trace (and any stack trace where a module can't be loaded or the runtime pukes on its own shoes) is because our instrumentation functions by requiring your handler code (source). If Node can't load that file, it pukes.

I'm not sure we do anything special here - this same stacktrace could happen if the handler included a syntax error, or similarly imported a file that didn't exist. We don't depend on that file, nor do we require it explicitly.

As you've esbuilt this project, I can't really check where else this is calling from in your own handler code or the dependencies packaged. Can you extract the code around this error and see where require(ssh2) is being called? If you remove esbuild from your application, does this still throw an error? If you remove this library, does it throw an error?

I read your conversation in esbuild and ssh2, and see your comment:

Apparently, due to an issue in AWS lambda where their runtime breaks the try {require} catch {} (in ssh2 itself, and the compiled esbuild version adds another catch{} layer) and causes the runtime error, the only ways to fix this are all different variations of "remove the part where ssh2 tries to load the binary module entirely":

I'm not sure I have more to add here, except that I've extracted the node runtimes into a gist, which may help your investigation.

Thanks again!

rubythulhu commented 1 year ago

I am unable to replicate the issue by doing a simple esbuild-free lambda which also does not contain the datadog layer:

let x;
try { x = require("./foo"); console.log("missing module require ok?") } 
catch(e) { console.log("caught error", e); }
console.log("setup complete");

module.exports.handler = async(event) => {
    const response = {
        statusCode: 200,
        body: JSON.stringify('Hello from Lambda!'),
    };
    return response;
};

Logs say:

Response
{
  "statusCode": 200,
  "body": "\"Hello from Lambda!\""
}

Function Logs
2023-06-27T22:37:44.324Z    undefined   INFO    caught error Error: Cannot find module './foo'
Require stack:
- /var/task/index.js
- /var/runtime/index.mjs
    at Function.Module._resolveFilename (node:internal/modules/cjs/loader:1026:15)
    at Function.Module._load (node:internal/modules/cjs/loader:871:27)
    at Module.require (node:internal/modules/cjs/loader:1098:19)
    at require (node:internal/modules/cjs/helpers:108:18)
    at Object.<anonymous> (/var/task/index.js:2:11)
    at Module._compile (node:internal/modules/cjs/loader:1196:14)
    at Object.Module._extensions..js (node:internal/modules/cjs/loader:1250:10)
    at Module.load (node:internal/modules/cjs/loader:1074:32)
    at Function.Module._load (node:internal/modules/cjs/loader:909:12)
    at Module.require (node:internal/modules/cjs/loader:1098:19) {
  code: 'MODULE_NOT_FOUND',
  requireStack: [ '/var/task/index.js', '/var/runtime/index.mjs' ]
}
2023-06-27T22:37:44.324Z    undefined   INFO    setup complete
START RequestId: 767dac5b-f54b-419a-aa4e-f4ea2cd13521 Version: $LATEST
END RequestId: 767dac5b-f54b-419a-aa4e-f4ea2cd13521
REPORT RequestId: 767dac5b-f54b-419a-aa4e-f4ea2cd13521  Duration: 10.51 ms  Billed Duration: 11 ms  Memory Size: 128 MB Max Memory Used: 58 MB  Init Duration: 146.56 ms

Request ID
767dac5b-f54b-419a-aa4e-f4ea2cd13521

However, I am able to validate that the esbuild-compiled source does INDEED contain the try{ require } catch{} code verbatim, without changes, so it seems from the above evidence that neither AWS lambda runtime nor esbuild is causing this problem.

When I have a little more time I will try a standalone test that includes the datadog layer and configuration, to see if i can get this to fail that way, but i'll need to use a more complicated setup than "manually create single-file lambda" probably to get an accurate test.

astuyve commented 1 year ago

Okay thanks @rubythulhu that is helpful.

The error from that trace is coming from the cjs loader, and what's interesting is that we're actually throwing an ImportModuleError - which is something we only create if an unhandled exception is thrown when loading the user's handler, and the e.code === MODULE_NOT_FOUND. You can find it here

Is there a chance the catch clause is re-raising? Or maybe this is coming from some other part of the file? If you don't package with es-build, does this go away?

It'd be really helpful if we could have a reproducible example here, as without seeing your handler logic or the full/unredacted stack trace I don't think I can explore much more.

We may be able to make more progress if you can open a support ticket and share these details privately. Feel free to attach this github thread, and you'll be quickly escalated to me.

Thanks!

rubythulhu commented 1 year ago

Is there a chance the catch clause is re-raising? Or maybe this is coming from some other part of the file? If you don't package with es-build, does this go away?

I don't believe so. esbuild does complicate the original code during the compilation process, but:

// node_modules/ssh2/lib/protocol/crypto/build/Release/sshcrypto.node
var sshcrypto_default;
var init_sshcrypto = __esm({
  "node_modules/ssh2/lib/protocol/crypto/build/Release/sshcrypto.node"() {
    sshcrypto_default = "./sshcrypto-PYHRYRPA.node";
  }
});

// node-file:/Users/ruby/work/canopy/canopy-api/node_modules/ssh2/lib/protocol/crypto/build/Release/sshcrypto.node
var require_sshcrypto = __commonJS({
  "node-file:/Users/ruby/work/canopy/canopy-api/node_modules/ssh2/lib/protocol/crypto/build/Release/sshcrypto.node"(exports, module2) {
    init_sshcrypto();
    try {
      module2.exports = require(sshcrypto_default);
    } catch {
    }
  }
});

// node_modules/ssh2/lib/protocol/crypto.js
var require_crypto = __commonJS({
  "node_modules/ssh2/lib/protocol/crypto.js"(exports, module2) {
    "use strict";
    var {
      createCipheriv,
      createDecipheriv,
      createHmac,
      randomFillSync,
      timingSafeEqual
    } = require("crypto");
    var { readUInt32BE, writeUInt32BE } = require_utils();
    var FastBuffer = Buffer[Symbol.species];
    var MAX_SEQNO = 2 ** 32 - 1;
    var EMPTY_BUFFER = Buffer.alloc(0);
    var BUF_INT = Buffer.alloc(4);
    var DISCARD_CACHE = /* @__PURE__ */ new Map();
    var MAX_PACKET_SIZE = 35e3;
    var binding;
    var AESGCMCipher;
    var ChaChaPolyCipher;
    var GenericCipher;
    var AESGCMDecipher;
    var ChaChaPolyDecipher;
    var GenericDecipher;
    try {
      binding = require_sshcrypto();
      ({
        AESGCMCipher,
        ChaChaPolyCipher,
        GenericCipher,
        AESGCMDecipher,
        ChaChaPolyDecipher,
        GenericDecipher
      } = binding);
    } catch {
    }

This exception is caught with catch { /* do nothing */ } in both cases above.

astuyve commented 1 year ago

Hi @rubythulhu - thanks again. One thing to check - are you possibly using other datadog tools like datadog-ci or datadog's serverless plugin?

If so, they should be dev dependencies of your project and not included in any runtime code, but those both have dependencies on ssh2 (the plugin depends on datadog-ci), and I'm wondering if somehow something meant for a dev env is being loaded in your Lambda function?

astuyve commented 1 year ago

Hey @rubythulhu - since we haven't heard from you for a few weeks and don't have a way to reproduce this issue, I don't think I have a way to move forward with this issue. I'm going to close it for now, but feel free to re-open if you've got an example which reproduces this bug.

Thank you!