duckdb / duckdb-node

MIT License
55 stars 27 forks

Serverside Rendering on Vercel fails; missing GLIBC_2.29 #15

Open ItsMeBrianD opened 1 year ago

ItsMeBrianD commented 1 year ago

What happens?

When attempting to deploy a JavaScript project that uses SSR and DuckDB to Vercel, the build fails.

The error message presented by DuckDB is /lib64/libm.so.6: version 'GLIBC_2.29' not found (required by /vercel/path0/node_modules/duckdb/lib/binding/duckdb.node).

This has worked previously.
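For diagnosing this class of failure, one approach (a sketch, assuming a glibc-based Linux environment with binutils installed, and the default duckdb npm package layout) is to list the GLIBC symbol versions the prebuilt binding requires and compare them with what the runtime ships:

```shell
# List the GLIBC versions required by the prebuilt binding
# (path assumes the default duckdb npm package layout):
objdump -T node_modules/duckdb/lib/binding/duckdb.node \
  | grep -o 'GLIBC_[0-9.]*' | sort -uV | tail -n 1

# Compare with the GLIBC the target environment actually provides:
ldd --version | head -n 1
```

If the version reported by the first command is newer than the one reported by the second, the binding cannot load in that environment.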

To Reproduce

This repo has a simple reproduction of the issue: https://github.com/ItsMeBrianD/duckdb-vercel-repro. Create a Vercel project based on it (or a fork), and the build will fail with the error message above.

OS:

Vercel

DuckDB Version:

0.7.1

DuckDB Client:

node

Full Name:

Brian Donald

Affiliation:

Evidence

Have you tried this on the latest master branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

archiewood commented 1 year ago

@Mause we're using the NodeJS client - we're not sure, but perhaps this is new in 0.7.1?

Mause commented 1 year ago

> @Mause we're using the NodeJS client - we're not sure, but perhaps this is new in 0.7.1?

Which version does it work with? We can check for changes

tobilg commented 1 year ago

As Vercel is running on AWS Lambda as far as I know, I'm having a hard time imagining that this has worked before, as Lambda environments are currently based on Amazon Linux 2, which uses GLIBC 2.26. See https://repost.aws/questions/QUrXOioL46RcCnFGyELJWKLw/glibc-2-27-on-amazon-linux-2

I guess you could download my DuckDB for Lambda layer, and extract the build artifacts: https://github.com/tobilg/duckdb-nodejs-layer#arns

pgzmnk commented 1 year ago

Experiencing a similar error on Vercel with both Node 18.x and 16.x.

https://github.com/pgzmnk/openb

tobilg commented 1 year ago

I therefore created https://www.npmjs.com/package/duckdb-lambda-x86, which should solve the actual issue.

Mause commented 1 year ago

> > @Mause we're using the NodeJS client - we're not sure, but perhaps this is new in 0.7.1?
>
> Which version does it work with? We can check for changes

@archiewood any updates?

hanshino commented 1 year ago

I've encountered the same problem as described. Specifically, I'm using duckdb@0.7.1.

Environment:

Steps to Reproduce:

   docker run --rm -it node:14 bash

In node:14 container

   mkdir app && cd app
   yarn init -y
   yarn add duckdb@0.7.1
   cd node_modules/duckdb
   npm test

Are there any necessary packages that I need to install?

Translated by ChatGPT.

Sorry, my English is not good. I hope there's no offense.
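For context on why the node:14 repro above fails: that image is (to my understanding) based on Debian buster, whose GLIBC (2.28) predates the 2.29 that the prebuilt binding requires, so no extra package will fix it. A quick check of a container's GLIBC (a sketch; requires Docker locally):

```shell
# Print the GLIBC version shipped in the node:14 image
docker run --rm node:14 ldd --version | head -n 1
```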

tobilg commented 1 year ago

@hanshino the default duckdb npm package will not work IMO due to GLIBC incompatibilities, as described above. For Lambda usage, I maintain the https://www.npmjs.com/package/duckdb-lambda-x86 package which should fix your issues.

ryan-williams commented 11 months ago

Here's a wrapper over duckdb-async and duckdb-lambda-x86 that I just wrote, which seems to work both on my M1 macbook (which requires duckdb-async) and on an EC2 instance where I was previously hitting the GLIBC_2.29 error (where duckdb-lambda-x86 works instead):

// lib/duckdb.ts
// Resolves to a `query(sql)` function: prefer duckdb-async, and fall back to
// duckdb-lambda-x86 where the prebuilt GLIBC binding fails to load.
const _query: Promise<(query: string) => any> = import("duckdb-async")
    .then(duckdb => duckdb.Database.create(":memory:"))
    .then((db: any) => ((query: string) => db.all(query)))
    .catch(async error => {
        console.log("duckdb init error:", error)
        const { Database }: any = await import("duckdb-lambda-x86")
        const db = new Database(":memory:")
        const connection = db.connect()
        return (query: string) =>
            new Promise((resolve, reject) => {
                connection.all(query, (err: any, res: any) => {
                    if (err) reject(err)
                    else resolve(res)
                })
            })
    })

export { _query }

Sample API endpoint that uses it:

// /api/query.ts
import { _query } from "@/lib/duckdb"
import { NextApiRequest, NextApiResponse } from "next";

// Convert BigInts to numbers
function replacer(key: string, value: any) {
    if (typeof value === 'bigint') {
        return Number(value)
    } else {
        return value;
    }
}

export default async function handler(
    req: NextApiRequest,
    res: NextApiResponse,
) {
    const { body: { path } } = req
    const query = await _query
    const rows = await query(`select * from read_parquet("${path}")`)  // 🚨 unsafe / SQLi 🚨
    res.status(200).send(JSON.stringify(rows, replacer))
}

michaelwallabi commented 8 months ago

FYI for others who run into this. I ended up using @tobilg's duckdb-lambda-x86 to resolve this with Vercel. In my case I'm just replacing the default duckdb.node binary with the duckdb-lambda-x86 version in the CI build.
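A minimal sketch of that CI step (the paths here are assumptions based on the standard duckdb package layout; verify them against the actual installed packages):

```shell
# After npm install, overwrite the default prebuilt binding with the
# Lambda-compatible one from duckdb-lambda-x86 (paths are assumptions):
cp node_modules/duckdb-lambda-x86/lib/binding/duckdb.node \
   node_modules/duckdb/lib/binding/duckdb.node
```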

iku000888 commented 2 months ago

@michaelwallabi Thank you for the tip - replacing the binary at build/deploy time was by far the most ergonomic solution (and the only one that I was able to get working for my project). I want to extend my sincere appreciation to @tobilg for the effort that enabled it in the first place as well.

Ideally, running DuckDB in a Lambda should be easy out of the box, as it is a great use case, so I look forward to future releases that don't require hacks/workarounds.

Dev-rick commented 1 month ago

Even with replacing the binaries, I am getting the following issue on version 1.0.0. (I am on Vercel, Nodejs 20)

Unhandled Rejection: [Error: IO Error: Can't find the home directory at '' Specify a home directory using the SET home_directory='/path/to/dir' option.] { errno: -1, code: 'DUCKDB_NODEJS_ERROR', errorType: 'IO' }

Setting a home directory also results in an error: Error: TypeError: Failed to set configuration option home_directory: Invalid Input Error: Could not set option "home_directory" as a global option at new Database (/var/task/node_modules/duckdb-async/dist/duckdb-async.js:226:19)

Can anyone help me please? Thank you!

iku000888 commented 1 month ago

@Dev-rick this worked for me on aws lambda!

https://github.com/tobilg/serverless-duckdb/blob/87ad3c5d1bbbb8e03a80e6ad943da53c3a556a21/src/functions/query.ts#L73

michaelwallabi commented 1 month ago

Like @iku000888, I do the following when creating a DB, which seems to work:

    import { tmpdir } from "os";  // needed for tmpdir()

    const db = await Database.create(":memory:");
    const tempDirectory = tmpdir() || '/tmp';
    await db.exec(`
        SET home_directory='${tempDirectory}';
        .... other settings here
        `);

Dev-rick commented 1 month ago

@iku000888 and @michaelwallabi Thanks for the input!

Unfortunately I am now getting the following error (on Vercel), on local everything works fine with the same env variables.

Error: HTTP Error: HTTP GET error on 'https://XXX.s3.amazonaws.com/XXX.parquet' (HTTP 400)] { errno: -1, code: 'DUCKDB_NODEJS_ERROR', errorType: 'HTTP' }

My code is:

const S3_LAKE_BUCKET_NAME = process.env.S3_LAKE_BUCKET_NAME
const AWS_S3_ACCESS_KEY = process.env['AWS_S3_ACCESS_KEY']
const AWS_S3_SECRET_KEY = process.env['AWS_S3_SECRET_KEY']
const AWS_S3_REGION = process.env['AWS_S3_REGION']

const retrieveDataFromParquet = async ({
  key,
  sqlStatement,
  tableName,
}: {
  key: string
  sqlStatement: string
  tableName: string
}) => {
  try {
    // Create a new DuckDB database connection
    const db = await Database.create(':memory:')

    console.log('Setting home directory...')
    await db.all(`SET home_directory='/tmp';`)

    console.log('Installing and loading httpfs extension...')
    await db.all(`
      INSTALL httpfs;
      LOAD httpfs;
    `)

    console.log('Setting S3 credentials...')
    await db.all(`
      SET s3_region='${AWS_S3_REGION}';
      SET s3_access_key_id='${AWS_S3_ACCESS_KEY}';
      SET s3_secret_access_key='${AWS_S3_SECRET_KEY}';
    `)

    // Test S3 access
    console.log('Testing S3 access...')
    try {
      const testResult = await db.all(`
        SELECT * FROM parquet_metadata('s3://${S3_LAKE_BUCKET_NAME}/${key}');
      `)
      console.log('S3 access test result successfully loaded:')
    } catch (s3Error) {
      console.error('Error testing S3 access:', s3Error)
      throw s3Error // Rethrow the error to stop execution
    }

    // Try to read file info without actually reading the file
    console.log('Checking file info...')
    try {
      const fileInfo = await db.all(`
        SELECT * FROM parquet_scan('s3://${S3_LAKE_BUCKET_NAME}/${key}') LIMIT 0;
      `)
      console.log('File info loaded')
    } catch (fileError) {
      console.error('Error checking file info:', fileError)
    }

    // If everything above works, try creating the table
    console.log('Creating table...')
    await db.all(
      `CREATE TABLE ${tableName} AS SELECT * FROM parquet_scan('s3://${S3_LAKE_BUCKET_NAME}/${key}');`,
    )

    console.log('Table created successfully')

    // Execute the query (await it before closing the connection)
    const result = await db.all(sqlStatement)

    // Close the database connection
    await db.close()

    // Return the result
    return result as { [k: string]: any }[]
  } catch (error) {
    console.error('Error:', error)
    return null
  }
}

tobilg commented 1 month ago

Have a look at my implementation at https://github.com/tobilg/serverless-duckdb/blob/main/src/lib/awsSecret.ts, and trigger https://github.com/tobilg/serverless-duckdb/blob/main/src/functions/queryS3Express.ts#L95 before any access to S3.

Hint: IMO you also need to pass the SESSION_TOKEN, and possibly the ENDPOINT as well if you're using S3 Express One Zone.

I'm wondering why you're seeing a 400 status (invalid request), and not a 403 status though.

tobilg commented 1 month ago

> @michaelwallabi Thank you for the tip - replacing the binary at build/deploy time was by far the most ergonomic solution (and the only one that I was able to get working for my project). I want to extend my sincere appreciation to @tobilg for the effort that enabled it in the first place as well.

Thank you, appreciate the feedback!

> Ideally, running DuckDB in a Lambda should be easy out of the box, as it is a great use case, so I look forward to future releases that don't require hacks/workarounds.

This is honestly not a "fault" of DuckDB, but of AWS using very outdated GLIBC versions in all Node runtimes before Node 20 (see https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html#runtimes-supported). Node 20 now uses AL2023, which has an updated GLIBC that should work with the normal duckdb-node package as well, AFAIK.

iku000888 commented 1 month ago

> This is honestly not a "fault" of DuckDB, but of AWS using very outdated GLIBC versions in all Node runtimes before Node 20 (see https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html#runtimes-supported). Node 20 now uses AL2023, which has an updated GLIBC that should work with the normal duckdb-node package as well, AFAIK.

Oh hm, that is interesting. I thought I was running my Lambdas on Node 20 and was getting ELF errors, so either AL2023 still has issues or I'm not on Node 20 🤔
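One way to confirm which GLIBC a deployed Node runtime actually ships is via Node's diagnostic report (a sketch; the `glibcVersionRuntime` field only appears on glibc-based Linux builds of Node, so it prints "n/a" elsewhere):

```shell
# Print the Node version and, on glibc-based Linux, the runtime GLIBC version
node -p 'process.version + " / glibc " + (process.report.getReport().header.glibcVersionRuntime || "n/a")'
```

Running this inside the Lambda (or Vercel build) would show whether the function is really on the Node 20 / AL2023 runtime.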