Open ItsMeBrianD opened 1 year ago
@Mause we're using the NodeJS client - we're not sure, but perhaps this is new in 0.7.1?
Which version does it work with? We can check for changes
As far as I know, Vercel runs on AWS Lambda, so I'm having a hard time imagining that this has worked before: Lambda environments are currently based on Amazon Linux 2, which uses GLIBC 2.26. See https://repost.aws/questions/QUrXOioL46RcCnFGyELJWKLw/glibc-2-27-on-amazon-linux-2
I guess you could download my DuckDB for Lambda layer, and extract the build artifacts: https://github.com/tobilg/duckdb-nodejs-layer#arns
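If you want to verify which GLIBC version a given runtime actually ships, here is a quick sketch; it assumes a glibc-based Linux image (on musl-based images such as Alpine, ldd reports musl instead):

```ts
// Logs the libc version of the environment this code runs in.
import { execSync } from "child_process";

const libcInfo = execSync("ldd --version").toString().split("\n")[0];
console.log("Runtime libc:", libcInfo); // e.g. "ldd (GNU libc) 2.26" on Amazon Linux 2
```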
Experiencing a similar error on Vercel with both Node 18.x and 16.x.
I therefore created https://www.npmjs.com/package/duckdb-lambda-x86, which should solve the actual issue.
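For reference, a minimal sketch of using that package directly, assuming it exposes the same API as the regular duckdb Node bindings (the wrapper further down in this thread uses it the same way):

```ts
// Minimal sketch -- assumes duckdb-lambda-x86 mirrors the duckdb package's API
import * as duckdb from "duckdb-lambda-x86";

const db = new duckdb.Database(":memory:");
const connection = db.connect();

connection.all("SELECT 42 AS answer", (err: any, rows: any) => {
  if (err) throw err;
  console.log(rows); // [ { answer: 42 } ]
});
```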
@archiewood any updates?
I've encountered the same problem as described. Specifically, I'm using duckdb@0.7.1.
Environment: node:14

Steps to Reproduce:

```bash
docker run --rm -it node:14 bash

# inside the node:14 container
mkdir app && cd app
yarn init -y
yarn add duckdb@0.7.1
cd node_modules/duckdb
npm test
```
Are there any necessary packages that I need to install?
Translated by ChatGPT. Sorry, my English is not good. I hope there's no offense.
@hanshino the default duckdb npm package will not work IMO due to GLIBC incompatibilities, as described above. For Lambda usage, I maintain the https://www.npmjs.com/package/duckdb-lambda-x86 package, which should fix your issues.
Here's a wrapper over duckdb-async and duckdb-lambda-x86 that I just wrote, which seems to work both on my M1 MacBook (which requires duckdb-async) and on an EC2 instance where I was previously hitting the GLIBC_2.29 error (where duckdb-lambda-x86 works instead):
```ts
// lib/duckdb.ts
// Try the regular duckdb-async build first; if it fails to load
// (e.g. GLIBC mismatch), fall back to duckdb-lambda-x86.
let _query: Promise<(query: string) => any>

_query = import("duckdb-async")
  .then(duckdb => duckdb.Database)
  .then(Database => Database.create(":memory:"))
  .then((db: any) => (query: string) => db.all(query))
  .catch(async error => {
    console.log("duckdb init error:", error)
    const duckdb = await import("duckdb-lambda-x86");
    const Database: any = duckdb.Database;
    const db = new Database(":memory:")
    const connection = db.connect()
    return (query: string) => {
      return new Promise((resolve, reject) => {
        connection.all(query, (err: any, res: any) => {
          if (err) reject(err);
          else resolve(res);
        })
      })
    }
  })

export { _query }
```
Sample API endpoint that uses it:
```ts
// /api/query.ts
import { _query } from "@/lib/duckdb"
import { NextApiRequest, NextApiResponse } from "next";

// Convert BigInts to numbers
function replacer(key: string, value: any) {
  if (typeof value === 'bigint') {
    return Number(value)
  } else {
    return value;
  }
}

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse,
) {
  const { body: { path } } = req
  const query = await _query
  const rows = await query(`select * from read_parquet("${path}")`) // 🚨 unsafe / SQLi 🚨
  res.status(200).send(JSON.stringify(rows, replacer))
}
```
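The interpolation flagged with 🚨 above deserves tightening before this goes anywhere near production. Here is a minimal sketch of one way to narrow it with plain input validation; the helper name and allow-list pattern are illustrative, not part of the original endpoint:

```ts
// Hypothetical guard for the user-supplied `path` before it is interpolated into SQL.
// The pattern is an assumption -- adjust it to whatever your parquet locations look like.
function assertSafeParquetPath(path: unknown): string {
  if (typeof path !== "string" || !/^[\w\-./:]+\.parquet$/.test(path)) {
    throw new Error("Rejected suspicious parquet path");
  }
  return path;
}

// Usage inside the handler above:
// const rows = await query(`select * from read_parquet('${assertSafeParquetPath(path)}')`)
```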
FYI for others who run into this: I ended up using @tobilg's duckdb-lambda-x86 to resolve this with Vercel. In my case I'm just replacing the default duckdb.node binary with the duckdb-lambda-x86 version in the CI build.
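A sketch of what such a CI step could look like; the source path inside duckdb-lambda-x86 is an assumption (check the package contents), while the target path comes from the GLIBC error message in this issue:

```ts
// scripts/patch-duckdb-binary.ts -- hypothetical post-install step, run after `npm install`
import { copyFileSync, existsSync } from "fs";
import { join } from "path";

// Assumed location of the prebuilt Lambda-compatible binary
const source = join("node_modules", "duckdb-lambda-x86", "duckdb.node");
// Path reported in the GLIBC error for the default duckdb package
const target = join("node_modules", "duckdb", "lib", "binding", "duckdb.node");

if (!existsSync(source)) {
  throw new Error(`Expected prebuilt binary at ${source}; check the duckdb-lambda-x86 package layout`);
}

copyFileSync(source, target);
console.log(`Replaced ${target} with the Lambda-compatible build`);
```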
@michaelwallabi Thank you for the tip - replacing the binary at build/deploy time was by far the most ergonomic solution (and the only one that I was able to get to work for my project). I want to extend my sincere appreciation to @tobilg for the effort that enabled it in the first place as well.
Ideally running DuckDB in a Lambda should be easy out of the box as it is a great use case, so I look forward to future releases that don't require hacks/workarounds.
Even with replacing the binaries, I am getting the following issue on version 1.0.0 (I am on Vercel, Node.js 20):

```
Unhandled Rejection: [Error: IO Error: Can't find the home directory at '' Specify a home directory using the SET home_directory='/path/to/dir' option.] { errno: -1, code: 'DUCKDB_NODEJS_ERROR', errorType: 'IO' }
```

Setting a home directory also results in an error:

```
Error: TypeError: Failed to set configuration option home_directory: Invalid Input Error: Could not set option "home_directory" as a global option at new Database (/var/task/node_modules/duckdb-async/dist/duckdb-async.js:226:19)
```

Can anyone help me please? Thank you!
@Dev-rick this worked for me on AWS Lambda!
Like @iku000888, I do the following when creating a DB, which seems to work:
```ts
import { tmpdir } from "os";

const db = Database.create(":memory:");
const tempDirectory = tmpdir() || '/tmp';
await (await db).exec(`
  SET home_directory='${tempDirectory}';
  -- ... other settings here
`);
```
@iku000888 and @michaelwallabi Thanks for the input!
Unfortunately I am now getting the following error (on Vercel); locally everything works fine with the same env variables:

```
Error: HTTP Error: HTTP GET error on 'https://XXX.s3.amazonaws.com/XXX.parquet' (HTTP 400)] { errno: -1, code: 'DUCKDB_NODEJS_ERROR', errorType: 'HTTP' }
```
My code is:
```ts
import { Database } from "duckdb-async"

const S3_LAKE_BUCKET_NAME = process.env.S3_LAKE_BUCKET_NAME
const AWS_S3_ACCESS_KEY = process.env['AWS_S3_ACCESS_KEY']
const AWS_S3_SECRET_KEY = process.env['AWS_S3_SECRET_KEY']
const AWS_S3_REGION = process.env['AWS_S3_REGION']

const retrieveDataFromParquet = async ({
  key,
  sqlStatement,
  tableName,
}: {
  key: string
  sqlStatement: string
  tableName: string
}) => {
  try {
    // Create a new DuckDB database connection
    const db = await Database.create(':memory:')

    console.log('Setting home directory...')
    await db.all(`SET home_directory='/tmp';`)

    console.log('Installing and loading httpfs extension...')
    await db.all(`
      INSTALL httpfs;
      LOAD httpfs;
    `)

    console.log('Setting S3 credentials...')
    await db.all(`
      SET s3_region='${AWS_S3_REGION}';
      SET s3_access_key_id='${AWS_S3_ACCESS_KEY}';
      SET s3_secret_access_key='${AWS_S3_SECRET_KEY}';
    `)

    // Test S3 access
    console.log('Testing S3 access...')
    try {
      const testResult = await db.all(`
        SELECT * FROM parquet_metadata('s3://${S3_LAKE_BUCKET_NAME}/${key}');
      `)
      console.log('S3 access test result successfully loaded:', testResult.length, 'rows')
    } catch (s3Error) {
      console.error('Error testing S3 access:', s3Error)
      throw s3Error // Rethrow the error to stop execution
    }

    // Try to read file info without actually reading the file
    console.log('Checking file info...')
    try {
      const fileInfo = await db.all(`
        SELECT * FROM parquet_scan('s3://${S3_LAKE_BUCKET_NAME}/${key}') LIMIT 0;
      `)
      console.log('File info loaded:', fileInfo.length, 'rows')
    } catch (fileError) {
      console.error('Error checking file info:', fileError)
    }

    // If everything above works, try creating the table
    console.log('Creating table...')
    await db.all(
      `CREATE TABLE ${tableName} AS SELECT * FROM parquet_scan('s3://${S3_LAKE_BUCKET_NAME}/${key}');`,
    )
    console.log('Table created successfully')

    // Execute the query before closing the connection
    const result = await db.all(sqlStatement)

    // Close the database connection
    await db.close()

    // Return the result
    return result as { [k: string]: any }[]
  } catch (error) {
    console.error('Error:', error)
    return null
  }
}
```
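For context, a usage sketch for the function above; the key, table name, and statement are placeholders:

```ts
// Hypothetical invocation -- values are illustrative only
const rows = await retrieveDataFromParquet({
  key: "exports/2024/orders.parquet",
  tableName: "orders",
  sqlStatement: "SELECT count(*) AS row_count FROM orders",
})

if (rows === null) {
  console.error("Query failed; see the logs above")
} else {
  console.log(rows) // count(*) comes back as a BigInt, hence the replacer trick earlier in this thread
}
```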
Have a look at my implementation at https://github.com/tobilg/serverless-duckdb/blob/main/src/lib/awsSecret.ts, and at https://github.com/tobilg/serverless-duckdb/blob/main/src/functions/queryS3Express.ts#L95 where it is triggered before any access to S3.
Hint: IMO you also need to pass the SESSION_TOKEN, and possibly the ENDPOINT as well if you're using S3 Express One Zone.
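For completeness, a sketch of those extra settings on top of the SET statements already shown above; the environment variable names are placeholders, while s3_session_token and s3_endpoint are standard DuckDB httpfs options:

```ts
// Sketch only -- add alongside the existing SET statements when using temporary
// credentials (session token) and/or a non-default S3 endpoint.
await db.all(`
  SET s3_session_token='${process.env.AWS_S3_SESSION_TOKEN}';
  SET s3_endpoint='${process.env.AWS_S3_ENDPOINT}';
`)
```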
I'm wondering why you're seeing a 400 status (invalid request), and not a 403 status though.
> @michaelwallabi Thank you for the tip - replacing the binary at build/deploy time was by far the most ergonomic solution (and the only one that I was able to get to work for my project). I want to extend my sincere appreciation to @tobilg for the effort that enabled it in the first place as well.
Thank you, appreciate the feedback!
> Ideally running DuckDB in a Lambda should be easy out of the box as it is a great use case, so I look forward to future releases that don't require hacks/workarounds.

This is honestly not a "fault" of DuckDB, but of AWS using very outdated GLIBC versions in any Node runtime before Node 20 (see https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html#runtimes-supported), as Node 20 now uses AL2023, which has an updated GLIBC that should work with the normal duckdb-node package as well, afaik.
> This is honestly not a "fault" of DuckDB, but of AWS using very outdated GLIBC versions in any Node runtime before Node 20 (see https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html#runtimes-supported), as Node 20 now uses AL2023, which has an updated GLIBC that should work with the normal duckdb-node package as well, afaik.
Oh hm that is interesting. I thought I was running my lambdas on Node 20 and was getting ELF errors, so either AL 2023 still has issues or I'm not on Node 20 🤔
What happens?
When attempting to deploy a JavaScript project to Vercel that leverages SSR and DuckDB, the build fails.
The error message presented by DuckDB is:

```
/lib64/libm.so.6: version 'GLIBC_2.29' not found (required by /vercel/path0/node_modules/duckdb/lib/binding/duckdb.node)
```

This has worked previously.
To Reproduce
This repo has a simple reproduction of the issue: https://github.com/ItsMeBrianD/duckdb-vercel-repro. Simply create a Vercel project based on it (or a fork), and the build will fail with the error message above.
OS: Vercel
DuckDB Version: 0.7.1
DuckDB Client: node
Full Name: Brian Donald
Affiliation: Evidence
Have you tried this on the latest master branch? Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?