Closed: wilful closed this 7 months ago
Curious what the results are when using the duckdb CLI tool directly.
Yes, I forgot to mention: this query executes instantly in the console.
It would be better to have a benchmark along with a sample database, if you can.
I have added an example database.
This might not be obvious, but the Rust client API gets the query result from DuckDB via the Arrow interface, not the native C API that the other client APIs use. In this example you're querying and then iterating through a columnar Arrow object, copying the data into a row-wise vector of structs.
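That columnar-to-row copy can be sketched with plain Vecs standing in for the Arrow column arrays (the names below are illustrative, not the arrow_convert API):

```rust
// Toy model of the columnar -> row-wise copy the Arrow path performs:
// each output struct is assembled by pulling one value out of every
// column vector. Real code reads arrow RecordBatch columns instead.
#[derive(Debug, PartialEq)]
struct IncomeRow {
    created_at: Option<i32>,
    amount: Option<f32>,
}

fn columns_to_rows(created_at: &[Option<i32>], amount: &[Option<f32>]) -> Vec<IncomeRow> {
    created_at
        .iter()
        .zip(amount)
        .map(|(c, a)| IncomeRow { created_at: *c, amount: *a })
        .collect()
}

fn main() {
    let rows = columns_to_rows(&[Some(1), None], &[Some(2.5), Some(3.0)]);
    println!("{} rows", rows.len());
}
```

Every row materialized this way is an extra copy on top of what the Arrow buffers already hold, which is part of the overhead being discussed here.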
But the data is stored more or less in Arrow format anyway, since it's a columnar store?
Are you sure you're not measuring compilation time here @wilful?
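One way to rule that out is to build with `cargo run --release` and add a warm-up pass so that one-off setup cost (statement compilation, cache warming) is not counted in the measurement. A minimal std-only timing helper, with hypothetical names, might look like:

```rust
use std::time::Instant;

// Time a closure after one unmeasured warm-up run, so one-off setup
// cost does not distort the result. Illustrative sketch only.
fn time_it<F: Fn()>(name: &str, f: F) {
    f(); // warm-up run, not measured
    let start = Instant::now();
    f();
    println!("{name}: {:?}", start.elapsed());
}

fn main() {
    time_it("sum", || {
        let s: u64 = (0..1_000).sum();
        assert_eq!(s, 499_500);
    });
}
```

Also note that `cargo run` without `--release` builds an unoptimized debug binary, which can easily inflate timings by an order of magnitude.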
I can confirm it's indeed slower.
Arrow is slightly faster but still slower than sqlite.
use std::sync::Arc;
use std::time::Instant;

use arrow::array::{ArrayRef, RecordBatch, StructArray};
use arrow_convert::deserialize::TryIntoCollection;
use arrow_convert::{ArrowDeserialize, ArrowField, ArrowSerialize};

pub fn database() -> rusqlite::Connection {
    rusqlite::Connection::open("db.sqlite").unwrap()
}

pub fn duck_database() -> duckdb::Connection {
    duckdb::Connection::open("db.duckdb").unwrap()
}

#[derive(Debug, ArrowField, ArrowSerialize, ArrowDeserialize)]
struct Income {
    created_at: Option<i32>,
    amount: Option<f32>,
    category_id: Option<i32>,
    wallet_id: Option<i32>,
    meta: Option<String>,
}

impl Income {
    fn select_duckdb_arrow(start: u32, end: u32) -> Result<Vec<Self>, Box<dyn std::error::Error>> {
        let conn = duck_database();
        let mut arr = Vec::new();
        let sql = format!(
            "SELECT created_at, amount, category_id, wallet_id, meta \
             FROM 'income' \
             WHERE created_at >= {} AND created_at <= {}",
            start, end
        );
        let mut stmt = conn.prepare(&sql)?;
        let arrow = stmt.query_arrow([])?.collect::<Vec<RecordBatch>>();
        for batch in arrow {
            let array: ArrayRef = Arc::new(StructArray::from(batch));
            let result: Vec<Income> = array.try_into_collection().unwrap();
            arr.extend(result);
        }
        Ok(arr)
    }

    fn select_duckdb(start: u32, end: u32) -> Result<Vec<Self>, Box<dyn std::error::Error>> {
        let conn = duck_database();
        let mut arr = Vec::new();
        let sql = format!(
            "SELECT created_at, amount, category_id, wallet_id, meta \
             FROM 'income' \
             WHERE created_at >= {} AND created_at <= {}",
            start, end
        );
        let mut stmt = conn.prepare(&sql)?;
        let result_iter = stmt.query_map([], |row| {
            Ok(Self {
                created_at: row.get(0)?,
                amount: row.get(1)?,
                category_id: row.get(2)?,
                wallet_id: row.get(3)?,
                meta: row.get(4)?,
            })
        })?;
        for result in result_iter {
            arr.push(result?);
        }
        Ok(arr)
    }

    fn select_sqlite(start: u32, end: u32) -> Result<Vec<Self>, Box<dyn std::error::Error>> {
        let conn = database();
        let mut arr = Vec::new();
        let sql = format!(
            "SELECT created_at, amount, category_id, wallet_id, meta \
             FROM 'income' \
             WHERE created_at >= {} AND created_at <= {}",
            start, end
        );
        let mut stmt = conn.prepare(&sql)?;
        let result_iter = stmt.query_map([], |row| {
            Ok(Self {
                created_at: row.get(0)?,
                amount: row.get(1)?,
                category_id: row.get(2)?,
                wallet_id: row.get(3)?,
                meta: row.get(4)?,
            })
        })?;
        for result in result_iter {
            arr.push(result?);
        }
        Ok(arr)
    }
}

fn bencher(name: &'static str, f: impl Fn()) -> impl Fn() {
    move || {
        let start = Instant::now();
        f();
        let duration = start.elapsed();
        println!("Time elapsed in {name} is: {:?}", duration);
    }
}

fn main() {
    let sqlite_test = bencher("sqlite_test", || {
        // SELECT created_at, amount, category_id, wallet_id, meta FROM 'income' WHERE created_at >= 1709292049 AND created_at <= 1711375239;
        let out = Income::select_sqlite(1709292049, 1711375239).unwrap();
        println!("Got {:?} records", out.len());
    });
    let duckdb_test = bencher("duckdb_test", || {
        // SELECT created_at, amount, category_id, wallet_id, meta FROM 'income' WHERE created_at >= 1709292049 AND created_at <= 1711375239;
        let out = Income::select_duckdb(1709292049, 1711375239).unwrap();
        println!("Got {:?} records", out.len());
    });
    let duckdb_arrow_test = bencher("duckdb_test_arrow", || {
        // SELECT created_at, amount, category_id, wallet_id, meta FROM 'income' WHERE created_at >= 1709292049 AND created_at <= 1711375239;
        let out = Income::select_duckdb_arrow(1709292049, 1711375239).unwrap();
        println!("Got {:?} records", out.len());
    });
    for _ in 0..3 {
        sqlite_test();
        duckdb_test();
        duckdb_arrow_test();
    }
}
While I am seeing a difference, it's not on the order of minutes as you stated:
SQLite: 303.154µs, DuckDB: 19.454103ms
Granted, that's a 64× difference, but it's not minutes.
I went a bit further. Here are my findings: rusqlite does it by default. It would be nicer to expand this into a wider benchmark suite.
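If the "does it by default" above refers to prepared-statement caching (rusqlite exposes this as `Connection::prepare_cached`), its effect can be modelled with a toy cache. `PreparedStmt` and `StmtCache` below are illustrative stand-ins, not driver types:

```rust
use std::collections::HashMap;

// Stand-in for a driver's compiled statement handle.
struct PreparedStmt {
    _sql: String,
}

// Toy prepared-statement cache: a repeated SQL string reuses the
// already-"compiled" statement instead of preparing it again.
struct StmtCache {
    cache: HashMap<String, PreparedStmt>,
    prepares: usize, // how many times we actually compiled
}

impl StmtCache {
    fn new() -> Self {
        StmtCache { cache: HashMap::new(), prepares: 0 }
    }

    fn prepare_cached(&mut self, sql: &str) -> &PreparedStmt {
        if !self.cache.contains_key(sql) {
            self.prepares += 1; // cache miss: compile once
            self.cache
                .insert(sql.to_string(), PreparedStmt { _sql: sql.to_string() });
        }
        &self.cache[sql]
    }
}

fn main() {
    let mut c = StmtCache::new();
    for _ in 0..3 {
        c.prepare_cached("SELECT 1");
    }
    println!("prepared {} time(s)", c.prepares);
}
```

Note that the benchmark in this thread also builds the SQL with `format!` on every call, which produces a different string per parameter pair and would defeat such a cache; binding `start`/`end` as query parameters would keep the SQL text stable.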
I now get more or less the same numbers:
Time elapsed in sqlite_test is: 52.125µs
Time elapsed in duckdb_test is: 55.291µs
Time elapsed in duckdb_test_arrow is: 56.459µs
Unfortunately, my technical skills in Rust do not allow me to conduct accurate tests. In this example, I compiled my application with DuckDB and SQLite and selected 1000 items from each (the database is exactly the same as in the example I added).
The difference is huge and grows exponentially.
Sounds like the sample code you provided isn't actually representative of your code then?
Unfortunately, I can't share the entire codebase with everyone, but I have included all the functions that use the databases. Otherwise, the setup is identical. =(
I have the following code example, run on macOS. The databases are completely identical, but if I execute a query in SQLite I get a response instantly, while in DuckDB a query over 1000 rows takes an unreasonably long time, several minutes. On data up to 100 rows, the difference is about 3× in favor of SQLite.
I am using the default settings, without any edits. Is this expected behavior, or is there a problem?
Example database: output.csv