duckdb / duckdb-rs

Ergonomic bindings to duckdb for Rust
MIT License
517 stars 113 forks source link

Compared to sqlite, query execution speed is several times slower in rust code #282

Closed wilful closed 7 months ago

wilful commented 8 months ago

I have the following code example and run this in MacOS. The databases are completely identical, but if I execute a query in SQLITE, I get a response instantly, if I execute it in DUCKDB, then a query on 1000 rows takes an unreasonably long time, several minutes. On data up to 100 rows, the difference is about 3 times in favor of sqlite.

pub fn database() -> rusqlite::Connection {
    rusqlite::Connection::open("db.sqlite").unwrap()
}

pub fn duck_database() -> duckdb::Connection {
    duckdb::Connection::open("db.duckdb").unwrap()
}

#[derive(Debug)]
struct Income(usize, f64, usize, usize, String);

impl Income {
    fn select(start: u32, end: u32) -> Result<Vec<Self>, Box<dyn std::error::Error>> {
        let conn = database();
        let mut arr = Vec::new();
        let sql = format!(
            "SELECT created_at, amount, category_id, wallet_id, meta \
            FROM 'income' \
            WHERE created_at >= {} AND created_at <= {}",
            start, end
        );
        let mut stmt = conn.prepare(&sql)?;
        let result_iter = stmt.query_map([], |row| {
            Ok(Self(
                row.get(0)?,
                row.get(1)?,
                row.get(2)?,
                row.get(3)?,
                row.get(4)?,
            ))
        })?;
        for result in result_iter {
            arr.push(result?);
        }
        Ok(arr)
    }
}

fn main() {
    // SELECT created_at, amount, category_id, wallet_id, meta FROM 'income' WHERE created_at >= 1709292049 AND created_at <= 1711375239;
    let out = Income::select(1709292049, 1711375239).unwrap();
    for i in out {
        println!("{:?}", i);
    }
}

I am using the default settings, without any edits. This is expected behavior or is there a problem?

Example database output.csv

SQLITE: cargo run &> /dev/null  0.05s user 0.04s system 30% cpu 0.311 total
DUCKDB: cargo run &> /dev/null  0.28s user 0.06s system 6% cpu 4.982 total
Swoorup commented 8 months ago

Curious how the results are directly using the duckdb cli tool.

wilful commented 8 months ago

Curious how the results are directly using the duckdb cli tool.

Yes, I forgot to write. This request is executed instantly in the console

Swoorup commented 8 months ago

Would be better to have a benchmark along with a sample database if you can.

wilful commented 8 months ago

Would be better to have a benchmark along with a sample database if you can.

I have added an example database

era127 commented 8 months ago

This might not be obvious but the rust client api will get the query result from duckdb using the arrow interface, not the native c api to duckdb as other client apis do. In this example you’re getting querying and then iterating through a columnar arrow object and copying the data into a row wise vector of struct.

Swoorup commented 8 months ago

This might not be obvious but the rust client api will get the query result from duckdb using the arrow interface, not the native c api to duckdb as other client apis do. In this example you’re getting querying and then iterating through a columnar arrow object and copying the data into a row wise vector of struct.

But data is stored more or less in arrow format, since its a columnar store?

Mause commented 7 months ago

Are you sure you're not measuring compilation time here @wilful?

Swoorup commented 7 months ago

I can confirm it's indeed slower.

Arrow is slightly faster but still slower than sqlite.

use std::sync::Arc;
use std::time::Instant;

use arrow::array::{ArrayRef, RecordBatch, StructArray};
use arrow_convert::deserialize::TryIntoCollection;
use arrow_convert::{ArrowDeserialize, ArrowField, ArrowSerialize};

pub fn database() -> rusqlite::Connection {
  rusqlite::Connection::open("db.sqlite").unwrap()
}

pub fn duck_database() -> duckdb::Connection {
  duckdb::Connection::open("db.duckdb").unwrap()
}

#[derive(Debug, ArrowField, ArrowSerialize, ArrowDeserialize)]
struct Income {
  created_at: Option<i32>,
  amount: Option<f32>,
  category_id: Option<i32>,
  wallet_id: Option<i32>,
  meta: Option<String>,
}

impl Income {
  fn select_duckdb_arrow(start: u32, end: u32) -> Result<Vec<Self>, Box<dyn std::error::Error>> {
    let conn = duck_database();
    let mut arr = Vec::new();
    let sql = format!(
      "SELECT created_at, amount, category_id, wallet_id, meta \
          FROM 'income' \
          WHERE created_at >= {} AND created_at <= {}",
      start, end
    );
    let mut stmt = conn.prepare(&sql)?;
    let arrow = stmt.query_arrow([])?.collect::<Vec<RecordBatch>>();
    for batch in arrow {
      let array: ArrayRef = Arc::new(StructArray::from(batch));
      let result: Vec<Income> = array.try_into_collection().unwrap();
      arr.extend(result);
    }
    Ok(arr)
  }

  fn select_duckdb(start: u32, end: u32) -> Result<Vec<Self>, Box<dyn std::error::Error>> {
    let conn = duck_database();
    let mut arr = Vec::new();
    let sql = format!(
      "SELECT created_at, amount, category_id, wallet_id, meta \
          FROM 'income' \
          WHERE created_at >= {} AND created_at <= {}",
      start, end
    );
    let mut stmt = conn.prepare(&sql)?;
    let result_iter = stmt.query_map([], |row| {
      Ok(Self {
        created_at: row.get(0)?,
        amount: row.get(1)?,
        category_id: row.get(2)?,
        wallet_id: row.get(3)?,
        meta: row.get(4)?,
      })
    })?;
    for result in result_iter {
      arr.push(result?);
    }
    Ok(arr)
  }

  fn select_sqlite(start: u32, end: u32) -> Result<Vec<Self>, Box<dyn std::error::Error>> {
    let conn = database();
    let mut arr = Vec::new();
    let sql = format!(
      "SELECT created_at, amount, category_id, wallet_id, meta \
          FROM 'income' \
          WHERE created_at >= {} AND created_at <= {}",
      start, end
    );
    let mut stmt = conn.prepare(&sql)?;
    let result_iter = stmt.query_map([], |row| {
      Ok(Self {
        created_at: row.get(0)?,
        amount: row.get(1)?,
        category_id: row.get(2)?,
        wallet_id: row.get(3)?,
        meta: row.get(4)?,
      })
    })?;
    for result in result_iter {
      arr.push(result?);
    }
    Ok(arr)
  }
}

fn bencher(name: &'static str, f: impl Fn() -> ()) -> impl Fn() -> () {
  move || {
    let start = Instant::now();
    f();
    let duration = start.elapsed();
    println!("Time elapsed in {name} is: {:?}", duration);
  }
}

fn main() {
  let sqlite_test = bencher("sqlite_test", || {
    // SELECT created_at, amount, category_id, wallet_id, meta FROM 'income' WHERE created_at >= 1709292049 AND created_at <= 1711375239;
    let out = Income::select_sqlite(1709292049, 1711375239).unwrap();
    println!("Got {:?} records", out.len());
  });

  let duckdb_test = bencher("duckdb_test", || {
    // SELECT created_at, amount, category_id, wallet_id, meta FROM 'income' WHERE created_at >= 1709292049 AND created_at <= 1711375239;
    let out = Income::select_duckdb(1709292049, 1711375239).unwrap();
    println!("Got {:?} records", out.len());
  });

  let duckdb_arrow_test = bencher("duckdb_test_arrow", || {
    // SELECT created_at, amount, category_id, wallet_id, meta FROM 'income' WHERE created_at >= 1709292049 AND created_at <= 1711375239;
    let out = Income::select_duckdb_arrow(1709292049, 1711375239).unwrap();
    println!("Got {:?} records", out.len());
  });

  for _ in 0..3 {
    sqlite_test();
    duckdb_test();
    duckdb_arrow_test();
  }
}
Mause commented 7 months ago

While I am seeing a difference, it's not in the order of minutes as you stated

Sqlite: 303.154µs DuckDB: 19.454103ms

Granted that's a 64 times difference, but it's not minutes

Swoorup commented 7 months ago

flamegraph

I went a bit further. Here are my findings:

Would be nicer to expand this to a wider benchmark suite.

I now get, all under more or less same numbers

Time elapsed in sqlite_test is: 52.125µs
Time elapsed in duckdb_test is: 55.291µs
Time elapsed in duckdb_test_arrow is: 56.459µs
Code ```rust use std::sync::Arc; use std::time::Instant; use arrow::array::{ArrayRef, RecordBatch, StructArray}; use arrow_convert::deserialize::TryIntoCollection; use arrow_convert::{ArrowDeserialize, ArrowField, ArrowSerialize}; pub fn sqlite_db() -> rusqlite::Connection { rusqlite::Connection::open("db.sqlite").unwrap() } pub fn duck_database() -> duckdb::Connection { duckdb::Connection::open("db.duckdb").unwrap() } #[derive(Debug, ArrowField, ArrowSerialize, ArrowDeserialize)] struct Income { created_at: Option, amount: Option, category_id: Option, wallet_id: Option, meta: Option, } impl Income { fn select_duckdb_arrow( conn: &duckdb::Connection, start: u32, end: u32, ) -> Result, Box> { let mut arr = Vec::new(); let sql = format!( "SELECT created_at, amount, category_id, wallet_id, meta \ FROM 'income' \ WHERE created_at >= {} AND created_at <= {}", start, end ); let mut stmt = conn.prepare_cached(&sql)?; let arrow = stmt.query_arrow([])?.collect::>(); for batch in arrow { let array: ArrayRef = Arc::new(StructArray::from(batch)); let result: Vec = array.try_into_collection().unwrap(); arr.extend(result); } Ok(arr) } fn select_duckdb(conn: &duckdb::Connection, start: u32, end: u32) -> Result, Box> { let mut arr = Vec::new(); let sql = format!( "SELECT created_at, amount, category_id, wallet_id, meta \ FROM 'income' \ WHERE created_at >= {} AND created_at <= {}", start, end ); let mut stmt = conn.prepare_cached(&sql)?; let result_iter = stmt.query_map([], |row| { Ok(Self { created_at: row.get(0)?, amount: row.get(1)?, category_id: row.get(2)?, wallet_id: row.get(3)?, meta: row.get(4)?, }) })?; for result in result_iter { arr.push(result?); } Ok(arr) } fn select_sqlite(conn: &rusqlite::Connection, start: u32, end: u32) -> Result, Box> { let mut arr = Vec::new(); let sql = format!( "SELECT created_at, amount, category_id, wallet_id, meta \ FROM 'income' \ WHERE created_at >= {} AND created_at <= {}", start, end ); let mut stmt = conn.prepare(&sql)?; let result_iter = stmt.query_map([], |row| { Ok(Self { created_at: row.get(0)?, amount: row.get(1)?, category_id: row.get(2)?, wallet_id: row.get(3)?, meta: row.get(4)?, }) })?; for result in result_iter { arr.push(result?); } Ok(arr) } } fn bencher(name: &'static str, f: impl Fn() -> ()) -> impl Fn() -> () { move || { let start = Instant::now(); f(); let duration = start.elapsed(); println!("Time elapsed in {name} is: {:?}", duration); } } fn main() { let duckdb_conn = duck_database(); let sqlite_conn = sqlite_db(); let sqlite_test = bencher("sqlite_test", || { // SELECT created_at, amount, category_id, wallet_id, meta FROM 'income' WHERE created_at >= 1709292049 AND created_at <= 1711375239; let out = Income::select_sqlite(&sqlite_conn, 1709292049, 1711375239).unwrap(); println!("Got {:?} records", out.len()); }); let duckdb_test = bencher("duckdb_test", || { // SELECT created_at, amount, category_id, wallet_id, meta FROM 'income' WHERE created_at >= 1709292049 AND created_at <= 1711375239; let out = Income::select_duckdb(&duckdb_conn, 1709292049, 1711375239).unwrap(); println!("Got {:?} records", out.len()); }); let duckdb_arrow_test = bencher("duckdb_test_arrow", || { // SELECT created_at, amount, category_id, wallet_id, meta FROM 'income' WHERE created_at >= 1709292049 AND created_at <= 1711375239; let out = Income::select_duckdb_arrow(&duckdb_conn, 1709292049, 1711375239).unwrap(); println!("Got {:?} records", out.len()); }); for _ in 0..10_000 { sqlite_test(); duckdb_test(); duckdb_arrow_test(); } } ```
wilful commented 7 months ago

Unfortunately, my technical skills in RUST do not allow me to conduct any accurate tests. In this example, I compiled my application with DUCKDB and SQLITE and performed a selection of 1000 items each (the database is exactly the same as in the example that I added)

image

The difference is huge and is deteriorating exponentially

Mause commented 7 months ago

Sounds like the sample code you provided isn't actually representative of your code then?

wilful commented 7 months ago

Unfortunately, I can't demonstrate the entire code to everyone. But I have highlighted all the functions that use databases. Otherwise, the launch is identical. =(