duckdb / duckdb-swift

MIT License
80 stars 17 forks source link

Slow performance with larger databases #3

Open edko99 opened 2 weeks ago

edko99 commented 2 weeks ago

I am experimenting with DuckDB on iOS and visionOS. I was able to make it run both on the simulator and on device with ease. However, when working with larger databases, the performance degrades. It does make use of multiple threads, but does not take advantage of available memory.

I've created a default SwiftUI project and have added DuckDB 1.1.2 as a dependency. Using the CLI, I have created a test database and have saved it in a folder inside the project. The commands to create the database were:

INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 1);

CREATE OR REPLACE VIEW order_details AS
FROM orders, lineitem, customer, part, nation, region
WHERE
   l_orderkey = o_orderkey and
   o_custkey = c_custkey and
   l_partkey = p_partkey and
   c_nationkey = n_nationkey and
   n_regionkey = r_regionkey;

I called the database tpch.duckdb. The project's structure looks like this:

Project Structure

The content of ContentView.swift is:

import SwiftUI
import TabularData
import DuckDB
import Foundation

struct ContentView: View {
    var body: some View {
        Button("test") {
            testDuckDB()
        }
    }
}

func testDuckDB(){
    let query = "FROM histogram(order_details, o_orderpriority)"
    guard let dbURL = copyDatabaseIfNeeded(dbName: "tpch.duckdb") else {
        return
    }
    print("copied database")
    let conf = Database.Configuration()
    try! conf.setValue("4GB", forKey: "memory_limit")
    try! conf.setValue("4GB", forKey: "max_memory")
    let db = try! Database(store: .file(at: dbURL), configuration: conf)
    print("opened database")
    let connection = try! db.connect()
    print("connected to database")
    let result = try! connection.query(query)
    print("queried database")
    let bin_column = result[0].cast(to: String.self)
    let count_column = result[1].cast(to: UInt64.self)
    let df = DataFrame(columns: [
                TabularData.Column(bin_column)
                    .eraseToAnyColumn(),
                TabularData.Column(count_column)
                    .eraseToAnyColumn(),
            ])
    print(df.debugDescription)
}

// Copies the database from bundle to documents folder, if it is not already there
func copyDatabaseIfNeeded(dbName: String) -> URL? {
    let fileManager = FileManager.default
    let documentsUrl = fileManager.urls(for: .documentDirectory, in: .userDomainMask)
    guard documentsUrl.count != 0 else {
        print("no documents directory")
        return nil
    }
    let finalDatabaseURL = documentsUrl.first!.appendingPathComponent(dbName)
    if !( (try? finalDatabaseURL.checkResourceIsReachable()) ?? false) {
        let documentsURL = Bundle.main.resourceURL?.appendingPathComponent(dbName)
        do {
            try fileManager.copyItem(atPath: (documentsURL?.path)!, toPath: finalDatabaseURL.path)
        } catch let error as NSError {
            print("Couldn't copy file to final location! Error:\(error.description)")
            return nil
        }

    }
    return finalDatabaseURL
}

When I run this (either on the simulator or a device) and press the "test" button, the query takes around 2 minutes. The same query on a Mac (CLI) returns almost immediately. It does use 10 threads but the memory usage does not go above 182MB, despite several GB being available (and having set memory_limit and max_memory to 4GB).

edko99 commented 2 days ago

This does not seem to be an iOS only problem. I created the same project targeting macOS, and it has exactly the same issue. DuckDB does not seem to allocate all the available memory that could potentially be available to it.