duckdb / duckdb-rs

Ergonomic bindings to duckdb for Rust
MIT License

Segmentation fault on `try_clone` misuse #312

Open max-celus opened 4 months ago

max-celus commented 4 months ago

After spending some time with duckdb in a multithreaded program, I ran into a segmentation fault. I can be more careful in my own Rust code, but I was wondering if I could fix it upstream for you. To do so I need some guidance on how the underlying API works. The failing test can be viewed here. Its code:

    #[test]
    fn test_clone_after_close() {
        // Additional querying test to make sure our connections are still
        // usable. The original crash would happen without doing any queries.
        fn assert_can_query(conn: &Connection) {
            conn.execute("INSERT INTO test (c1) VALUES (1)", []).expect("insert");
        }

        // 1. Open owned connection
        let owned = checked_memory_handle();
        owned
            .execute_batch("create table test (c1 bigint)")
            .expect("create table");
        assert_can_query(&owned);

        // 2. Create a first clone from owned
        let clone1 = owned.try_clone().expect("first clone");
        assert_can_query(&owned);

        // 3. Close owned connection
        drop(owned);
        assert_can_query(&clone1);

        // 4. Create a second clone from the first clone. Crashes on the inner
        //    `duckdb_connect` with a segmentation fault.
        let clone2 = clone1.try_clone().expect("second clone");
        assert_can_query(&clone1);
        assert_can_query(&clone2);

        // 5. Small additional test
        drop(clone1);
        assert_can_query(&clone2);
    }

The problem is that `try_clone` calls `InnerConnection::new(...)` after the `ffi::duckdb_database` has been set to a null pointer by the close.

I would like to fix it, but there is some overhead involved, and I need some guidance on the internals of the C interface to duckdb. We could put the `ffi::duckdb_database` behind some reference counting (like an `Arc`), and only call `ffi::duckdb_close(...)` when that database object is dropped. This would introduce an extra pointer indirection through the `Arc` plus the overhead of atomic operations.

But to make the entire implementation a bit more sound, we would then also need to change these calls:

This is where I stopped digging through the code, as I would first like to hear whether you would even want such far-reaching changes to the API.

Perhaps there is another (simpler) solution; there are some options, with varying degrees of ugliness:

I'd love to hear from you. For now I'll just keep the "owned" InnerConnection in a very special place in my code :).

era127 commented 4 months ago

I think the workaround is to keep the "owned" reference around and then clone it to make new connections to pass to threads, as is done in the r2d2 connection pool manager type. Have you looked at that yet? It will maintain the reference for you.

Ideally, the database and connection structures should be separated, as they are in other client APIs, into a Database and a Connection type: you first create a Database, then pass an Arc to each task and create a local Connection inside each task. The only atomic operation would then be the one taken when creating the connection within a task. Unfortunately it is not structured that way because the crate was originally derived from rusqlite, but there should be no technical issue preventing the creation of thousands of connections across many tasks. It would require the existing Connection type to be refactored, however.

max-celus commented 4 months ago

Thanks for the reply. I already did something similar inside my own code.

I'll await a reply about possibly changing the API, which I'd be more than happy to work on. But if there is no desire to change it, then feel free to close the issue :)