freevryheid / duckdb

Fortran bindings to the DuckDB C API
MIT License

duckdb


Introduction

At the time of writing, DuckDB is at version 1.0.0 (stable release). The Fortran module in this repository wraps the DuckDB C API. It is still under development, but most of the API functions have been wrapped, allowing access to DuckDB databases, querying and data extraction.

DuckDB provides column-based data storage, in contrast to row-based databases such as SQLite and PostgreSQL. Using the API requires a basic understanding of this data structure.

Data structure

DuckDB can read data directly from CSV, Parquet and JSON files, even if they are compressed:

select * from 'data.csv.gz'

DuckDB databases can be in-memory or file-based. The wrapper provides functions to initialize the database and the connection, as shown below. The path argument of duckdb_open is optional; if no path is provided, an in-memory database is used.


type(duckdb_database) :: db
type(duckdb_connection) :: con

if (duckdb_open(db=db) == duckdberror) then
  error stop "open error"
end if

if (duckdb_connect(db, con) == duckdberror) then
  call duckdb_close(db)
  error stop "connect error"
end if

call duckdb_disconnect(con)
call duckdb_close(db)

Once a connection to the database has been established, it can be queried to extract results. Note that SQL strings passed to DuckDB must be null-terminated.

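use, intrinsic :: iso_c_binding, only : c_null_char

type(duckdb_connection) :: con
type(duckdb_result) :: res
integer(kind(duckdb_state)) :: r
character(len=:), allocatable :: path ! file to query, e.g. 'data.csv.gz'
character(len=:), allocatable :: sql

! append c_null_char so the C library sees a terminated string
sql = "select * from '" // path // "';" // c_null_char
r = duckdb_query(con, sql, res)
if (r == duckdberror) error stop "query error"

deallocate(sql)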
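
The null-termination requirement can be wrapped in a small helper so that it is not forgotten at each call site. The sketch below is plain Fortran and does not call DuckDB; the module name cstr_sketch and the function to_c_string are hypothetical names chosen for illustration, not part of this wrapper.

```fortran
module cstr_sketch
  use, intrinsic :: iso_c_binding, only : c_null_char
  implicit none
contains
  ! Hypothetical helper: trimmed copy of s followed by a C null character.
  pure function to_c_string(s) result(cs)
    character(len=*), intent(in) :: s
    character(len=:), allocatable :: cs
    cs = trim(s) // c_null_char
  end function to_c_string
end module cstr_sketch

program demo_cstr
  use cstr_sketch
  implicit none
  character(len=:), allocatable :: sql

  sql = to_c_string("select * from 'data.csv.gz';")
  ! The Fortran length counts the terminator as one extra character.
  print *, "length including terminator:", len(sql)
  print *, "last character is null:", sql(len(sql):len(sql)) == achar(0)
end program demo_cstr
```

The resulting string could then be passed as the sql argument of duckdb_query in the same way as the hand-built string above.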

The recommended way to interact with result sets is using chunks and vectors. In DuckDB a chunk is defined as a dataset having a fixed number of rows. This number is configurable but is set at 2048 by default. Chunks can be extracted from result sets using the following functions:

  use, intrinsic :: iso_fortran_env, only : int64
  type(duckdb_data_chunk) :: chk ! chunk
  integer(kind=int64) :: i, nc
  type(duckdb_result) :: res ! result set

  nc = duckdb_result_chunk_count(res) ! number of chunks
  do i = 0_int64, nc - 1 ! chunk indices are zero-based
    chk = duckdb_result_get_chunk(res, i)
    ! do something with chunk
    ! ...
    call duckdb_destroy_data_chunk(chk)
  end do
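
To make the zero-based chunk indexing concrete, here is a small sketch in plain Fortran that does not call DuckDB. It assumes an idealized layout in which every chunk except the last holds the full 2048 rows; DuckDB does not guarantee this, and real code should take the size of each chunk from duckdb_data_chunk_get_size. The names chunk_sketch, chunk_count and chunk_rows are hypothetical.

```fortran
module chunk_sketch
  use, intrinsic :: iso_fortran_env, only : int64
  implicit none
  integer(int64), parameter :: vector_size = 2048_int64 ! default chunk size
contains
  ! Chunks needed for nrows rows if all chunks but the last are full (assumption).
  pure function chunk_count(nrows) result(nc)
    integer(int64), intent(in) :: nrows
    integer(int64) :: nc
    nc = (nrows + vector_size - 1_int64) / vector_size
  end function chunk_count

  ! Rows in the chunk with zero-based index i, under the same assumption.
  pure function chunk_rows(nrows, i) result(n)
    integer(int64), intent(in) :: nrows, i
    integer(int64) :: n
    n = min(vector_size, nrows - i * vector_size)
  end function chunk_rows
end module chunk_sketch

program demo_chunks
  use chunk_sketch
  implicit none
  integer(int64), parameter :: nrows = 5000_int64
  integer(int64) :: i, nc, total

  nc = chunk_count(nrows)
  total = 0_int64
  do i = 0_int64, nc - 1 ! zero-based, so the last index is nc - 1
    total = total + chunk_rows(nrows, i)
  end do
  print *, "chunks:", nc, "rows visited:", total
end program demo_chunks
```

For 5000 rows this visits three chunks of 2048, 2048 and 904 rows; looping up to nc instead of nc - 1 would request a chunk that does not exist.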

Vectors, in turn, are extracted from chunks. A DuckDB vector holds column-based data of a specific data type, so the matching Fortran type must be declared when extracting it. Consider the following dataset, which comprises 3 vectors, all of type int64.

column0  column1  column2
int64    int64    int64
1        2        3
4        5        6
7        8        9

The functions below outline one possible way to extract data from vectors. The data are returned as a c_ptr, which can be converted into a Fortran pointer without allocating additional memory. These pointers remain valid only while the chunk is alive; they are lost if the chunk is destroyed or a new chunk is extracted from the result set.
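
The conversion from a c_ptr to a Fortran pointer can be tried in isolation. The sketch below is plain Fortran and does not call DuckDB: a local array stands in for memory owned by the C library, and as_int64_array is a hypothetical helper written for this illustration. It shows that the Fortran pointer is a view of the same memory rather than a copy.

```fortran
module view_sketch
  use, intrinsic :: iso_c_binding, only : c_ptr, c_f_pointer, c_int64_t
  implicit none
contains
  ! Hypothetical helper: view n 64-bit integers at address raw as a Fortran array.
  function as_int64_array(raw, n) result(p)
    type(c_ptr), intent(in) :: raw
    integer, intent(in) :: n
    integer(c_int64_t), pointer :: p(:)
    call c_f_pointer(raw, p, [n])
  end function as_int64_array
end module view_sketch

program demo_view
  use, intrinsic :: iso_c_binding, only : c_loc, c_int64_t
  use view_sketch
  implicit none
  ! Stand-in for a column buffer owned by the C library (not real DuckDB data).
  integer(c_int64_t), target :: buffer(3)
  integer(c_int64_t), pointer :: col(:)

  buffer = [1_c_int64_t, 4_c_int64_t, 7_c_int64_t]
  col => as_int64_array(c_loc(buffer), 3)

  buffer(2) = 40_c_int64_t ! change the underlying memory
  print *, col             ! the view now holds 1, 40, 7: no copy was made
end program demo_view
```

Because the pointer only aliases the buffer, it becomes dangling as soon as the buffer goes away, which is why data must be copied out before a chunk is destroyed if it is needed later.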

The code below demonstrates how vector data pointers for multiple columns can be consumed in Fortran, using a derived type whose pointer component matches the data type of the result set. Note that DuckDB provides functions to check the validity of the data, which may include missing or NULL values.


type vectors
  integer(kind=int64), pointer, dimension(:) :: ptr
end type

type(duckdb_data_chunk) :: chk
type(vectors), allocatable, dimension(:) :: vecs
type(duckdb_vector) :: vec
integer(kind=int64) :: j, rows, cols
type(c_ptr) :: va !, vb
integer(kind=int64), pointer, dimension(:) :: a
integer(kind=int64), pointer, dimension(:,:) :: mat ! 2d array
integer :: i
rows = duckdb_data_chunk_get_size(chk)
cols = duckdb_data_chunk_get_column_count(chk)
allocate(vecs(cols))
do j = 0_int64, cols - 1
  vec = duckdb_data_chunk_get_vector(chk, j)
  va = duckdb_vector_get_data(vec)
  ! vb = duckdb_vector_get_validity(vec)
  call c_f_pointer(va, a, [rows])
  vecs(j+1)%ptr => a
  ! validity check omitted here; a per-row check could look like:
  ! do k = 0, rows - 1
  !   if (duckdb_validity_row_is_valid(vb, k)) then
  !     print *, a(k+1)
  !   else
  !     print *, "NULL"
  !   end if
  ! end do
end do

if (allocated(vecs)) then
  ! copy each column into a rows x cols matrix (3 x 3 for the dataset above)
  allocate(mat(rows, cols))
  do i = 1, int(cols)
    mat(:, i) = vecs(i)%ptr
  end do

  print *, mat

  deallocate(mat)

end if

deallocate(vecs)

call duckdb_destroy_data_chunk(chk)
call duckdb_destroy_result(res)
call duckdb_disconnect(con)
call duckdb_close(db)
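
The validity check that is skipped in the code above relies on a bitmask: in the DuckDB C API, the validity data is an array of 64-bit entries with one bit per row, and a set bit means the row is valid. The sketch below reproduces that lookup in plain Fortran on a hand-made mask, without calling DuckDB. The names validity_sketch and row_is_valid are hypothetical, and the case of a missing mask (which in the C API means that all rows are valid) is not handled here.

```fortran
module validity_sketch
  use, intrinsic :: iso_fortran_env, only : int64
  implicit none
contains
  ! True if the zero-based row has its bit set in the mask (64 rows per entry).
  pure logical function row_is_valid(mask, row)
    integer(int64), intent(in) :: mask(:)
    integer(int64), intent(in) :: row
    row_is_valid = btest(mask(row / 64 + 1), mod(row, 64_int64))
  end function row_is_valid
end module validity_sketch

program demo_validity
  use validity_sketch
  implicit none
  integer(int64), parameter :: column(3) = [1_int64, 4_int64, 7_int64]
  integer(int64) :: mask(1), k

  mask(1) = 5_int64 ! binary 101: rows 0 and 2 are valid, row 1 is NULL
  do k = 0_int64, 2_int64
    if (row_is_valid(mask, k)) then
      print *, column(k + 1)
    else
      print *, "NULL"
    end if
  end do
end program demo_validity
```

In real code the wrapped duckdb_validity_row_is_valid function should be preferred over a hand-written lookup; the sketch is only meant to show what that check does.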

An example of extracting data from a csv file is provided in the example folder.

Implementation status

Setup and test

Requires the DuckDB C library, which can be downloaded from https://github.com/duckdb/duckdb/releases. On Arch Linux you can install the libraries and headers with "yay duckdb-bin", which also includes the CLI binary.

Minimum DuckDB version required: 0.8

Test with

fpm test

To include this in your own projects, add this dependency to your fpm.toml:

[dependencies]
duckdb.git = "https://github.com/freevryheid/duckdb"