KxSystems / arrowkdb

kdb+ integration with Apache Arrow and Parquet
https://code.kx.com/q/interfaces
Apache License 2.0

arrowkdb


Introduction

This interface allows kdb+ users to read and write Apache Arrow data, including data stored in Apache Parquet files.

This is part of the Fusion for kdb+ interface collection.

New to kdb+ ?

Kdb+ is the world's fastest time-series database, optimized for ingesting, analyzing and storing massive amounts of structured data. To get started with kdb+, please visit https://code.kx.com/q/ for downloads and developer information. For general information, visit https://kx.com/

New to Apache Arrow?

Apache Arrow is a software development platform for building high-performance applications that process and transport large data sets. It is designed to improve both the performance of analytical algorithms and the efficiency of moving data from one system (or programming language) to another.

A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in memory. This data format has a rich data type system (including nested data types) designed to support the needs of analytic database systems, data frame libraries, and more.

What is the difference between Apache Arrow and Apache Parquet?

Parquet is a storage format designed for maximum space efficiency, using advanced compression and encoding techniques. It is ideal for minimizing disk usage when storing gigabytes of data or more. This efficiency comes at the cost of relatively expensive reading into memory, as Parquet data cannot be operated on directly but must be decoded in large chunks.

Conversely, Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is not compressed but laid out in natural format for the CPU, so that data can be accessed at arbitrary places at full speed. Therefore, Arrow and Parquet complement each other with Arrow being used as the in-memory data structure for deserializing Parquet data.

Installation

Requirements

Warning: If using the packaged version of arrowkdb, you should install version 9.0.0 of Apache Arrow.

Third-party library installation

Linux

Follow the instructions here to install libarrow-dev and libparquet-dev from Apache's APT or Yum repositories.

Note: If using the packaged version of arrowkdb you should install version 9.0.0 of both:

sudo apt install -y -V libarrow-dev=9.0.0-1
sudo apt install -y -V libparquet-dev=9.0.0-1
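
Because the packaged arrowkdb expects Arrow 9.0.0, it can be useful to confirm the pinned version after installation. The following is a minimal sketch, not part of arrowkdb: `is_required_arrow_version` is a hypothetical helper name, and the usage comment assumes a Debian/Ubuntu system where `dpkg-query` is available.

```shell
#!/bin/sh
# Hypothetical helper (not part of arrowkdb): succeeds when a version string
# such as "9.0.0" or "9.0.0-1" refers to the required Arrow 9.0.0 release.
is_required_arrow_version() {
    case "$1" in
        9.0.0|9.0.0-*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Usage on Debian/Ubuntu (assumes dpkg-query is available):
#   ver=$(dpkg-query -W -f='${Version}' libarrow-dev)
#   is_required_arrow_version "$ver" || echo "expected Arrow 9.0.0, found $ver" >&2
```

The check only compares version strings, so the same function could be fed from any other source that reports the installed Arrow version.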

macOS

Follow the instructions here to install apache-arrow using Homebrew.

Windows

On Windows it is necessary to build Arrow from source. Full details are provided here but the basic steps are as follows.

From a Visual Studio command prompt, clone the Arrow source from GitHub:

C:\Git> git clone https://github.com/apache/arrow.git
C:\Git> cd arrow

Switch to the 9.0.0 tag:

C:\Git\arrow> git checkout refs/tags/apache-arrow-9.0.0 --
C:\Git\arrow> cd cpp

Create an install directory and set an environment variable to this directory (substituting the correct absolute path as appropriate). This environment variable is used again later when building arrowkdb:

C:\Git\arrow\cpp> mkdir install
C:\Git\arrow\cpp> set ARROW_INSTALL=C:\Git\arrow\cpp\install

Create the CMake build directory and generate the build files (this will default to using the Visual Studio CMake generator when run from a VS command prompt):

C:\Git\arrow\cpp> mkdir build
C:\Git\arrow\cpp> cd build
C:\Git\arrow\cpp\build> cmake .. -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DARROW_WITH_BROTLI=ON -DARROW_BUILD_STATIC=OFF -DARROW_COMPUTE=OFF -DARROW_DEPENDENCY_USE_SHARED=OFF -DCMAKE_INSTALL_PREFIX=%ARROW_INSTALL%

Build and install Arrow:

C:\Git\arrow\cpp\build> cmake --build . --config Release
C:\Git\arrow\cpp\build> cmake --build . --config Release --target install

Copy the Arrow, Parquet and compression DLLs to the %QHOME%\w64 directory:

C:\Git\arrow\cpp\build> copy release\Release\*.dll %QHOME%\w64

Installing a release

We recommend installing this interface from a release, using the following steps:

  1. Ensure you have downloaded and installed the Arrow C++ API by following the instructions above.
  2. Download a release for your system architecture.
  3. Install script arrowkdb.q to $QHOME, and binary file lib/arrowkdb.(so|dll) to $QHOME/[mlw](64), by executing the following from the Release directory:
## Linux/macOS
chmod +x install.sh && ./install.sh

## Windows
install.bat
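
After running the install script, the expected file layout can be checked by hand. The following is a minimal sketch under the layout described in step 3 (`arrowkdb.q` in `$QHOME` and the binary in the platform directory); `check_arrowkdb_install` is a hypothetical helper, not something shipped with the release.

```shell
#!/bin/sh
# Hypothetical post-install check (not shipped with arrowkdb).
# Arguments: $1 = QHOME directory, $2 = platform directory (l64, m64 or w64).
check_arrowkdb_install() {
    qhome=$1
    platform=$2
    # The binary is arrowkdb.dll on Windows and arrowkdb.so elsewhere.
    case "$platform" in
        w64) lib="arrowkdb.dll" ;;
        *)   lib="arrowkdb.so" ;;
    esac
    [ -f "$qhome/arrowkdb.q" ] || { echo "missing $qhome/arrowkdb.q" >&2; return 1; }
    [ -f "$qhome/$platform/$lib" ] || { echo "missing $qhome/$platform/$lib" >&2; return 1; }
    echo "arrowkdb files found under $qhome"
}

# Usage on Linux:
#   check_arrowkdb_install "$QHOME" l64
```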

Building and installing from source

In order to successfully build and install this interface from source, the following environment variables must be set:

  1. ARROW_INSTALL = Location of the Arrow C++ API release (only required if Arrow is not installed globally on the system, e.g. on Windows where Arrow was built from source)
  2. QHOME = Q installation directory (directory containing q.k)
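
Before generating the build files, the two variables above can be sanity-checked. This is a minimal sketch for Linux/macOS shells, assuming only what the list states: `QHOME` must contain `q.k`, and `ARROW_INSTALL` is optional. The function name `check_build_env` is hypothetical and not part of the arrowkdb build scripts.

```shell
#!/bin/sh
# Hypothetical pre-build check (not part of the arrowkdb build scripts).
check_build_env() {
    # QHOME is mandatory and must be the directory containing q.k.
    if [ -z "${QHOME:-}" ] || [ ! -f "$QHOME/q.k" ]; then
        echo "QHOME must point to the directory containing q.k" >&2
        return 1
    fi
    # ARROW_INSTALL is optional; validate it only when it is set.
    if [ -n "${ARROW_INSTALL:-}" ] && [ ! -d "$ARROW_INSTALL" ]; then
        echo "ARROW_INSTALL is set but is not a directory: $ARROW_INSTALL" >&2
        return 1
    fi
    return 0
}

# Usage:
#   check_build_env && echo "environment looks ready for the CMake build"
```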

From a shell prompt (on Linux/macOS) or Visual Studio command prompt (on Windows), clone the arrowkdb source from GitHub:

git clone https://github.com/KxSystems/arrowkdb.git
cd arrowkdb

Create the CMake build directory and generate the build files (this will use the system's default CMake generator):

mkdir build
cd build

## Linux/macOS
cmake ..

## Windows (using the Arrow installation which was built from source as above)
cmake .. -DARROW_INSTALL=%ARROW_INSTALL%

Start the build:

cmake --build . --config Release

Create the install package and deploy:

cmake --build . --config Release --target install

Documentation

Documentation outlining the functionality available for this interface can be found in the docs folder.

Status

The arrowkdb interface is provided here under an Apache 2.0 license.

If you find issues with the interface or have feature requests, please consider raising an issue.

If you wish to contribute to this project, please follow the contribution guide.