artemis-analytics / artemis

Apache License 2.0


Artemis -- Administrative data science framework powered by Apache Arrow™.

The Artemis data science framework is a record-batch-based data processing framework, powered by the Apache Arrow open source data format standard, for the production of high-quality administrative data for analytical purposes. Statistical organizations are shifting to an administrative-data-first approach for producing official statistics. The production of high-quality, fit-for-use administrative data must preserve the raw state of the data throughout the data life cycle (ingestion, integration, management, processing, and analysis). Data formats and production frameworks must support changes to analytical workloads, which have different data access patterns than traditional survey data; efficient iteration on the data at any stage in the data life cycle; and statistical tools for continuous data quality and fit-for-use assessment. At its core, the Artemis data science framework relies on the well-defined, cross-language Apache Arrow data format, which accelerates analytical processing of the data on modern computing architectures.

Artemis framework primary objectives:

Getting Started

The Artemis project relies on conda as an environment manager and build tool. The project has one external dependency that needs to be built: the fixed-width file reader (stcdatascience/fwfr.git).

Development environment

mkdir <workspace>
cd <workspace>
git clone
git clone
conda env create -f artemis/environment.yaml
conda activate artemis-dev
cd fwfr
./ --source
cd ../artemis
python build_ext --inplace install
python -m unittest

Framework components

The primary objective of Artemis is the production of datasets that use memory and CPU resources efficiently to accelerate analytical processing of data on a single-node, multi-core machine. The data processing can be replicated on independent parts of the dataset in parallel across multiple cores and/or across multiple nodes of a computing cluster in a batch-oriented fashion. The resulting dataset consists of a collection of output files, each organized as a collection of record batches. Every record batch in a file holds the same number of records and shares the same fixed schema. The collection of record batches in a file can be considered a table. The columnar data structure is highly compressible and retains the schema as well as the entire payload in a given file. Arrow supports both streaming and random-access reads of record batches, resulting in efficient and effective data management.

The control flow from the ingestion of raw data to the production of Arrow datasets proceeds as follows.
The raw dataset consists of one or more datums, such as files, database tables, or any data partition. In order to organize the data into collections of a fixed number of record batches to manage the data in-memory, each datum is separated into chunks of fixed size in bytes. The datum is read into native Arrow buffers directly and all processing is performed on these buffers. The in-memory native Arrow buffers are collected and organized as collections of record batches, through data conversion algorithms, in order to build new on-disk datasets given the stream of record batches.
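The chunking step can be sketched in a few lines of plain Python. This is an illustrative assumption about how a datum might be split, not the Artemis implementation: `iter_chunks` and the 32-byte chunk size are hypothetical, and an in-memory byte stream stands in for a file.

```python
import io


def iter_chunks(stream, chunk_size):
    """Yield successive byte chunks of at most chunk_size from a binary stream."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# A 100-byte stand-in for one datum (for example, a raw file).
datum = io.BytesIO(b"0123456789" * 10)
chunks = list(iter_chunks(datum, 32))
sizes = [len(c) for c in chunks]
```

Only the last chunk may be shorter than the requested size. A reader for record-oriented data would likely also need to align chunk edges with record boundaries, which this sketch does not attempt.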

To support arbitrary, user-defined data transformations, the Artemis framework defines a common set of base classes for user-defined Chains. A Chain represents a business process as an ordered collection of algorithms and tools that transform data. User-defined algorithms inherit methods that are invoked by the Artemis application, and the Chains are managed by a Steering algorithm. Artemis manages the data processing Event Loop, provides data to the algorithms, and handles data serialization and job finalization.
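The relationship between algorithms, Chains, Steering, and the event loop can be sketched as follows. Every class and method name here (`Algorithm`, `Chain`, `Steering`, `initialize`, `execute`, `finalize`, `event_loop`) is hypothetical and chosen for illustration; the actual Artemis base classes and their interfaces may differ.

```python
class Algorithm:
    """Hypothetical base class; user algorithms override execute()."""

    def __init__(self, name):
        self.name = name

    def initialize(self):
        pass

    def execute(self, payload):
        raise NotImplementedError

    def finalize(self):
        pass


class Scale(Algorithm):
    """Example user-defined algorithm: multiply every value by a factor."""

    def __init__(self, name, factor):
        super().__init__(name)
        self.factor = factor

    def execute(self, payload):
        return [x * self.factor for x in payload]


class Offset(Algorithm):
    """Example user-defined algorithm: add a constant to every value."""

    def __init__(self, name, offset):
        super().__init__(name)
        self.offset = offset

    def execute(self, payload):
        return [x + self.offset for x in payload]


class Chain:
    """An ordered collection of algorithms applied one after another."""

    def __init__(self, name, algorithms):
        self.name = name
        self.algorithms = list(algorithms)

    def run(self, payload):
        for algo in self.algorithms:
            payload = algo.execute(payload)
        return payload


class Steering:
    """Runs every chain on each payload handed over by the event loop."""

    def __init__(self, chains):
        self.chains = list(chains)

    def initialize(self):
        for chain in self.chains:
            for algo in chain.algorithms:
                algo.initialize()

    def execute(self, payload):
        return {chain.name: chain.run(payload) for chain in self.chains}

    def finalize(self):
        for chain in self.chains:
            for algo in chain.algorithms:
                algo.finalize()


def event_loop(batches, steering):
    """Feed each batch to the steering object, then finalize the job."""
    steering.initialize()
    results = [steering.execute(batch) for batch in batches]
    steering.finalize()
    return results


chain = Chain("scaled_then_shifted", [Scale("scale", 2), Offset("offset", 1)])
results = event_loop([[1, 2, 3], [4, 5, 6]], Steering([chain]))
```

In this sketch the user only writes the `execute` bodies and declares the order of algorithms in a chain; iteration over the data and the initialize/finalize calls stay with the framework side.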

Data formats




In order to run Artemis, a protocol buffer message must be defined and stored, conforming to the artemis.proto metadata model, defined in artemis/io/protobuf/artemis.proto.

Build & Deploy

To build Artemis, cd to the root of the artemis repository. Follow the instructions below.

conda env create -f environment.yml
conda activate artemis-dev
git clone "FWFR GIT REPO"
conda install conda-build
conda build conda-recipes
mv "PATH TO CONDA"/envs/artemis-dev/conda-bld/broken/artemis-"VERSION".tar.bz2 ./
conda deactivate
bash release/ -e artemis-dev -n artemis-pack -p artemis-"VERSION" -r "PATH TO ARTEMIS REPO"

This will result in a package called "artemis-pack.tar.gz". You can move this to anywhere you wish to deploy.

You can install the created package file with the "" script.


Artemis Release and Tag Management

During a new Artemis release, the commit that will be released needs to be tagged with the new version tag, of the format X.Y.Z.

It is important to update the file with the new Artemis version.

Building the Protobuf

Artemis metadata is defined in io/protobuf/artemis.proto. Histograms are an important component of the metadata. They are provided by the physt package, which includes I/O functionality to and from protobuf. However, the proto file is not distributed with the package, so building the protobuf requires a copy of the histogram.proto file.

To build (from the io/protobuf directory)

protoc -I=./ --python_out=./ ./artemis.proto

Artemis Job Example

An example job is available in examples/. It extracts a dataset schema from Excel, generates synthetic data, runs data analytics algorithms, and outputs distributions for data profiling. Ensure that Artemis is built, then run the following command.

python examples/ --location ./examples/data/example_product.xlsx

The example schema is located in examples/data/example_product.xlsx. To create new dataset schemas please see instructions in artemis/tools/Excel_template/