Artemis -- Administrative data science framework powered by Apache Arrow™.
The Artemis data science framework is a record-batch-based data processing framework, powered by the Apache Arrow open-source data format standard, for the production of high-quality administrative data for analytical purposes. Statistical organizations are shifting to an administrative-data-first approach for producing official statistics. The production of high-quality, fit-for-use administrative data must preserve the raw state of the data throughout the data life cycle (ingestion, integration, management, processing, and analysis). Data formats and production frameworks must support changes to analytical workloads, which have different data access patterns than traditional survey data; efficient iteration on the data at any stage of the data life cycle; and statistical tools for continuous data quality and fit-for-use assessment. At its core, the Artemis data science framework relies on the well-defined, cross-language Apache Arrow data format, which accelerates analytical processing of the data on modern computing architectures.
The primary objectives of the Artemis framework are described below.
The Artemis project relies on conda as an environment manager and build tool. The project has one external dependency, the fixed-width file reader (stcdatascience/fwfr.git), which must be built from source.
mkdir <workspace>
cd <workspace>
git clone https://github.com/ryanmwhitephd/artemis.git
git clone https://github.com/ke-noel/fwfr.git
conda env create -f artemis/environment.yaml
conda activate artemis-dev
cd fwfr
./install.sh --source
cd ../artemis
python setup.py build_ext --inplace install
python -m unittest
The primary objective of Artemis is the production of datasets that use memory and CPU resources efficiently to accelerate analytical processing of data on a single-node, multi-core machine. Data processing can be replicated on independent parts of the dataset in parallel across multiple cores and/or across multiple nodes of a computing cluster in a batch-oriented fashion. The resulting dataset consists of a collection of output files, each organized as a collection of record batches. Each record batch contains the same number of records and conforms to a fixed, equivalent schema, so the collection of record batches in a file can be considered a table. The columnar data structure is highly compressible and retains the schema as well as the entire payload in a given file. Arrow supports both streaming and random access reads of record batches, resulting in efficient and effective data management.
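As a minimal sketch (not taken from the Artemis code base, and assuming only the standard pyarrow API), the following shows how record batches sharing a schema form a table, and how the Arrow IPC file format supports both streaming writes and random access reads of individual batches; the field names and file name are illustrative:

import pyarrow as pa

# Two record batches with the same fixed schema (illustrative fields).
schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])
batches = [
    pa.RecordBatch.from_arrays([pa.array([1, 2, 3]), pa.array([0.1, 0.2, 0.3])], schema=schema),
    pa.RecordBatch.from_arrays([pa.array([4, 5, 6]), pa.array([0.4, 0.5, 0.6])], schema=schema),
]

# A collection of record batches with an equivalent schema can be viewed as a table.
table = pa.Table.from_batches(batches)

# Write the batches to an Arrow IPC file.
with pa.OSFile("example.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for batch in batches:
            writer.write_batch(batch)

# Read back with random access to individual record batches.
with pa.memory_map("example.arrow", "rb") as source:
    reader = pa.ipc.open_file(source)
    first_batch = reader.get_batch(0)
    print(reader.num_record_batches, first_batch.num_rows)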
The control flow from the ingestion of raw data to the production of Arrow datasets proceeds as follows.
The raw dataset consists of one or more datums, such as files, database tables, or any other data partition. To organize the data into collections of a fixed number of record batches and manage it in memory, each datum is separated into chunks of a fixed size in bytes. Each datum is read directly into native Arrow buffers, and all processing is performed on these buffers. The in-memory Arrow buffers are collected and organized into collections of record batches through data conversion algorithms, and new on-disk datasets are built from the resulting stream of record batches.
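The snippet below is a hedged illustration of this chunked-read pattern using pyarrow's CSV streaming reader rather than the Artemis reader itself; the file names and block size are illustrative only:

import pyarrow as pa
import pyarrow.csv as pv
import pyarrow.ipc as ipc

# Consume the input datum in fixed-size byte blocks; each block becomes a
# record batch, and the stream of batches is written out as an Arrow dataset.
read_opts = pv.ReadOptions(block_size=1 << 20)  # ~1 MiB chunks

reader = pv.open_csv("input_datum.csv", read_options=read_opts)
with pa.OSFile("output_dataset.arrow", "wb") as sink:
    with ipc.new_file(sink, reader.schema) as writer:
        for batch in reader:
            writer.write_batch(batch)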
To support arbitrary, user-defined data transformations, the Artemis framework defines a common set of base classes for user-defined Chains, which represent business processes as an ordered collection of algorithms and tools that transform the data. User-defined algorithms inherit methods that are invoked by the Artemis application, and the Chains are managed by a Steering algorithm. Artemis manages the data processing Event Loop, provides data to the algorithms, and handles data serialization and job finalization.
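The sketch below is purely illustrative of this pattern; the class and method names (initialize, execute, finalize, Steering) are hypothetical placeholders and are not the actual Artemis base classes:

import pyarrow.compute as pc

class MyFilterAlgo:
    """A user-defined algorithm: receives a record batch and returns a new one."""

    def initialize(self):
        # Called once before the event loop starts.
        pass

    def execute(self, batch):
        # Transform one record batch; here, keep only rows where "value" > 0.
        mask = pc.greater(batch.column("value"), 0.0)
        return batch.filter(mask)

    def finalize(self):
        # Called once after the event loop ends.
        pass

class Steering:
    """Invokes the algorithms of a chain, in order, on each record batch."""

    def __init__(self, chain):
        self.chain = chain  # an ordered list of algorithm instances

    def run(self, batches):
        for algo in self.chain:
            algo.initialize()
        for batch in batches:
            for algo in self.chain:
                batch = algo.execute(batch)
        for algo in self.chain:
            algo.finalize()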
Data formats
Components
Inputs
Outputs
In order to run Artemis, a protocol buffer message must be defined and stored, conforming to the artemis.proto metadata model defined in artemis/io/protobuf/artemis.proto.
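As a rough, hedged sketch of how such a message might be prepared: the message name JobConfig below is a placeholder, and the import assumes the generated module lives alongside the .proto file; consult artemis.proto for the actual message types defined by the metadata model.

from artemis.io.protobuf import artemis_pb2   # module generated from artemis.proto

msg = artemis_pb2.JobConfig()                 # hypothetical message name
# ... populate the fields required by your job according to the metadata model ...
with open("my_job_config.dat", "wb") as f:
    f.write(msg.SerializeToString())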
To build Artemis, cd to the root of the artemis repository. Follow the instructions below.
conda env create -f environment.yml
conda activate artemis-dev
git clone "FWFR GIT REPO"
conda install conda-build
conda build conda-recipes
mv "PATH TO CONDA"/envs/artemis-dev/conda-bld/broken/artemis-"VERSION".tar.bz2 ./
conda deactivate
bash release/package.sh -e artemis-dev -n artemis-pack -p artemis-"VERSION" -r "PATH TO ARTEMIS REPO"
This will result in a package called "artemis-pack.tar.gz", which you can move to wherever you wish to deploy. Install the package with the "deploy.sh" script.
bash deploy.sh -e "NAME OF CONDA ENV TO CREATE" -n "NAME OF PACKAGE FILE" -p "NAME OF PACKAGE"
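For example (the environment, package file, and package names below are purely illustrative):
bash deploy.sh -e artemis-prod -n artemis-pack.tar.gz -p artemis-0.1.0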
When making a new Artemis release, tag the commit to be released with the new version tag, in the format X.Y.Z.
It is also important to update setup.py with the new Artemis version.
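For example, assuming version 0.1.0 (illustrative only), the release commit can be tagged and pushed with standard git commands:
git tag -a 0.1.0 -m "Artemis release 0.1.0"
git push origin 0.1.0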
Artemis metadata is defined in io/protobuf/artemis.proto. An important component of the metadata is histograms. Histograms are provided by the physt package, which includes I/O functionality to/from protobuf. However, the proto file is not distributed with the package, so the Artemis protobuf must be built with a copy of the histogram.proto definition.
To build (from the io/protobuf directory)
protoc -I=./ --python_out=./ ./artemis.proto
An example job is available in examples/distributed_example_2.py. It extracts a dataset schema from Excel, generates synthetic data, runs data analytics algorithms, and outputs distributions for data profiling. Ensure that Artemis is built, then run the following command.
python examples/distributed_example_2.py --location ./examples/data/example_product.xlsx
The example schema is located in examples/data/example_product.xlsx. To create new dataset schemas, please see the instructions in artemis/tools/Excel_template/README.md.