Cubert is a fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop.
Cubert documentation is hosted on GitHub.
Cubert Users Google Group: cubert-users. Email: cubert-users@googlegroups.com
Cubert is ideally suited for the following application domains:
Statistical Calculations, Joins and Aggregations
Cubert introduces a new model of computation that allows users to organize data in a format that is ideally suited for scalable execution of subsequent query processing operators, and a set of algorithmically-efficient operators (MeshJoin and CUBE) that exploit the organization to provide significantly improved CPU and resource utilization compared to existing solutions.
Cubes and Grouping Set Aggregations
The workhorse is the new CUBE operator, which can efficiently (in both CPU and memory) compute additive, non-additive (e.g. Count Distinct) and exact percentile rank (e.g. Median) statistics. It can also roll up inner dimensions on the fly and compute multiple measures within a single job.
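To make the idea of grouping set aggregation concrete, here is a minimal plain-Python sketch. It is not Cubert's CUBE implementation; the function name `grouping_sets_cube` and the sample columns (`country`, `device`, `member`) are assumptions made for illustration. It computes a Count Distinct measure for every subset of the dimensions in a single pass over the rows.

```python
from collections import defaultdict
from itertools import combinations


def grouping_sets_cube(rows, dims, member_key):
    """Count distinct members for every subset of `dims` (a full cube).

    Illustrative only: a naive in-memory version, not Cubert's algorithm.
    """
    # Every subset of the dimensions is one grouping set, e.g.
    # (), ("country",), ("device",), ("country", "device").
    grouping_sets = [
        subset
        for size in range(len(dims) + 1)
        for subset in combinations(dims, size)
    ]
    # One set of distinct members per (grouping set, dimension values) cell.
    cells = defaultdict(set)
    for row in rows:
        for gset in grouping_sets:
            key = (gset, tuple(row[d] for d in gset))
            cells[key].add(row[member_key])
    return {key: len(members) for key, members in cells.items()}


# Hypothetical sample data.
rows = [
    {"country": "US", "device": "web", "member": 1},
    {"country": "US", "device": "app", "member": 1},
    {"country": "US", "device": "web", "member": 2},
    {"country": "IN", "device": "app", "member": 3},
]
result = grouping_sets_cube(rows, ("country", "device"), "member")
```

In this sample, member 1 appears under both `web` and `app`, so the distinct count for `US` cannot be obtained by adding the two per-device counts. That is what makes Count Distinct a non-additive measure.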
Time range calculation and Incremental computations
Cubert primitives are especially suited for reporting workflows that employ a computation pattern that is both regular and repetitive, allowing for efficiency gains from partial result caching and incremental processing.
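The pattern of partial result caching can be sketched in plain Python as follows. This is only an illustration of the general idea, not Cubert code; the class `IncrementalReport`, its methods, and the sample data are hypothetical. Each day's raw events are aggregated once into a small partial result, and any time range is then answered by merging cached partials instead of rescanning raw data.

```python
from collections import Counter


class IncrementalReport:
    """Cache one partial aggregate per day and merge them on demand."""

    def __init__(self):
        self.partials = {}  # day -> Counter of page views (cached partial)
        self.scans = 0      # number of times raw events were aggregated

    def add_day(self, day, events):
        # Aggregate a day's raw events only once; later calls hit the cache.
        if day not in self.partials:
            self.partials[day] = Counter(events)
            self.scans += 1

    def range_total(self, days):
        # Merge cached partials; no raw data is touched here.
        total = Counter()
        for day in days:
            total.update(self.partials[day])
        return total


# Hypothetical raw data: page names viewed on each day.
raw = {
    "d1": ["home", "home", "jobs"],
    "d2": ["home"],
    "d3": ["jobs", "jobs"],
}
report = IncrementalReport()
for day, events in raw.items():
    report.add_day(day, events)

window_1 = report.range_total(["d1", "d2"])
window_2 = report.range_total(["d2", "d3"])
```

Merging partials this simply is only valid for additive measures such as counts and sums; non-additive measures need richer partial state.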
Graph computations
Cubert provides a novel sparse matrix multiplication algorithm that is best suited for analytics with large-scale graphs.
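For readers unfamiliar with why sparse matrix multiplication matters for graphs, the sketch below shows a generic textbook formulation in plain Python. It is not Cubert's algorithm, and the function name `sparse_matmul` and the sample graph are assumptions. A graph is stored as a dictionary of non-zero adjacency entries, and multiplying the matrix by itself counts the two-hop paths between nodes.

```python
from collections import defaultdict


def sparse_matmul(a, b):
    """Multiply two sparse matrices stored as {(row, col): value} dicts."""
    # Index b by row so each entry of a only meets matching entries of b.
    b_by_row = defaultdict(list)
    for (k, j), value in b.items():
        b_by_row[k].append((j, value))

    out = defaultdict(int)
    for (i, k), value in a.items():
        # Join a's column index with b's row index, then aggregate.
        for j, other in b_by_row.get(k, ()):
            out[(i, j)] += value * other
    return dict(out)


# Hypothetical directed graph: 0->1, 1->2, 0->2, 2->3.
edges = {(0, 1): 1, (1, 2): 1, (0, 2): 1, (2, 3): 1}
two_hop = sparse_matmul(edges, edges)
```

The computation is essentially a join on the shared index followed by a grouped sum, which is why it maps naturally onto join and aggregation operators.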
When performance or resources are a matter of concern
Cubert Script is a developer-friendly language that takes the hints, guesswork and surprises out of running a script. The script gives developers complete control over the execution plan (without resorting to low-level programming!), and is highly extensible: new functions, aggregators and even operators can be added.
Cubert Script is a physical script in which we explicitly define the operators at the Mappers, Reducers and Combiners for the different jobs. The following is an example of the Word Count problem written in Cubert Script.
JOB "word count job"
    REDUCERS 10;
    MAP {
        // load the input data set as a TEXT file
        input = LOAD "$CUBERT_HOME/examples/words.txt" USING TEXT("schema": "STRING word");
        // add a column to each tuple
        with_count = FROM input GENERATE word, 1 AS count;
    }
    // shuffle the data and also invoke combiner to aggregate on map-side
    SHUFFLE with_count PARTITIONED ON word AGGREGATES COUNT(count) AS count;
    REDUCE {
        // at the reducers, sum the counts for each word
        output = GROUP with_count BY word AGGREGATES SUM(count) AS count;
    }
    // store the output using TEXT format
    STORE output INTO "output" USING TEXT();
END
While the Cubert Script code above is already a very concise representation of the Word Count problem, the idiomatic way of writing it in Cubert is even more concise (and a lot faster)!
JOB "idiomatic word count program (even more concise!)"
    REDUCERS 10;
    MAP {
        input = LOAD "$CUBERT_HOME/examples/words.txt" USING TEXT("schema": "STRING word");
    }
    CUBE input BY word AGGREGATES COUNT(word) AS count GROUPING SETS (word);
    STORE input INTO "output" USING TEXT();
END
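The data flow of the first, explicit word count job (map, map-side combine, shuffle, reduce) can be simulated in plain Python. This is a rough sketch of the general MapReduce pattern under simplifying assumptions, not a description of how Cubert executes the script; the function names and the sample input splits are hypothetical.

```python
import zlib
from collections import Counter, defaultdict


def map_with_combiner(words):
    # Map side: count occurrences within one mapper's input split,
    # so each word is shuffled once per mapper rather than once per occurrence.
    return Counter(words)


def shuffle(mapper_outputs, num_reducers):
    # Partition on the word so all partial counts for a word meet at one reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for partial in mapper_outputs:
        for word, count in partial.items():
            reducer = zlib.crc32(word.encode("utf-8")) % num_reducers
            partitions[reducer][word].append(count)
    return partitions


def reduce_partition(partition):
    # Reduce side: sum the partial counts for each word.
    return {word: sum(counts) for word, counts in partition.items()}


# Hypothetical input, already divided into three mapper splits.
splits = [["a", "b", "a"], ["b", "c"], ["a"]]
mapper_outputs = [map_with_combiner(split) for split in splits]
partitions = shuffle(mapper_outputs, num_reducers=10)

word_counts = {}
for partition in partitions:
    word_counts.update(reduce_partition(partition))
```

The map side counts and the reduce side sums, mirroring the COUNT in the SHUFFLE statement and the SUM in the REDUCE block: once occurrences have been pre-counted per mapper, the reducers must add those partial counts rather than count them again.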
Download or clone the repository (say, into /path/to/cubert) and run the following command:
$ cd /path/to/cubert
$ ./gradlew
This will create a folder /path/to/cubert/release, which is what we need to run Cubert. This folder can be copied to a Hadoop cluster gateway.
To run Cubert, first make sure that Hadoop is installed and the HADOOP_HOME environment variable points to the Hadoop installation. Set the CUBERT_HOME environment variable to the release folder (note: CUBERT_HOME points to the release folder, not the repository root).
$ export CUBERT_HOME=/path/to/cubert/release
$ $CUBERT_HOME/bin/cubert -h
Using HADOOP_CLASSPATH=:/path/to/cubert/release/lib/*
usage: ScriptExecutor <cubert script file> [options]
-c,--compile stop after compilation
-D <property=value> use value for given property
-d,--debug print debuging information
...
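When Cubert is launched from another program, the environment requirements above can be checked before the launcher is invoked. The helper below is a hypothetical sketch, not part of Cubert: the function name `cubert_command` and the script name `wordcount.script` are assumptions. It only relies on what is shown above, namely the HADOOP_HOME and CUBERT_HOME variables, the bin/cubert launcher, and the `<cubert script file> [options]` usage.

```python
import os
import posixpath


def cubert_command(script, options=(), env=None):
    """Build the argument list for the Cubert launcher, or fail early.

    Hypothetical helper: it validates the environment and returns the
    command; it does not run anything.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("HADOOP_HOME", "CUBERT_HOME") if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    # CUBERT_HOME is the release folder, so the launcher lives in its bin/.
    # posixpath is used because the gateway is assumed to be a Unix host.
    launcher = posixpath.join(env["CUBERT_HOME"], "bin", "cubert")
    return [launcher, script, *options]


# Example with an explicit, hypothetical environment.
example_env = {
    "HADOOP_HOME": "/path/to/hadoop",
    "CUBERT_HOME": "/path/to/cubert/release",
}
command = cubert_command("wordcount.script", options=["-c"], env=example_env)
```

The resulting list can be passed to `subprocess.run` on a host where Hadoop and the release folder are actually present.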
Example Scripts: Sample scripts are available in the $CUBERT_HOME/examples folder.
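Cubert provides a rich suite of operators for processing data. These include the operators that appear above: LOAD and STORE for input and output, FROM ... GENERATE for transformations, SHUFFLE, GROUP BY and CUBE for aggregation, and MeshJoin for joins.
The Users Guide and Javadoc are available at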
http://linkedin.github.io/Cubert