Cebes - The integrated framework for Data Science at Scale
To build Cebes, you need JDK 1.8+, sbt and a SQL database.
By default, Cebes uses the MariaDB connector and should be compatible with most SQL databases.
The following commands create the databases and users with default credentials for Cebes. You only need to run them once.
mysql.server start
./cebes-http-server/script/setup-db.sh
Note that the usernames and passwords in setup-db.sh are default values.
For production settings, you may want to change them to something more secure.
To run the tests, first create bin/env.sh from the provided example, then run the test script:
cp bin/env.sh.example bin/env.sh
bin/test-all.sh
A test coverage report can be exported with the --coverage option:
bin/test-all.sh --coverage=true # or bin/test-all.sh -c=true
Some unit tests will be skipped; these are the tests that require AWS.
If you have an AWS account, you can enable them by setting the CEBES_TEST_AWS_ACCESSKEY and CEBES_TEST_AWS_SECRETKEY variables in bin/env.sh.
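As a minimal sketch, the two variables could be set as follows. This assumes bin/env.sh is a plain shell file that the test script sources and that `export` assignments are the expected form; check bin/env.sh.example for the actual layout. The values are placeholders, not real credentials.

```shell
# Hypothetical excerpt of bin/env.sh (assumed to be sourced by bin/test-all.sh).
# Replace the placeholder values with your own AWS credentials.
export CEBES_TEST_AWS_ACCESSKEY="your-access-key-id"
export CEBES_TEST_AWS_SECRETKEY="your-secret-access-key"
```

Keep real credentials out of version control; bin/env.sh is a local copy of the example file.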
The tests will run Spark in local mode.
You can run Cebes in a Docker container, locally or on a Spark cluster.
Cebes can be packaged in a Docker image with Spark running in local mode. To build the Docker image:
sbt clean compile assembly
docker build -t cebes -f docker/http-server/Dockerfile .
The docker image contains everything needed by Cebes, including a MariaDB instance. It can then be run as:
docker run -it -p 21000:21000 -p 4040:4040 --name cebes-server cebes
The Docker image exposes a data volume at /cebes/data containing the Cebes logs, the MariaDB databases and the Hive warehouse used by Spark. To persist the data, mount the volume to a local directory:
docker run -it -p 21000:21000 -p 4040:4040 -v $HOME/cebes-data:/cebes/data --name cebes-server cebes
To check if the Cebes server is up and running:
curl localhost:21000/version
{"api":"v1"}
The Spark UI can be accessed at http://localhost:4040
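The server may need a moment to start after `docker run`, so a script that depends on it could poll the version endpoint shown above instead of calling it once. The helper below is a sketch under that assumption: `wait_for_cebes` is a hypothetical name, not part of the Cebes scripts, and the attempt count and delay are arbitrary defaults.

```shell
# Hypothetical helper: poll the /version endpoint until it answers
# or the attempts run out. Not part of the Cebes scripts.
wait_for_cebes() {
  url="${1:-http://localhost:21000/version}"
  attempts="${2:-30}"   # number of tries before giving up
  delay="${3:-2}"       # seconds to wait between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors, -s: silent, -S: still report errors
    if response="$(curl -fsS "$url" 2>/dev/null)"; then
      echo "$response"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Cebes did not answer at $url after $attempts attempts" >&2
  return 1
}
```

Calling `wait_for_cebes` with no arguments checks localhost:21000; on success it prints the server's response and returns 0, otherwise it returns 1.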
Docker is the preferred option, but you can also run Cebes directly with Spark in local mode:
# start MySQL server
mysql.server restart
# compile and assemble Cebes
sbt clean compile assembly
# Download Spark and put it under ./spark
./bin/get-spark.sh
# submit Cebes to Spark.
./bin/start-cebes.sh
By default, the Cebes server listens on port 21000 (configurable).
To run Cebes on a Spark cluster, use the spark-submit script to submit the Cebes assembly jar like any other Spark application:
sbt clean compile assembly
CEBES_JAR=$(find ./cebes-http-server/target/scala-2.11 -name 'cebes-http-server-assembly-*.jar' | head -n 1)
${SPARK_HOME}/bin/spark-submit --class "io.cebes.server.Main" \
--master "yarn" \
--conf 'spark.driver.extraJavaOptions=-Dcebes.logs.dir=/tmp/' \
${CEBES_JAR}
See the Spark documentation for advanced options.
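If the assembly step was skipped or SPARK_HOME is unset, the submit command above fails with a fairly opaque error. A small pre-flight check can make that easier to diagnose. The function below is only a sketch; `check_submit_inputs` is a hypothetical name and is not shipped with Cebes.

```shell
# Hypothetical pre-flight check before calling spark-submit.
# $1: Spark home directory, $2: path to the Cebes assembly jar.
check_submit_inputs() {
  spark_home="$1"
  cebes_jar="$2"
  if [ -z "$spark_home" ] || [ ! -x "$spark_home/bin/spark-submit" ]; then
    echo "spark-submit not found under SPARK_HOME='$spark_home'" >&2
    return 1
  fi
  if [ -z "$cebes_jar" ] || [ ! -f "$cebes_jar" ]; then
    echo "Cebes assembly jar not found; run 'sbt clean compile assembly' first" >&2
    return 1
  fi
  return 0
}
```

It could be called as `check_submit_inputs "${SPARK_HOME}" "${CEBES_JAR}" || exit 1` right before the spark-submit line.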
Cebes uses something similar to guice-property for environment variables.
All the variables are defined in Property.java in the cebes-properties module.
By default, the whole project uses scala-logging with the slf4j-log4j12 backend.
The configuration of log4j can be found in log4j.properties in each module of the project.
During tests, the resulting log files are normally named ${cebes.log.dir}<module_name>-test.log.
In production, the resulting log file is named ${cebes.log.dir}cebes-http-server.log and is rolled daily.
Spark has some nasty dependencies (DataNucleus and parquet), which use either java.util.logging or hard-coded log4j. We tried our best to mute them in cebes-http-server with the log4j.properties and parquet.logging.properties files.
It seems impossible to mute them in cebes-spark, though.
Projects that expose RESTful APIs (cebes-http-server, cebes-pipeline-repository, cebes-pipeline-serving) include Swagger definitions in the src/swagger directory of each project. See https://swagger.io/ for tools that generate a browsable UI from them.
At the moment, some APIs might not be final and their Swagger documentation might not be there yet. Contribute if you find something missing!
Fork the project, and create an issue if you find a bug or have a feature request.
Join us on gitter to interact with people. If you prefer the good old way, drop a message to our mailing list.