Stratio / deep-spark

Connecting Apache Spark with different data stores [DEPRECATED]
http://stratio.github.io/deep-spark
Apache License 2.0

Disclaimer: As of 01/06/2015, this project has been deprecated. Thank you for your understanding and continued help throughout the project's life.

What is Deep?

Deep is a thin integration layer between Apache Spark and several NoSQL datastores. We currently support Apache Cassandra, MongoDB, ElasticSearch, Aerospike, HDFS, S3 and any database accessible through JDBC, and in the near future we will add support for several other datastores.

Install ojdbc driver

In order to compile the deep-jdbc module, it is necessary to add the Oracle ojdbc driver to your local repository. You can download it from http://www.oracle.com/technetwork/database/features/jdbc/default-2280470.html. On that page, click "Accept License Agreement" and then download the ojdbc7.jar library. You need a free Oracle account to download the official driver.

To install the ojdbc driver in your local repository you must execute the command below:

mvn install:install-file -Dfile= -DgroupId=com.oracle -DartifactId=ojdbc7 -Dversion=12.1.0.2 -Dpackaging=jar
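
Installing the jar into the local Maven repository only makes it available to the build. To confirm that the driver is also visible on a given runtime classpath, one option is to try loading the driver class. The following is a minimal sketch, assuming the usual Oracle driver class name `oracle.jdbc.OracleDriver`; the class `DriverCheck` and the helper `isDriverAvailable` are hypothetical names introduced for this illustration and are not part of Deep.

```java
// Hypothetical helper for illustration only; not part of Deep.
final class DriverCheck {

    // Returns true if the named class can be loaded from the current classpath.
    static boolean isDriverAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class is missing, or it was found but could not be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed driver class name for the ojdbc7 jar.
        String driver = "oracle.jdbc.OracleDriver";
        System.out.println(driver + " available: " + isDriverAvailable(driver));
    }
}
```

Run it with the ojdbc7 jar on the classpath; a `false` result suggests the jar is not being picked up at runtime.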

Compiling Deep

After that, you can compile Deep by executing the following steps:

cd deep-parent

mvn clean install

Creating Deep Distribution

If you want to create a Deep distribution you must execute the following steps:

cd deep-scripts

make-distribution-deep.sh

During the creation you'll see the following question:

What tag want to use for Aerospike native repository?

You must type 0.7.0 and press enter.

Apache Cassandra integration

The integration is not based on Cassandra's Hadoop interface.

Deep comes with a user-friendly API that lets developers create Spark RDDs mapped to Cassandra column families. We provide two different interfaces: one based on entities and one based on cells.

We encourage you to read the more comprehensive documentation hosted on the Openstratio website.

Deep comes with an example subproject called 'deep-examples' containing a set of working examples, both in Java and Scala. Please refer to the deep-examples project README for further information on how to set up a working environment.

MongoDB integration

The Spark-MongoDB connector is based on Hadoop-MongoDB.

Support for MongoDB was added in version 0.3.0.

We provide two different interfaces: one based on entities and one based on cells.

We added a few working examples for MongoDB in the deep-examples subproject. Take a look at:

Entities:

Cells:

You can check out our first steps guide here:

First steps with Deep-MongoDB

We are working on further improvements!

ElasticSearch integration

Support for ElasticSearch was added in version 0.5.0.

Aerospike integration

Support for Aerospike was added in version 0.6.0.

Examples:

Entities:

Cells:

JDBC integration

Support for JDBC was added in version 0.7.0.

Examples:

Entities:

Cells:

Requirements

Configure the development and test environment

Once you have a working development environment, you can finally start testing Deep. These are the basic steps you will always have to perform in order to use Deep:

First steps with Spark and Cassandra

First steps with Spark and MongoDB

Migrating from version 0.2.9

From version 0.4.x, Deep supports multiple datastores. To implement this feature correctly, Deep underwent a huge refactor between versions 0.2.9 and 0.4.x. To port your code to the new version, you should take into account a few changes we made.

New Project Structure

Because Deep supports multiple datastores from version 0.4.x onwards, your project should import only the Maven dependency it will use: deep-cassandra, deep-mongodb, deep-elasticsearch or deep-aerospike.

Changes to 'com.stratio.deep.entity.Cells'

Examples:

Cells cells1 = new Cells(); // instantiate a Cells object whose default table name is generated internally.
Cells cells2 = new Cells("my_default_table"); // creates a new Cells object whose default table name is specified by the user
cells2.add(new Cell(...)); // adds to the 'cells2' object a new Cell object associated to the default table
cells2.add("my_other_table", new Cell(...)); // adds to the 'cells2' object a new Cell associated to "my_other_table"  
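
The default-table behaviour described in the comments above can be modelled with a small, self-contained sketch. The classes below (`CellsSketch` and its nested `Cell` and `Cells`) are hypothetical stand-ins written for this illustration, not the Deep classes, and the two-argument `Cell` constructor is an assumption. They only show how a cell added without a table name could be stored under the default table, while a cell added with a table name goes to that table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration only; these are NOT the Deep classes.
final class CellsSketch {

    // A named value, standing in for a single column of a row.
    static final class Cell {
        final String name;
        final Object value;

        Cell(String name, Object value) {
            this.name = name;
            this.value = value;
        }
    }

    // Groups cells by table name and remembers which table is the default.
    static final class Cells {
        private final String defaultTable;
        private final Map<String, List<Cell>> cellsByTable = new LinkedHashMap<>();

        Cells(String defaultTable) {
            this.defaultTable = defaultTable;
        }

        // Adds a cell to the default table.
        void add(Cell cell) {
            add(defaultTable, cell);
        }

        // Adds a cell to an explicitly named table.
        void add(String table, Cell cell) {
            cellsByTable.computeIfAbsent(table, t -> new ArrayList<>()).add(cell);
        }

        // Returns a read-only view of the cells stored for one table (empty if none).
        List<Cell> cellsOf(String table) {
            List<Cell> cells = cellsByTable.get(table);
            return cells == null
                    ? Collections.<Cell>emptyList()
                    : Collections.unmodifiableList(cells);
        }
    }

    public static void main(String[] args) {
        Cells cells = new Cells("my_default_table");
        cells.add(new Cell("id", 1));                          // stored under the default table
        cells.add("my_other_table", new Cell("name", "deep")); // stored under the named table

        System.out.println("my_default_table: " + cells.cellsOf("my_default_table").size());
        System.out.println("my_other_table: " + cells.cellsOf("my_other_table").size());
    }
}
```

Keeping the table name as a map key is one simple way to let a single container hold cells from several tables. How Deep stores this internally is not specified here.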

Changes to objects hierarchy

RDD creation

The methods used to create Cell and Entity RDDs have been merged into a single method: