TIBCOSoftware / snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
http://www.snappydata.io
Other
1.04k stars 203 forks source link
analytics memory-database scale snappydata spark stream transaction

This repository is provided for legacy users and informational purposes only. It may contain security vulnerabilities in the code itself or its dependencies. TIBCO provides no updates, including security updates, to this code. Consistent with the terms of the Apache License 2.0 that apply to the TIBCO code in this repository, the code is provided on an "as is" basis, without any warranties or conditions of any kind and in no event and under no legal theory shall TIBCO be liable to you for damages arising as a result of the use or inability to use the code.

Introduction

SnappyData (aka TIBCO ComputeDB) is a distributed, in-memory optimized analytics database. SnappyData delivers high throughput, low latency, and high concurrency for unified analytics workload. By fusing an in-memory hybrid database inside Apache Spark, it provides analytic query processing, mutability/transactions, access to virtually all big data sources and stream processing all in one unified cluster.

One common use case for SnappyData is to provide analytics at interactive speeds over large volumes of data with minimal or no pre-processing of the dataset. For instance, there is no need to often pre-aggregate/reduce or generate cubes over your large data sets for ad-hoc visual analytics. This is made possible by smartly managing data in-memory, dynamically generating code using vectorization optimizations and maximizing the potential of modern multi-core CPUs. SnappyData enables complex processing on large data sets in sub-second timeframes.

SnappyData Positioning

!!!Note SnappyData is not another Enterprise Data Warehouse (EDW) platform, but rather a high performance computational and caching cluster that augments traditional EDWs and data lakes.

Important Capabilities

Downloading and Installing SnappyData

You can download and install the latest version of SnappyData from github. Refer to the documentation for installation steps.

Getting Started

Multiple options are provided to get started with SnappyData. Easiest way to get going with SnappyData is on your laptop. You can also use any of the following options:

You can find more information on options for running SnappyData here.

Quick Test to Measure Performance of SnappyData vs Apache Spark

If you are already using Apache Spark, you can experience upto 20x speedup for your query performance with SnappyData. Try this test using the Spark Shell.

Documentation

To understand SnappyData and its features refer to the documentation.

Other Relevant content

Community Support

We monitor the following channels comments/questions:

Link with SnappyData Distribution

Using Maven Dependency

SnappyData artifacts are hosted in Maven Central. You can add a Maven dependency with the following coordinates:

groupId: io.snappydata
artifactId: snappydata-cluster_2.11
version: 1.3.1

Also add cloudera repository to the set of Maven repositories to be searched:

  <repositories>
    <repository>
      <id>cloudera-repo</id>
      <name>cloudera repo</name>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
    </repository>
    ...
  </repositories>

Using Gradle Dependency

If you are using Gradle, add this to your build.gradle for core SnappyData artifacts:

dependencies {
  implementation 'io.snappydata:snappydata-core_2.11:1.3.1'
  ...
}

For additions related to SnappyData cluster, use:

dependencies {
  implementation 'io.snappydata:snappydata-cluster_2.11:1.3.1'
  ...
}

Also add cloudera repository to the set of Maven repositories to be searched:

repositories {
  mavenCentral()
  maven { url 'https://repository.cloudera.com/artifactory/cloudera-repos' }
  ...
}

Using SBT Dependency

If you are using SBT, add this line to your build.sbt for core SnappyData artifacts:

libraryDependencies += "io.snappydata" % "snappydata-core_2.11" % "1.3.1"

For additions related to SnappyData cluster, use:

libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "1.3.1"

Also add cloudera repository to the set of Maven repositories to be searched:

resolvers += "Cloudera Repo" at "https://repository.cloudera.com/artifactory/cloudera-repos"

You can find more specific SnappyData artifacts here

!!!Note If your project fails when resolving the above dependency (that is, it fails to download javax.ws.rs#javax.ws.rs-api;2.1), it may be due an issue with its pom file.
As a workaround, you can add the below code to your build.sbt:

val workaround = {
  sys.props += "packaging.type" -> "jar"
  ()
}

For more details, refer https://github.com/sbt/sbt/issues/3618.

Building from Source

If you would like to build SnappyData from source, refer to the documentation on building from source.

How is SnappyData Different than Apache Spark?

Apache Spark is a general purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, this can also be quite inefficient and expensive. Analytic processing requires massive data sets to be repeatedly copied and data to be reformatted to suit Apache Spark. In many cases, it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, it necessitates streaming the entire table into Apache Spark to do the aggregation. Caching within Apache Spark is immutable and results in stale insight.

The SnappyData Approach

Snappy Architecture

SnappyData Architecture

SnappyData takes a different approach. SnappyData fuses a low latency, highly available in-memory transactional database ((Pivotal GemFire/Apache Geode) into Apache Spark with shared memory management and optimizations. Data can be managed in columnar form similar to Apache Spark caching or in a row oriented manner commonly used in popular relational databases like postgres). But, many query engine operators are significantly more optimized through better vectorization, code generation and indexing.
The net effect is, an order of magnitude performance improvement when compared to native Apache Spark caching, and more than two orders of magnitude better performance when Apache Spark is used in conjunction with external data sources. Apache Spark is turned into an in-memory operational database capable of transactions, point reads, writes, working with Streams (Apache Spark) and running analytic SQL queries without losing the computational richness in Apache Spark.

Streaming Example - Ad Analytics

Here is a stream + Transactions + Analytics use case example to illustrate the SQL as well as the Apache Spark programming approaches in SnappyData - Ad Analytics code example. Here is a screencast that showcases many useful features of SnappyData. The example also goes through a benchmark comparing SnappyData to a Hybrid in-memory database and Cassandra.

Contributing to SnappyData

If you are interested in contributing, please visit the community page for ways in which you can help.