SANSA-Stack / Archived-SANSA-Query

SANSA Query Layer
Apache License 2.0
31 stars 13 forks source link
distributed-computing flink partitioning rdf spark sparql

Archived Repository - Do not use this repository anymore!

SANSA got easier to use! All its code has been consolidated into a single repository at https://github.com/SANSA-Stack/SANSA-Stack

SANSA Query

Maven Central Build Status License Twitter

Description

SANSA Query is a library to perform SPARQL queries over RDF data using big data engines Spark and Flink. It allows to query RDF data that resides both in HDFS and in a local file system. Queries are executed distributed and in parallel across Spark RDDs/DataFrames or Flink DataSets. Further, SANSA-Query can query non-RDF data stored in databases e.g., MongoDB, Cassandra, MySQL or file format Parquet, using Spark.

For RDF data, SANSA uses vertical partitioning (VP) approach and is designed to support extensible partitioning of RDF data. Instead of dealing with a single triple table (s, p, o), data is partitioned into multiple tables based on the used RDF predicates, RDF term types and literal datatypes. The first column of these tables is always a string representing the subject. The second column always represents the literal value as a Scala/Java datatype. Tables for storing literals with language tags have an additional third string column for the language tag. Its uses Sparqlify as a scalable SPARQL-SQL rewriter.

For heterogeneous data sources (data lake), SANSA uses virtual property tables (PT) partitioning, whereby data relevant to a query is loaded on the fly into Spark DataFrames composed of attributes corresponding to the properties of the query.

SANSA Query SPARK - RDF

On SANSA Query Spark for RDF the method for partitioning an RDD[Triple] is located in RdfPartitionUtilsSpark. It uses an RdfPartitioner which maps a Triple to a single RdfPartition instance.

See the available layouts for details.

SANSA Query SPARK - Heterogeneous Data Sources

SANSA Query Spark for heterogeneous data sources (data data) is composed of three main components:

Usage

The following Scala code shows how to query an RDF file SPARQL syntax (be it a local file or a file residing in HDFS):


val spark: SparkSession = ...

val lang = Lang.NTRIPLES
val triples = spark.rdf(lang)("path/to/rdf.nt")

val partitions = RdfPartitionUtilsSpark.partitionGraph(triples)
val rewriter = SparqlifyUtils3.createSparqlSqlRewriter(spark, partitions)

val qef = new QueryExecutionFactorySparqlifySpark(spark, rewriter)

val port = 7531
val server = FactoryBeanSparqlServer.newInstance.setSparqlServiceFactory(qef).setPort(port).create()
server.join()

An overview is given in the FAQ section of the SANSA project page. Further documentation about the builder objects can also be found on the ScalaDoc page.

For querying heterogeneous data sources, refer to the documentation of the dedicated SANSA-DatLake component.

How to Contribute

We always welcome new contributors to the project! Please see our contribution guide for more details on how to get started contributing to SANSA.