Merck / Halyard

Halyard is an extremely horizontally scalable Triplestore with support for Named Graphs, designed for integration of extremely large Semantic Data Models, and for storage and SPARQL 1.1 querying of the whole Linked Data universe snapshots.
https://merck.github.io/Halyard
Apache License 2.0

Integrate Halyard with SANSA Stack #71

Open asotona opened 5 years ago

asotona commented 5 years ago

Halyard is a powerful distributed triplestore, instantly answering the majority of SPARQL queries; however, it is weak in some complex operations (like ORDER BY and GROUP BY), and it is complicated to implement custom code that goes beyond SPARQL. SANSA Stack (and similar Spark-based SPARQL frameworks) seems to be complementary to Halyard: powerful in ordering and aggregations, and easy to integrate custom transformation logic into the pipeline, yet slow in ad-hoc SPARQL queries and unable to serve as a SPARQL endpoint.

The idea is to provide a hybrid solution, where SANSA Stack (or any other Spark framework) can directly use Halyard data and the Halyard query engine as a (distributed) source of RDF data for further processing.

  1. The minimal implementation is to provide a Halyard library for Spark, so SANSA Stack can directly consume Halyard data (read the RDF data directly from HBase) and can call the Halyard SPARQL query engine (consume the results of a Halyard SPARQL graph query locally and directly); a rough sketch of this option follows below.
  2. An integrated solution would require including Halyard as a service provider in the SANSA SPARQL query engine, so hybrid access to Halyard data from SANSA would be available inside SANSA SPARQL as a federated service provider.
  3. The optimal solution would also include transparent integration of Halyard SPARQL parallelization (similar to the halyard:forkAndFilterBy function used in Halyard Bulk Export), so the Spark engine would be able to manage Halyard parallelization directly (transparently for the user).

This is an idea of a potential synergy between Halyard and SANSA Stack that seems worth testing.
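As a starting point for option 1, here is a minimal, untested sketch of what such a Halyard-for-Spark helper could look like: it evaluates a SPARQL SELECT query through the standard RDF4J Repository API on the driver and turns the bindings into a Spark RDD. Building the repository is left to the caller, because the HBaseSail constructor arguments differ between Halyard versions; `selectToRdd` is an illustrative name, not an existing API.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.eclipse.rdf4j.query.QueryLanguage
import org.eclipse.rdf4j.repository.sail.SailRepository

import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer

object HalyardSparkSource {

  // Evaluates a SPARQL SELECT query against a Halyard-backed RDF4J repository
  // on the driver and exposes the bindings as an RDD of (variable -> value) maps.
  // `repo` is assumed to wrap Halyard's HBaseSail; its construction is omitted here.
  def selectToRdd(spark: SparkSession,
                  repo: SailRepository,
                  sparql: String): RDD[Map[String, String]] = {
    val conn = repo.getConnection
    try {
      val result = conn.prepareTupleQuery(QueryLanguage.SPARQL, sparql).evaluate()
      val rows = ListBuffer.empty[Map[String, String]]
      try {
        while (result.hasNext) {
          val bs = result.next()
          rows += bs.getBindingNames.asScala
            .map(n => n -> bs.getValue(n).stringValue()).toMap
        }
      } finally result.close()
      // Bindings are materialized on the driver first; fine for moderate result
      // sizes, but not for whole-graph exports (that is what option 3 addresses).
      spark.sparkContext.parallelize(rows.toList)
    } finally conn.close()
  }
}
```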

peterjohnlawrence commented 4 years ago

Have you taken this idea any further, got any sample code, etc? I am interested in prototyping something, so anything to help me kick start would be useful.

asotona commented 4 years ago

It is just an idea: implement a SANSA-compatible Spark library that includes Halyard as a node-local SPARQL evaluation engine (not pointing to some central endpoint, but an actual in-place query evaluation engine communicating directly with HBase, as used in several existing Halyard MapReduce applications). SANSA could then decide to direct whole queries, query fragments, subqueries, or just specific statement patterns to Halyard, based on the knowledge that the actual data files are also indexed in Halyard. Any SANSA user could then decide which datasets are worth indexing in Halyard and which will be evaluated dynamically by SANSA.

As an example, suppose you have some custom, frequently changing data and you want to analyze it with the help of, say, DBpedia. Halyard today requires you to index both first (and to periodically re-index the rapidly changing data), then run the queries; and if your analytics requires something outside of SPARQL, you are probably doomed. SANSA, on the other hand, requires a lot of computation power and will full-scan all of the DBpedia data whenever any query merely touches it. A hybrid solution would allow you to index DBpedia in Halyard, and SANSA would delegate the related queries, subqueries, or statement-pattern requests to the embedded Halyard library, which would fetch the data directly from HBase and evaluate them using the Spark compute node's resources as an inlined Spark function. Later, the existing Halyard MapReduce applications could also be moved to Spark/SANSA, so even indexing might become an integral part.
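To illustrate the "inlined Spark function" part, here is a hedged sketch of executor-side delegation: each Spark partition opens its own Halyard-backed RDF4J repository and resolves statement patterns for its subjects directly against HBase. `openRepo` is a hypothetical factory the caller must supply (its implementation is Halyard-version specific), and the connection handling is simplified.

```scala
import org.apache.spark.rdd.RDD
import org.eclipse.rdf4j.query.QueryLanguage
import org.eclipse.rdf4j.repository.Repository

object HalyardLookup {

  // For every subject IRI in `subjects`, look up its triples in a Halyard-indexed
  // dataset (e.g. DBpedia), opening one repository per Spark partition so the
  // SPARQL evaluation runs on the executors, close to the HBase region servers.
  // `openRepo` is a placeholder: it must build a repository backed by Halyard's
  // HBaseSail on the executor; its construction is omitted here.
  def enrichWithHalyard(subjects: RDD[String],
                        openRepo: () => Repository): RDD[(String, String, String)] =
    subjects.mapPartitions { part =>
      val repo = openRepo()
      val conn = repo.getConnection
      try {
        // Materialize the partition's results so the connection can be closed safely.
        part.flatMap { s =>
          val res = conn.prepareTupleQuery(
            QueryLanguage.SPARQL, s"SELECT ?p ?o WHERE { <$s> ?p ?o }").evaluate()
          val buf = scala.collection.mutable.ListBuffer.empty[(String, String, String)]
          try {
            while (res.hasNext) {
              val bs = res.next()
              buf += ((s, bs.getValue("p").stringValue(), bs.getValue("o").stringValue()))
            }
          } finally res.close()
          buf.toList
        }.toList.iterator
      } finally {
        conn.close()
        repo.shutDown()
      }
    }
}
```

With something like this in place, SANSA's planner could route just the statement patterns that hit Halyard-indexed datasets through such a lookup, while everything else stays in plain Spark.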


peterjohnlawrence commented 4 years ago

I'm not familiar with SANSA. It looks like it partitions its RDF vertically, unlike Halyard. Because of this, as I understand it, it prefers to reload the RDF from the RDF source and partition it into predicate tables with the subject as key and the object as value. I assume you would not want to reload the Halyard triples simply to repartition them. Therefore, are you suggesting that Halyard be injected into the SANSA query planner/executor so that SANSA uses Halyard's SPARQL in preference?
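For reference, this is roughly what I mean by vertical partitioning, sketched in plain Spark SQL (the column and table layout here is illustrative, not SANSA's actual internals):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object VerticalPartitioning {

  // Split a triples DataFrame with columns (s, p, o) into one (s, o) table per
  // predicate (subject as key, object as value), as described above.
  def partitionByPredicate(spark: SparkSession, triples: DataFrame): Map[String, DataFrame] = {
    import spark.implicits._
    val predicates = triples.select($"p").distinct().as[String].collect()
    predicates.map(p => p -> triples.filter($"p" === p).select($"s", $"o")).toMap
  }
}
```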

There are a few problems that SPARQL does not solve. For example, shortest path requires, at a minimum, iterative SPARQL. I am hoping that SANSA on Spark would offer an alternative.
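For the shortest-path case, something like GraphX's built-in ShortestPaths might already cover it, assuming the RDF resources have been mapped to numeric vertex ids first (that mapping step is omitted here):

```scala
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.lib.ShortestPaths
import org.apache.spark.rdd.RDD

object RdfShortestPath {

  // Unweighted shortest-path distances from every vertex to a set of target
  // resources, the kind of iterative computation a single SPARQL query cannot
  // express. Edges are (subjectId, objectId) pairs for one or more predicates,
  // with IRIs already mapped to numeric ids (e.g. via zipWithUniqueId).
  def distancesTo(edges: RDD[(VertexId, VertexId)],
                  targets: Seq[VertexId]): RDD[(VertexId, Map[VertexId, Int])] = {
    val graph = Graph.fromEdgeTuples(edges, defaultValue = 0)
    ShortestPaths.run(graph, targets).vertices
  }
}
```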