SANSA-Stack / Archived-SANSA-Stack

Comprises the whole SANSA stack
Apache License 2.0

Consider Apache Beam? #1

Open ghost opened 6 years ago

ghost commented 6 years ago

Apache Beam is a unified interface for batch and streaming with multiple backends: Apache Spark, Apache Flink, Apache Apex, Apache Gearpump and Google Cloud Dataflow.

It could deduplicate code across the different backends. Apache Zeppelin also supports a Beam interpreter.
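
To illustrate that point (my own sketch, not code from this project): in Beam, the same pipeline definition can run on different backends just by selecting another runner via the pipeline options, provided the runner's dependency is on the classpath. The snippet below uses the Beam Java SDK from Scala; the object name and file paths are placeholders.

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory

// Sketch: the same pipeline runs on Spark, Flink or Dataflow depending on the
// --runner flag (e.g. --runner=SparkRunner), without touching the pipeline code.
object CopyLines {
  def main(args: Array[String]): Unit = {
    val options = PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
    val p = Pipeline.create(options)
    p.apply(TextIO.read().from("input.txt"))   // placeholder input
     .apply(TextIO.write().to("output"))       // placeholder output prefix
    p.run().waitUntilFinish()
  }
}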

GezimSejdiu commented 6 years ago

Hi @t3476, many thanks for the suggestion. Yes, indeed, we considered Apache Beam after it came out at the end of 2016.

Since we provide the same functionality (or at least try to keep it aligned as much as we can) on both engines (Apache Spark and Apache Flink), we were also thinking of having an underlying layer that works on both runners with a single pipeline. The good news was that Apache Beam came out, and we have been looking at it to see whether we can adapt it to our framework as well.

We are still discussing the possibility of switching to Apache Beam, and we are looking forward to seeing whether Apache Beam offers a Scala API as well, since most of our code is written in Scala.

We will have to see how easy it is to re-implement the functionality SANSA provides on top of the Apache Beam framework.

We will keep posting here once we decide on something. It would be great to keep this option open and to discuss the benefits of using one engine that covers most of the distributed frameworks out there.

Best,

ghost commented 6 years ago

Hi @GezimSejdiu, have you checked out scio, the Scala API for Apache Beam? It is very feature-rich, and its examples look very much like Apache Spark's. I hope this increases the possibility.
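
For a concrete impression (my own sketch, not from the thread): a minimal scio word count, whose style indeed resembles Spark. The input and output paths are placeholders passed on the command line, and whether the context is finished with sc.run() or sc.close() depends on the scio version.

import com.spotify.scio._

object WordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)
    sc.textFile(args("input"))                        // read lines, much like Spark's sc.textFile
      .flatMap(_.split("""\s+""").filter(_.nonEmpty))
      .countByValue                                   // SCollection[(String, Long)]
      .map { case (word, count) => s"$word\t$count" }
      .saveAsTextFile(args("output"))
    sc.run()                                          // sc.close() in older scio versions
  }
}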

JensLehmann commented 6 years ago

We might consider this for the SANSA 0.5 release (December 2018). It's under discussion until then.

JensLehmann commented 5 years ago

We finally decided not to support this in the 0.5 release of SANSA.

Aklakan commented 5 years ago

Because Linux tooling is often much faster than dedicated Big Data frameworks for most conventional workloads (several GB of data), it would be intriguing to see whether the performance of some data processing workflows could be maximized if they were written in Beam and run with a "SystemEnvironmentRunner", which uses Linux tooling and pipes.

So running a workflow (pseudocode) such as

// RDFReader and Sorter are hypothetical transforms
Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("someFile.ttl"))
  .apply(RDFReader.readTriplesFrom(Lang.TURTLE))
  .apply(Sorter.create())
  .apply(TextIO.write().to("outputFile.nt"));

should give:

cat someFile | rapper -i ttl -o ntriples - http://foo | sort -u > outputFile.nt

In principle, operators for certain common operations would have to be implemented. The idea would be "write once, run everywhere", but in the case of integrating Linux tooling, the effort needed to (a) implement new operators that (b) have roughly portable semantics and (c) actually execute such a workflow might be too high.

Yet, it might be interesting to see whether Beam would in principle allow for doing this.
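
As a rough sketch of what such a runner might do internally (my own illustration; a "SystemEnvironmentRunner" does not exist in Beam): the translated workflow could be executed as a plain shell pipe from the JVM, e.g. via scala.sys.process, reusing the tools and file names from the example above.

import scala.sys.process._

// Sketch only: manually wiring the shell equivalent of the pseudocode pipeline above.
// A real runner would generate such pipes from the Beam transforms.
object ShellPipeSketch {
  def main(args: Array[String]): Unit = {
    val exitCode =
      (Seq("cat", "someFile.ttl") #|
       Seq("rapper", "-i", "ttl", "-o", "ntriples", "-", "http://foo") #|
       Seq("sort", "-u") #> new java.io.File("outputFile.nt")).!
    println(s"finished with exit code $exitCode")
  }
}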

ghost commented 5 years ago

@Aklakan Apache Beam includes the Direct Runner, which executes pipelines on your local machine.

Aklakan commented 5 years ago

So I have rolled a module which I guess is conceptually related to Apache Beam when it comes to dataset processing:

https://github.com/SmartDataAnalytics/jena-sparql-api/tree/develop/jena-sparql-api-conjure

Workflows are assembled in RDF using static factory classes - just like in Beam. Furthermore, it seems that my terminology can be mapped to Beam's: Executor = Runner, Workflow = Pipeline.

My implementation, however, is native RDF.

Aklakan commented 5 years ago

@Aklakan Apache Beam includes the Direct Runner, which executes pipelines on your local machine.

Cool, need to check that out