Hi @t3476, many thanks for the suggestion. Yes, indeed we have considered Apache Beam since it came out at the end of 2016.
Since we provide the same functionality (or at least try to keep it aligned as much as we can) on both engines (Apache Spark and Apache Flink), we were also thinking of having an underlying engine which works on both runners with a single pipeline. The good news was that Apache Beam came out, and we have been looking at whether we can adapt it to our framework as well.
We are still discussing the possibility of switching to Apache Beam, and looking forward to seeing whether Apache Beam supports a Scala API as well, since most of our code is written in Scala.
We will have to see how easy it is to re-implement the functionality we have in SANSA on the Apache Beam framework.
We will keep posting here once we decide on something. It would be great to keep this option open and discuss the benefits of using one engine which covers most of the distributed frameworks out there.
Best,
We might consider this for the SANSA 0.5 release (December 2018). It's under discussion until then.
We finally decided not to support this in the 0.5 release of SANSA.
Because Linux tooling is often much faster than dedicated Big Data frameworks for conventional workloads (several GB of data), it would be interesting to see whether the performance of some data processing workflows could be maximized by writing them in Beam and running them with a hypothetical `SystemEnvironmentRunner`, which uses Linux tooling and pipes.
So running a workflow (pseudocode) such as
```java
// Pseudocode: RDFReader and Sorter are hypothetical transforms that would have to be implemented
pipeline.apply(TextIO.read().from("someFile.ttl"))
        .apply(RDFReader.readTriplesFrom(Lang.TURTLE))
        .apply(Sorter.create())
        .apply(TextIO.write().to("outputFile.nt"));
```
should give:
```bash
cat someFile.ttl | rapper -i turtle -o ntriples - http://foo | sort -u > outputFile.nt
```
In principle, operators for certain common operations would have to be implemented. The idea would be "write once, run everywhere", but in the case of integrating Linux tooling, the effort needed to (a) implement new operators, (b) give them roughly portable semantics, and (c) actually execute such a workflow might be too high.
Yet, it might be interesting to see whether Beam would in principle allow for doing this.
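As a very rough sketch of a much more modest variant of this idea (wrapping a single external tool inside a DoFn, rather than having a dedicated runner translate the whole pipeline into one shell pipeline), something like the hypothetical `ShellPipe` transform below could already run on the existing runners. It is only an illustration, not part of Beam or SANSA:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical transform that pipes elements through an external command.
// It only wraps a single tool inside a DoFn; it does NOT translate the
// whole pipeline into a shell pipeline as a dedicated runner would.
public class ShellPipe extends PTransform<PCollection<String>, PCollection<String>> {
  private final List<String> command;

  private ShellPipe(List<String> command) {
    this.command = command;
  }

  public static ShellPipe of(String... command) {
    return new ShellPipe(Arrays.asList(command));
  }

  @Override
  public PCollection<String> expand(PCollection<String> input) {
    return input.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(@Element String line, OutputReceiver<String> out) throws Exception {
        // Naive per-element invocation; a real implementation would stream
        // a whole bundle or partition through a single process.
        Process process = new ProcessBuilder(command).start();
        try (BufferedWriter stdin = new BufferedWriter(
                 new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8));
             BufferedReader stdout = new BufferedReader(
                 new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
          stdin.write(line);
          stdin.newLine();
          stdin.close(); // signal EOF so the tool can produce its output
          String result;
          while ((result = stdout.readLine()) != null) {
            out.output(result);
          }
        }
        process.waitFor();
      }
    }));
  }
}
```

Usage would look like `input.apply(ShellPipe.of("sort", "-u"))`, although per-element invocation obviously defeats the purpose for operators like sort; a real `SystemEnvironmentRunner` would have to translate whole pipeline stages into a single shell pipeline instead.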
@Aklakan Apache Beam includes the Direct Runner, which runs pipelines on your local machine.
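For reference, a minimal example of explicitly selecting it (requires the beam-runners-direct-java artifact on the classpath; the file names are just placeholders taken from the pseudocode above):

```java
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DirectRunnerExample {
  public static void main(String[] args) {
    // The Direct Runner executes the pipeline locally, in a single JVM.
    PipelineOptions options = PipelineOptionsFactory.create();
    options.setRunner(DirectRunner.class);

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.read().from("someFile.ttl"))
     .apply(TextIO.write().to("outputFile"));
    p.run().waitUntilFinish();
  }
}
```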
So I have rolled a module which I guess is conceptually related to Apache Beam when it comes to dataset processing:
https://github.com/SmartDataAnalytics/jena-sparql-api/tree/develop/jena-sparql-api-conjure
Workflows are assembled in RDF using static factory classes - just like in Beam. Furthermore, it seems that my terminology can be mapped to it: Executor = Runner, Workflow = Pipeline.
My implementation, however, is native RDF.
> @Aklakan Apache Beam includes the Direct Runner, which runs pipelines on your local machine.
Cool, need to check that out
Apache Beam is a unified interface for batch and streaming processing with multiple backends: Apache Spark, Apache Flink, Apache Apex, Apache Gearpump and Google Cloud Dataflow.
It could eliminate duplicated code across the different backends. Apache Zeppelin also supports a Beam interpreter.
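To illustrate the point about sharing code across backends, a minimal sketch where the backend is chosen purely via the `--runner` flag; the input/output paths and the counting step are just placeholders:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PortablePipeline {
  public static void main(String[] args) {
    // The backend is picked at launch time, e.g. --runner=DirectRunner,
    // --runner=SparkRunner or --runner=FlinkRunner (with the matching
    // runner artifact on the classpath); the pipeline code stays the same.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.read().from("input.nt"))          // placeholder input
     .apply(Count.perElement())                      // count duplicate lines
     .apply(MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + "\t" + kv.getValue()))
     .apply(TextIO.write().to("lineCounts"));        // placeholder output prefix
    p.run().waitUntilFinish();
  }
}
```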