arangodb / arangodb-spark-connector


what's the common need to use spark #13

Closed · codemayq closed this 2 years ago

codemayq commented 5 years ago

Thanks for your team's hard work. ArangoDB is one of the most useful pieces of software I have ever used.

But I have a question that may need the team's response: in what kinds of production scenarios should we load data from ArangoDB into Spark?

One possible scenario is running aggregations over large-scale data: we can use Spark's distributed processing to speed up the computation, because ArangoDB is not well suited to heavy analytical jobs such as GROUP BY and aggregation. But the problem is that we also have to load that large-scale data before we can compute, so network and disk I/O may be a large cost.
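For illustration, here is a minimal sketch of that workflow with this connector's Scala API. The class and configuration names follow the connector's README as I recall it and may differ by version; the `events` collection, the `Event` fields, and the credentials are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.arangodb.spark.{ArangoSpark, ReadOptions}

// Hypothetical document shape for an "events" collection.
case class Event(userId: String, amount: Double)

object AggregateExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("arango-aggregation")
      .set("arangodb.hosts", "127.0.0.1:8529") // assumed property name, check the README
      .set("arangodb.user", "root")
      .set("arangodb.password", "")
    val sc = new SparkContext(conf)

    // Pull the collection into an RDD, then do the heavy GROUP BY in Spark.
    val events = ArangoSpark.load[Event](sc, "events", ReadOptions(database = "myDB"))
    val totals = events
      .map(e => (e.userId, e.amount))
      .reduceByKey(_ + _) // distributed aggregation, the part ArangoDB is weak at

    totals.take(10).foreach(println)
    sc.stop()
  }
}
```

The load itself still pays the network cost the question mentions; the win is that the shuffle-heavy aggregation then runs distributed across the Spark cluster.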

zawlazaw commented 5 years ago

In our company, we are currently evaluating ArangoDB as our primary analytics datastore, for persisting incoming events and data enrichments, and as a time-evolving graph database. Besides that, we plan to use Spark for regular batch jobs such as reporting, data quality assurance, model training, etc. This is not only because of Spark's type-safe, convenient, and fast distributed processing, but also because it can persist intermediate stages during a job and can easily write results to various sinks (e.g., complex aggregated information can be written in a star schema or as a contingency table to Postgres for later interactive dashboards via Tableau or Metabase).
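As a small sketch of that last step, an aggregated DataFrame can be handed to Postgres with Spark's standard JDBC writer. The URL, table name, and credentials below are placeholders; only the writer API itself is standard Spark SQL:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeToPostgres(agg: DataFrame): Unit =
  agg.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://reporting-db:5432/analytics") // placeholder host/db
    .option("dbtable", "daily_event_counts")                        // placeholder table
    .option("user", "reporter")
    .option("password", sys.env.getOrElse("PG_PASSWORD", ""))
    .mode(SaveMode.Append) // append one batch run's results per job
    .save()
```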

However, this indeed requires copying all the data (at least the relevant parts after filtering) from ArangoDB to Spark. So far I believe this is acceptable, and its cost should be roughly comparable to one Spark shuffle. It would be interesting, though, to know how one could ensure that RDD partitions are created on the same physical machines as the corresponding ArangoDB shards, in order to avoid network traffic when initially creating the RDD.
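One way to at least inspect this is to look at the locality hints Spark sees for each partition. Whether the connector actually populates these with the shard hosts depends on its implementation; this sketch only shows how to read what Spark knows:

```scala
import org.apache.spark.rdd.RDD

// Print the preferred hosts Spark has recorded for each partition of an RDD.
// An empty list means Spark has no locality hint and may schedule the task anywhere.
def printLocality[T](rdd: RDD[T]): Unit =
  rdd.partitions.foreach { p =>
    println(s"partition ${p.index}: preferred hosts = ${rdd.preferredLocations(p)}")
  }
```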

Before the current approach, we first tried to avoid this data copy by choosing MongoDB with its nice monadic-like aggregation pipeline, which can run advanced computations in-database. But it turned out that our advanced aggregations are much simpler to implement in Spark than via MongoDB's aggregation pipeline. So we will hopefully be happy with separating concerns: ArangoDB for storage, ad-hoc data inspection, and graphs; Spark for complex batch processing.

rashtao commented 2 years ago

This library has now been deprecated in favor of the new ArangoDB Datasource for Apache Spark. If this issue is still relevant, please reopen it in the new connector's project.
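For readers landing here later, a minimal read with the newer datasource looks roughly like the sketch below. The option keys (`endpoints`, `database`, `table`) follow that project's documentation as I recall it and should be verified against its repo; the hosts, credentials, and collection are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("arango-datasource").getOrCreate()

val df = spark.read
  .format("com.arangodb.spark")             // datasource name per the new project's docs
  .option("endpoints", "coordinator1:8529") // placeholder coordinator address
  .option("database", "myDB")
  .option("table", "events")                // the collection to read
  .option("user", "root")
  .option("password", "")
  .load()

// The heavy grouping/aggregation then happens in Spark, as discussed above:
df.groupBy("userId").sum("amount").show()
```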