ozgune opened this issue 8 years ago (status: Open)
We heard requests for native Spark integration with Citus from two customers (+2).
I'm also copying/pasting a conversation from https://github.com/citusdata/citus_docs/issues/209 into this comment as feedback on Spark integration.
I'm using Citus together with Slurm and Spark (about 400 compute cores in total), so I have many parallel SELECT and INSERT/UPDATE queries and many concurrent transactions. Most of my tables are relatively small, but some have more than 100 million entries. I would like to stick with PostgreSQL for this data instead of using Cassandra or similar, because I need to perform spatial queries (for which I use PostGIS). My goal is to horizontally scale out INSERT and SELECT statements. It seems that I'm able to achieve this with Citus. However, I have to give up FOREIGN KEY constraints and thus suffer in terms of data integrity (e.g. I can no longer rely on ON DELETE CASCADE, but need to implement this logic on the application side, as sketched below).
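A minimal sketch of what that application-side cascade might look like, assuming a hypothetical orders / order_items schema keyed by order_id and plain JDBC (these names are illustrative, not from the thread): delete the dependent rows and then the parent row in a single transaction.

```scala
import java.sql.DriverManager

// Sketch of application-side "ON DELETE CASCADE" when the FK constraint is unavailable:
// delete child rows, then the parent row, inside one transaction.
// Table/column names (orders, order_items, order_id) and the JDBC URL are assumptions.
object CascadeDelete {
  def deleteOrder(jdbcUrl: String, orderId: Long): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      conn.setAutoCommit(false)

      val delItems = conn.prepareStatement("DELETE FROM order_items WHERE order_id = ?")
      delItems.setLong(1, orderId)
      delItems.executeUpdate()

      val delOrder = conn.prepareStatement("DELETE FROM orders WHERE order_id = ?")
      delOrder.setLong(1, orderId)
      delOrder.executeUpdate()

      conn.commit()
    } catch {
      case e: Throwable =>
        conn.rollback()
        throw e
    } finally {
      conn.close()
    }
  }
}
```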
Is this idea now collecting dust? I'm still looking for options for using Spark efficiently with Citus.
@dimon222, could you explain a bit what kind of integration you are looking for, and what you are trying to accomplish?
Here is a way to use Citus with Spark: https://github.com/koeninger/spark-citus
Is there any plan to create a custom Spark data source implementation? We currently use spark-citus, but it only works for writing data, not for reading.
It would be really cool if Spark could simply load the Citus shards into RDD partitions, do some transformations (maybe ML things), and then store the results back, possibly even as an upsert.
Has anyone tried to load Citus shards in parallel using Spark?
Has anyone explored this? We are also using Spark and are looking for the above feature, which would be of great value.
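For what it's worth, here is a rough sketch of one way the shard-parallel read could work, assuming a hash-distributed events(id, payload) table, a database named app, and direct JDBC access to the workers (all of these are assumptions, not an existing API): list the shard placements from the coordinator's pg_dist_shard / pg_dist_shard_placement metadata, then give each placement its own Spark partition that scans the physical shard table (tablename_shardid) on its worker.

```scala
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

object CitusShardRead {
  // One shard placement: which worker holds which shard.
  case class Placement(shardId: Long, host: String, port: Int)

  // Enumerate shard placements from the coordinator's Citus metadata.
  def placements(coordinatorUrl: String, table: String): Seq[Placement] = {
    val conn = DriverManager.getConnection(coordinatorUrl)
    try {
      val rs = conn.createStatement().executeQuery(
        s"""SELECT s.shardid, p.nodename, p.nodeport
            FROM pg_dist_shard s
            JOIN pg_dist_shard_placement p ON p.shardid = s.shardid
            WHERE s.logicalrelid = '$table'::regclass""")
      val out = scala.collection.mutable.ArrayBuffer.empty[Placement]
      while (rs.next()) out += Placement(rs.getLong(1), rs.getString(2), rs.getInt(3))
      out.toSeq
    } finally conn.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("citus-shard-read").getOrCreate()
    val shards = placements("jdbc:postgresql://coordinator:5432/app", "events")

    // One Spark partition per shard placement; each task connects to the worker
    // that holds the shard and scans the shard table directly (events_<shardid>).
    val rows = spark.sparkContext
      .parallelize(shards, shards.size)
      .flatMap { pl =>
        val conn = DriverManager.getConnection(s"jdbc:postgresql://${pl.host}:${pl.port}/app")
        try {
          val rs = conn.createStatement()
            .executeQuery(s"SELECT id, payload FROM events_${pl.shardId}")
          val buf = scala.collection.mutable.ArrayBuffer.empty[(Long, String)]
          while (rs.next()) buf += ((rs.getLong(1), rs.getString(2)))
          buf
        } finally conn.close()
      }

    println(s"rows read: ${rows.count()}")
    spark.stop()
  }
}
```

This sketch ignores shard placement replication and pushes no filters down to the workers; a proper Spark data source would handle both, but it hopefully illustrates the one-partition-per-shard read idea.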
We could consider writing a spark_fdw (foreign data wrapper) so that PostgreSQL can query data that lives in Spark.
Or we could build a tight integration between Spark and PostgreSQL / Citus. In this scenario, Spark manages distributed roll-ups and PostgreSQL acts as the presentation layer.
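As a sketch of that second scenario (URLs, table and column names are assumptions, not an existing integration): Spark reads the raw data over JDBC, computes a daily roll-up, and appends it to a table that PostgreSQL / Citus then serves to the presentation layer.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._

object DailyRollup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("daily-rollup").getOrCreate()

    val url = "jdbc:postgresql://coordinator:5432/app"
    val props = new java.util.Properties()
    props.setProperty("user", "app")
    props.setProperty("driver", "org.postgresql.Driver")

    // Raw events live in Citus; Spark pulls them over JDBC.
    val events = spark.read.jdbc(url, "events", props)

    // Spark does the heavy aggregation work (the distributed roll-up).
    val daily = events
      .groupBy(to_date(col("created_at")).as("day"), col("customer_id"))
      .agg(count("*").as("event_count"))

    // Append into a pre-created roll-up table that the presentation layer queries.
    daily.write.mode(SaveMode.Append).jdbc(url, "daily_event_rollup", props)

    spark.stop()
  }
}
```

Append is used here so the roll-up table can be created and distributed up front; letting Spark overwrite it would recreate it as a plain table.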