ozgune opened this issue 8 years ago (status: Open)
We heard requests for native Spark integration with Citus from two customers (+2).
I'm also copying/pasting a conversation from https://github.com/citusdata/citus_docs/issues/209 into this comment as feedback on Spark integration.
I'm using Citus together with Slurm and Spark (about 400 compute cores in total), so I have many parallel SELECT and INSERT/UPDATE queries and many concurrent transactions. Most of my tables are relatively small, but some have more than 100 million entries. I would like to stick with PostgreSQL for this data instead of using Cassandra or similar, because I need to perform spatial queries (for which I use PostGIS). My goal is to horizontally scale out INSERT and SELECT statements. It seems that I'm able to achieve this with Citus. However, I have to give up FOREIGN KEY constraints and thus suffer in terms of data integrity (e.g. I can no longer rely on ON DELETE CASCADE, but need to implement this logic on the application side, as sketched below).
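A minimal sketch of what that application-side cascade might look like, assuming a hypothetical orders / order_items schema keyed by order_id and plain JDBC (these names are illustrative, not from the thread): delete the dependent rows and then the parent row in a single transaction.

```scala
import java.sql.DriverManager

// Sketch of application-side "ON DELETE CASCADE" when the FK constraint is unavailable:
// delete child rows, then the parent row, inside one transaction.
// Table/column names (orders, order_items, order_id) and the JDBC URL are assumptions.
object CascadeDelete {
  def deleteOrder(jdbcUrl: String, orderId: Long): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      conn.setAutoCommit(false)

      val delItems = conn.prepareStatement("DELETE FROM order_items WHERE order_id = ?")
      delItems.setLong(1, orderId)
      delItems.executeUpdate()

      val delOrder = conn.prepareStatement("DELETE FROM orders WHERE order_id = ?")
      delOrder.setLong(1, orderId)
      delOrder.executeUpdate()

      conn.commit()
    } catch {
      case e: Throwable =>
        conn.rollback()
        throw e
    } finally {
      conn.close()
    }
  }
}
```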
Is this idea now collecting dust? I'm still looking for options for using Spark efficiently with Citus.
@dimon222, could you explain a bit what kind of integration you are looking for, and what you are trying to accomplish?
Here is a way to use Citus with Spark: https://github.com/koeninger/spark-citus
Is there any plan to create a custom Spark data source implementation? We currently use spark-citus, but it only works for writing data, not for reading.
It would be really cool if Spark could simply load the Citus shards into RDD partitions, do some transformations (maybe ML things), and then store the results back, possibly even as an upsert.
Has anyone tried to load Citus shards in parallel using Spark?
Has anyone explored this? We are also using Spark and are looking for the above feature, which would be of great value.
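For what it's worth, here is a rough sketch of one way the shard-parallel read could work, assuming a hash-distributed events(id, payload) table, a database named app, and direct JDBC access to the workers (all of these are assumptions, not an existing API): list the shard placements from the coordinator's pg_dist_shard / pg_dist_shard_placement metadata, then give each placement its own Spark partition that scans the physical shard table (tablename_shardid) on its worker.

```scala
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

object CitusShardRead {
  // One shard placement: which worker holds which shard.
  case class Placement(shardId: Long, host: String, port: Int)

  // Enumerate shard placements from the coordinator's Citus metadata.
  def placements(coordinatorUrl: String, table: String): Seq[Placement] = {
    val conn = DriverManager.getConnection(coordinatorUrl)
    try {
      val rs = conn.createStatement().executeQuery(
        s"""SELECT s.shardid, p.nodename, p.nodeport
            FROM pg_dist_shard s
            JOIN pg_dist_shard_placement p ON p.shardid = s.shardid
            WHERE s.logicalrelid = '$table'::regclass""")
      val out = scala.collection.mutable.ArrayBuffer.empty[Placement]
      while (rs.next()) out += Placement(rs.getLong(1), rs.getString(2), rs.getInt(3))
      out.toSeq
    } finally conn.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("citus-shard-read").getOrCreate()
    val shards = placements("jdbc:postgresql://coordinator:5432/app", "events")

    // One Spark partition per shard placement; each task connects to the worker
    // that holds the shard and scans the shard table directly (events_<shardid>).
    val rows = spark.sparkContext
      .parallelize(shards, shards.size)
      .flatMap { pl =>
        val conn = DriverManager.getConnection(s"jdbc:postgresql://${pl.host}:${pl.port}/app")
        try {
          val rs = conn.createStatement()
            .executeQuery(s"SELECT id, payload FROM events_${pl.shardId}")
          val buf = scala.collection.mutable.ArrayBuffer.empty[(Long, String)]
          while (rs.next()) buf += ((rs.getLong(1), rs.getString(2)))
          buf
        } finally conn.close()
      }

    println(s"rows read: ${rows.count()}")
    spark.stop()
  }
}
```

This sketch ignores shard placement replication and pushes no filters down to the workers; a proper Spark data source would handle both, but it hopefully illustrates the one-partition-per-shard read idea.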
We could consider writing a spark_fdw (foreign data wrapper) so that PostgreSQL can query data that lives in Spark.
Or we could build a tight integration between Spark and PostgreSQL / Citus. In this scenario, Spark manages distributed roll-ups and PostgreSQL acts as the presentation layer.
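As a sketch of that second scenario (URLs, table and column names are assumptions, not an existing integration): Spark reads the raw data over JDBC, computes a daily roll-up, and appends it to a table that PostgreSQL / Citus then serves to the presentation layer.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._

object DailyRollup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("daily-rollup").getOrCreate()

    val url = "jdbc:postgresql://coordinator:5432/app"
    val props = new java.util.Properties()
    props.setProperty("user", "app")
    props.setProperty("driver", "org.postgresql.Driver")

    // Raw events live in Citus; Spark pulls them over JDBC.
    val events = spark.read.jdbc(url, "events", props)

    // Spark does the heavy aggregation work (the distributed roll-up).
    val daily = events
      .groupBy(to_date(col("created_at")).as("day"), col("customer_id"))
      .agg(count("*").as("event_count"))

    // Append into a pre-created roll-up table that the presentation layer queries.
    daily.write.mode(SaveMode.Append).jdbc(url, "daily_event_rollup", props)

    spark.stop()
  }
}
```

Append is used here so the roll-up table can be created and distributed up front; letting Spark overwrite it would recreate it as a plain table.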