apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

GeoSpark in spark-shell and spark-sql cli #268

Closed voycey closed 1 year ago

voycey commented 6 years ago

As part of our pipelines we currently use Presto to handle geospatial queries on large datasets; this is a problem because Presto lacks fault tolerance. We want to evaluate GeoSpark for the same workload, but we can't see anything in the documentation about using it with spark-sql or spark-submit directly (i.e. without writing Scala programs to run the SQL).

We have tried passing the GeoSpark jars to spark-submit (as a spark-sql job), but we get "function not found" errors for all of the ST functions.

Is there any documentation on how to get this working? An ideal proof of it working would be being able to run a query such as the following in the spark-sql shell:

SELECT ST_Equals(ST_GeomFromText('LINESTRING(0 0, 10 10)'),
    ST_GeomFromText('LINESTRING(0 0, 5 5, 10 10)'));

GeoSpark version = 1.1.3

Apache Spark version = 2.2

jiayuasu commented 6 years ago

It is not possible to run GeoSpark in the spark-sql CLI, but you can run it in spark-shell.

As stated on the Spark website, "The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute queries input from the command line."

To run GeoSpark in spark-shell

Tested on GeoSpark 1.1.3 + Spark 2.1, 2.2, and 2.3.

Make sure the org.datasyslab:geospark-sql_2.2:1.1.3 Maven coordinate matches your Spark version:

./spark-shell --packages org.datasyslab:geospark:1.1.3,org.datasyslab:geospark-sql_2.2:1.1.3,org.datasyslab:geospark-viz:1.1.3 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrator=org.datasyslab.geosparkviz.core.Serde.GeoSparkVizKryoRegistrator

After entering the spark-shell, run the following lines to make sure you can play with GeoSparkSQL

// Register the GeoSparkSQL ST_* functions on the active SparkSession
import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator
GeoSparkSQLRegistrator.registerAll(spark);
// Parse a WKT point to confirm the functions are available
val df = spark.sql("select ST_GeomFromWKT(\"POINT (30 10)\") as col0")
df.show

You should be able to see something similar to this:

scala> import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator
import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator

scala> GeoSparkSQLRegistrator.registerAll(spark);
18/08/29 19:26:39 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

scala> val df = spark.sql("select ST_GeomFromWKT(\"POINT (30 10)\") as col0")
df: org.apache.spark.sql.DataFrame = [col0: geometry]

scala> df.show
+-------------+
|         col0|
+-------------+
|POINT (30 10)|
+-------------+

Check whether the GeoSpark custom serializer is set:

spark.conf.get("spark.serializer")
spark.conf.get("spark.kryo.registrator")

You should be able to see something similar to this:

scala> spark.conf.get("spark.serializer")
res2: String = org.apache.spark.serializer.KryoSerializer

scala> spark.conf.get("spark.kryo.registrator")
res3: String = org.datasyslab.geosparkviz.core.Serde.GeoSparkVizKryoRegistrator

voycey commented 6 years ago

I think in this case GeoSpark is missing out on a large group of people who want distributed and fault-tolerant SQL interfaces to their data.

Presto is almost perfect in its execution of geospatial joins and queries, but only for shorter ad-hoc / analytics queries. We process billions of rows joined against hundreds of millions of polygons, and unfortunately its lack of fault tolerance makes long-running jobs impractical. GeoSpark seemed like a perfect solution to this.

Not everyone needs (or wants) to use Scala or RDDs. Allowing simple SQL scripts to be sent to GeoSpark (through an interface such as SparkSQL) and processed would make this a much more flexible tool, and it would also mean it could be integrated with things like EMR and Dataproc pretty simply!

Are there any plans to support this in the future?

jiayuasu commented 6 years ago

I think this is easy to fix.

According to the spark-sql CLI source code, it uses

/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver

I just need to write a GeoSparkSQLCLIDriver similar to SparkSQLCLIDriver.

Add the following GeoSparkSQL registration call in GeoSparkSQLCLIDriver:

GeoSparkSQLRegistrator.registerAll(sparkSession);

Then run

spark-sql --class org.datasyslab.geosparksql.utils.GeoSparkSQLCLIDriver

I will try to do this over the weekend. Working on other stuff now.
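
As an illustration only, here is a rough sketch of the core idea (not the planned SparkSQLCLIDriver port; the object name GeoSparkSqlRunner and the jar name are hypothetical): a small driver that registers GeoSparkSQL on a SparkSession and executes the SQL statements passed as program arguments, so queries can be submitted through spark-submit without a job-specific program.

// Sketch only: names are placeholders, not part of GeoSpark itself
import org.apache.spark.sql.SparkSession
import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator

object GeoSparkSqlRunner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GeoSparkSqlRunner")
      .enableHiveSupport() // lets queries reference tables in the Hive metastore
      .getOrCreate()

    // Same registration call as in spark-shell: makes the ST_* functions available
    GeoSparkSQLRegistrator.registerAll(spark)

    // Run each SQL statement passed on the command line and print its result
    args.foreach(query => spark.sql(query).show(truncate = false))

    spark.stop()
  }
}

Packaged into a jar, this could be launched with something like spark-submit --class GeoSparkSqlRunner <jar> "SELECT ST_Equals(...)".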

voycey commented 6 years ago

This would be fantastic if you can get it working. We would be happy to do a write-up on our experiences; we have a pretty hefty workload we can test on it, and I am sure the Dataproc people will be excited by this development!

voycey commented 6 years ago

I have just looked a bit further into your comments about things you would need to do and have a few comments about our specific use case:

We use Google Dataproc for our Spark clusters, partly for its simplicity in cluster creation and job submission. We can do this via the API by specifying sparksql as the driver, but the UI looks as follows:

[image: Dataproc job submission UI] https://user-images.githubusercontent.com/1065098/44896164-95ef5a80-ad3a-11e8-95b9-f8746de934a9.png

Most importantly, we can submit the job via REST, telling it to use the SparkSQL drivers to execute the queries:

{
  "projectId": "digital2go-1186",
  "job": {
    "placement": {
      "clusterName": "cluster-mike-keep"
    },
    "reference": {
      "jobId": "job-8ad9f8c7"
    },
    "sparkSqlJob": {
      "jarFileUris": [
        "file:///home/dan/somejar.jar"
      ],
      "queryList": {
        "queries": [
          "SELECT ST_Equals(ST_GeomFromText('LINESTRING(0 0, 10 10)'),\n\tST_GeomFromText('LINESTRING(0 0, 5 5, 10 10)'));"
        ]
      }
    }
  }
}

So, if at all possible, GeoSpark should be loadable into the SparkSQL functionality of the cluster. Usually this is just a case of passing jars and registering the functions, but we didn't have much luck trying that in this case.

If this isn't possible, there is also the possibility of using spark-submit. We use Airflow as our job orchestration tool, which has options for both SparkSQL and spark-submit; both of these seem to be the preferred ways of running geospatial SQL queries directly against Hive data.

jiayuasu commented 6 years ago

I am pretty sure GeoSpark can work with spark-submit. Did you try GeoSparkTemplateProject? That is designed for spark-submit.

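For reference, a typical spark-submit invocation of such a packaged job would look roughly like this (the main class and jar name are placeholders, not the template project's actual names; the Kryo settings mirror the spark-shell example above):

./bin/spark-submit --class com.example.MyGeoSparkJob --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrator=org.datasyslab.geosparkviz.core.Serde.GeoSparkVizKryoRegistrator my-geospark-job-assembly.jar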

voycey commented 6 years ago

Yes, but that means building a jar and sending it to spark-submit; for many use cases those intermediate steps make it unfeasible to execute in a pipeline.

spark-submit can also just take an SQL query and run it as it would if it were spark-sql (I believe...). This is the use case that is interesting, as it cuts out any middle-man processing and build steps. It would mean GeoSpark could be a distributed SQL processing engine, which would be huge for the community, since there is currently nothing fault tolerant available for spatial queries.

voycey commented 6 years ago

Hi @jiayuasu, is there any update on this? We have a workflow that is ready to be run on this right now. Our Presto pipeline is currently handling it, but not being fault tolerant means we can't use preemptible / spot instances to run it, which we would very much like to be able to do :)

voycey commented 6 years ago

Yes spark-sql :)

On Fri, 14 Sep. 2018, 00:29 zongsi.zhang, notifications@github.com wrote:

@voycey https://github.com/voycey Hi, do you mean spark-sql? I believe spark-submit can't directly take sql to run.


netanel246 commented 4 years ago

Hi @jiayuasu, I'm trying to implement the GeoSparkSQLCLIDriver, but the file depends on a lot of other files (e.g. HiveThriftServer2, SparkSQLEnv, etc.).

Is there a better way other than copying them into the GeoSpark project?

jiayuasu commented 4 years ago

@netanel246 Well, I am also not sure about this... but you could make a PR or share your code with me, so I can take a look at it.