Esri / spatial-framework-for-hadoop

The Spatial Framework for Hadoop allows developers and data scientists to use the Hadoop data processing system for spatial data analysis.
Apache License 2.0
367 stars 159 forks source link

[help wanted, question] Inline nested / constructors function calls #157

Closed tomercagan closed 5 years ago

tomercagan commented 5 years ago

Hi,

First off, thanks for providing this framework - it is very comprehensive, well document and easy to get started with!

Version information: Hive 1.2.1000.2.6.2.0-205 Hadoop 2.7.3.2.6.2.0-205 (I believe it's install as part of HDP distributon hdp/2.6.2.0-205/) I clone the sample and using the jars from the lib folder so I assume it is latest.

I am trying to do spatial aggregation where I need to get the number of points within pre-defined areas.

For the sake of this discussion, lets assume I have an areas table:

id Int
name String
shape binary

To which I load geo-JSON data using the serdes provided.

CREATE EXTERNAL TABLE IF NOT EXISTS areas(id Int, name String, ShapeBoundary binary)                             
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.GeoJsonSerDe' 
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedGeoJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

LOAD DATA INPATH 'areas.geojson' OVERWRITE INTO TABLE areas;

Next, I have a points table:

id Int
name String
long Double
lat Double

This table is created as a results of a pyspark job with some custom algorithm.

What I am trying to do is aggregate to get the number of points per area*:

SELECT area_id, count(*) FROM(
    SELECT a.id area_id, p.id
    FROM areas a join points p
    WHERE ST_Contains(a.shape, ST_SetSRID(ST_Point(p.long, p.lat), 4326))
) as counts
GROUP BY area_id

`* This queries is a simplified version. In my full query, I am using binning in the join to narrow the search space. It works nicely, giving me a two fold improvement

This fails**. I've tried many variations but if all give me some variation on the log below. Generally, I get variations on

Caused by: java.lang.RuntimeException: Unhandled object type binary at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setOutputCol(VectorUDFAdaptor.java:334) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:210) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:154) at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:121) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:117)

sometimes with the following as well:

Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating returns true if _col3 contains munidb.st_point(_col22, _col23) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecordVector(ReduceRecordSource.java:390) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:232) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:266) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150) ** on some occasions it will work fine - maybe when the data is small or for fixed values (e.g. ST_Point(36.899,27.23452))

On the other hand, when I create the point in the table definition, i.e. I make another step / table which includes a point:

id Int
name String
long Double
lat Double
shape binary
CREATE TABLE points2 
AS SELECT id, name, long, lat, ST_SetSRID(ST_Point(p.long, p.lat), 4326) from points;

The following query works fine

SELECT area_id, count(*) FROM(
    SELECT a.id area_id, p.id
    FROM areas a join points2 p
    WHERE ST_Contains(a.shape, p.shape)
) as counts
GROUP BY area_id

Am I doing something wrong in the first approach or is it not OK to call the constructor functions inline?

error.txt

climbage commented 5 years ago

@tomercagan sorry for the delayed response - just wanted to let you know what I'm doing a bit of research to better understand your issue.

climbage commented 5 years ago

It looks like Hive is attempting to vectorize the binary output of st_point? Can you try setting hive.vectorized.execution.enabled to false and see if you get the same error?

randallwhitman commented 5 years ago

The sample in GIS-Tools-for-Hadoop in fact has a constructor in the where clause. Which goes to say, ST-Geometry constructor in a where clause should be fine, and has been done plenty of times.

tomercagan commented 5 years ago

@climbage - you nailed it - I was about to update - the issue indeed has to do with vectorization and setting hive.vectorized.execution.enabled to false does the trick!

@randallwhitman - you are correct but I was not able to get it to work so I was wondering about it...

Thanks for your time and comments!