[help wanted, question] Inline nested / constructors function calls

tomercagan commented 5 years ago

Hi,

First off, thanks for providing this framework - it is very comprehensive, well document and easy to get started with!

Version information: Hive 1.2.1000.2.6.2.0-205 Hadoop 2.7.3.2.6.2.0-205 (I believe it's install as part of HDP distributon hdp/2.6.2.0-205/) I clone the sample and using the jars from the lib folder so I assume it is latest.

I am trying to do spatial aggregation where I need to get the number of points within pre-defined areas.

For the sake of this discussion, lets assume I have an areas table:

id Int
name String
shape binary

To which I load geo-JSON data using the serdes provided.

CREATE EXTERNAL TABLE IF NOT EXISTS areas(id Int, name String, ShapeBoundary binary)                             
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.GeoJsonSerDe' 
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedGeoJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

LOAD DATA INPATH 'areas.geojson' OVERWRITE INTO TABLE areas;

Next, I have a points table:

id Int
name String
long Double
lat Double

This table is created as a results of a pyspark job with some custom algorithm.

What I am trying to do is aggregate to get the number of points per area*:

SELECT area_id, count(*) FROM(
    SELECT a.id area_id, p.id
    FROM areas a join points p
    WHERE ST_Contains(a.shape, ST_SetSRID(ST_Point(p.long, p.lat), 4326))
) as counts
GROUP BY area_id

`* This queries is a simplified version. In my full query, I am using binning in the join to narrow the search space. It works nicely, giving me a two fold improvement

This fails**. I've tried many variations but if all give me some variation on the log below. Generally, I get variations on

Caused by: java.lang.RuntimeException: Unhandled object type binary at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setOutputCol(VectorUDFAdaptor.java:334) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:210) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:154) at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:121) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:117)

sometimes with the following as well:

Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating returns true if _col3 contains munidb.st_point(_col22, _col23) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecordVector(ReduceRecordSource.java:390) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:232) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:266) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150) ** on some occasions it will work fine - maybe when the data is small or for fixed values (e.g. ST_Point(36.899,27.23452))

On the other hand, when I create the point in the table definition, i.e. I make another step / table which includes a point:

id Int
name String
long Double
lat Double
shape binary

CREATE TABLE points2 
AS SELECT id, name, long, lat, ST_SetSRID(ST_Point(p.long, p.lat), 4326) from points;

The following query works fine

SELECT area_id, count(*) FROM(
    SELECT a.id area_id, p.id
    FROM areas a join points2 p
    WHERE ST_Contains(a.shape, p.shape)
) as counts
GROUP BY area_id

Am I doing something wrong in the first approach or is it not OK to call the constructor functions inline?

error.txt

climbage commented 5 years ago

@tomercagan sorry for the delayed response - just wanted to let you know what I'm doing a bit of research to better understand your issue.

climbage commented 5 years ago

It looks like Hive is attempting to vectorize the binary output of st_point? Can you try setting hive.vectorized.execution.enabled to false and see if you get the same error?

randallwhitman commented 5 years ago

The sample in GIS-Tools-for-Hadoop in fact has a constructor in the where clause. Which goes to say, ST-Geometry constructor in a where clause should be fine, and has been done plenty of times.

tomercagan commented 5 years ago

@climbage - you nailed it - I was about to update - the issue indeed has to do with vectorization and setting hive.vectorized.execution.enabled to false does the trick!

@randallwhitman - you are correct but I was not able to get it to work so I was wondering about it...

Thanks for your time and comments!

Esri / spatial-framework-for-hadoop

[help wanted, question] Inline nested / constructors function calls #157