Esri / spatial-framework-for-hadoop

The Spatial Framework for Hadoop allows developers and data scientists to use the Hadoop data processing system for spatial data analysis.
Apache License 2.0

Using the spatial framework for hadoop with data stored in ORC files #85

Open dvonck opened 9 years ago

dvonck commented 9 years ago

Good Afternoon,

The ORC format allows for the efficient storage and retrieval of big data files. For more details see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC.

We have installed a Hadoop Cluster based on the Hortonworks Data Platform 2.2.6.0-2800.

When we work with CSV files in Hive we do not have any problems. When we use the ORC file format we get the following error.

ORC Problem

[hive@srv-hc10 ~]$ hive

hive> add jar esri-geometry-api-1.2.1.jar spatial-sdk-hive-1.0.3-SNAPSHOT.jar spatial-sdk-json-1.0.3-SNAPSHOT.jar;
Added [esri-geometry-api-1.2.1.jar, spatial-sdk-hive-1.0.3-SNAPSHOT.jar, spatial-sdk-json-1.0.3-SNAPSHOT.jar] to class path
Added resources: [esri-geometry-api-1.2.1.jar, spatial-sdk-hive-1.0.3-SNAPSHOT.jar, spatial-sdk-json-1.0.3-SNAPSHOT.jar]
hive> create temporary function ST_Bin as 'com.esri.hadoop.hive.ST_Bin';
OK
Time taken: 0.636 seconds
hive> create temporary function ST_BinEnvelope as 'com.esri.hadoop.hive.ST_BinEnvelope';
OK
Time taken: 0.014 seconds

hive> describe formatted xxxxxxx.events_orc;
OK

col_name             data_type    comment
vehicle_id           int
ignition             smallint
event_ts             bigint
event_description    string
longitude            double
latitude             double
altitude             string
speed                smallint
bearing              smallint
linear_g             double
lateral_g            double
trip_no              int

Detailed Table Information

Database: xxxxxxx
Owner: root
CreateTime: Thu Jun 18 22:41:42 SAST 2015
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://srv-hcm01.esri-southafrica.com:8020/apps/hive/warehouse/xxxxxxx.db/events_orc
Table Type: MANAGED_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE    false
    auto.purge               true
    comment                  xxxxxxx analysis table
    last_modified_by         root
    last_modified_time       1434727038
    numFiles                 62
    numRows                  -1
    orc.compress             SNAPPY
    rawDataSize              -1
    totalSize                1954173667
    transient_lastDdlTime    1434727038

Storage Information

SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 62
Bucket Columns: [vehicle_id]
Sort Columns: [Order(col:event_ts, order:1)]
Storage Desc Params:
    serialization.format    1
Time taken: 1.135 seconds, Fetched: 47 row(s)

hive> select ST_Bin(0.001, ST_Point(longitude, latitude)) as binvalue, count(*) as freq
      from xxxxxxx.events_orc
      where longitude is not null and latitude is not null and vehicle_id = 63962497
      group by ST_Bin(0.001, ST_Point(longitude, latitude));
Query ID = hive_20150623124949_0461acf6-46d8-41e4-99e1-6b62836abf6a
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.

Status: Running (Executing on YARN cluster with App id application_1434395264469_0091)


VERTICES     STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
Map 1        FAILED     68          0        0       68     153      67
Reducer 2    KILLED      8          0        0        8       0       8
VERTICES: 00/02 [>>--------------------------] 0% ELAPSED TIME: 23.79 s

Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1434395264469_0091_1_00, diagnostics=[Task failed, taskId=task_1434395264469_0091_1_00_000011, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.Error: Cannot allocate vector column for None
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:172)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Error: Cannot allocate vector column for None
    at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.allocateColumnVector(VectorizedRowBatchCtx.java:643)
    at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.addScratchColumnsToBatch(VectorizedRowBatchCtx.java:606)
    at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.createVectorizedRowBatch(VectorizedRowBatchCtx.java:339)
    at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:109)
    at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:49)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:58)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:33)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.createValue(TezGroupedSplitsInputFormat.java:141)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:150)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
    at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:609)
    at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:588)
    at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:140)
    at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:361)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:134)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
    ... 13 more
]
[TaskAttempt 1, TaskAttempt 2 and TaskAttempt 3 failed with the same error and stack trace as TaskAttempt 0.]
], Vertex failed as one or more tasks failed. failedTasks:1, Vertex vertex_1434395264469_0091_1_00 [Map 1] killed/failed due to:null]
Vertex killed, vertexName=Reducer 2, vertexId=vertex_1434395264469_0091_1_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex killed as other vertex failed. failedTasks:0, Vertex vertex_1434395264469_0091_1_01 [Reducer 2] killed/failed due to:null]
DAG failed due to vertex failure. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
hive>

Could you please investigate if this is viable.

Regards

Derck

climbage commented 9 years ago

Interesting. We will try to reproduce this. In the meantime, can you disable vectorization to try and get around the error?

set hive.vectorized.execution.enabled = false;

This may affect the performance of the queries.

dvonck commented 9 years ago

Hi Michael

Setting both set hive.vectorized.execution.enabled = false; and set hive.default.fileformat=TextFile; made the queries work. From the source code, it looks like ORC's vectorized reader does not know how to handle the spatial types in columns.

Looking at the code at http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.1.0/org/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatchCtx.java#VectorizedRowBatchCtx.allocateColumnVector%28java.lang.String%2Cint%29

private ColumnVector allocateColumnVector(String type, int defaultSize) {
  if (type.equalsIgnoreCase("double")) {
    return new DoubleColumnVector(defaultSize);
  } else if (VectorizationContext.isStringFamily(type)) {
    return new BytesColumnVector(defaultSize);
  } else if (VectorizationContext.decimalTypePattern.matcher(type).matches()) {
    int[] precisionScale = getScalePrecisionFromDecimalType(type);
    return new DecimalColumnVector(defaultSize, precisionScale[0], precisionScale[1]);
  } else if (type.equalsIgnoreCase("long") ||
      type.equalsIgnoreCase("date") ||
      type.equalsIgnoreCase("timestamp")) {
    return new LongColumnVector(defaultSize);
  } else {
    // Line 643 in the grepcode listing; the stack trace above points at
    // VectorizedRowBatchCtx.java:643. No branch matched the type name.
    throw new Error("Cannot allocate vector column for " + type);
  }
}
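To make the fall-through easier to see, here is a minimal standalone sketch of the same dispatch. It is not Hive code: the class name VectorTypeDispatchSketch, the method columnVectorFor, and the simplified isStringFamily and decimal pattern are assumptions made up for illustration, and the real VectorizationContext helpers may accept a different set of type names. It only shows that a type name such as "None", as reported in the error above, matches none of the branches.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class VectorTypeDispatchSketch {

    // Hypothetical stand-in for VectorizationContext.decimalTypePattern.
    private static final Pattern DECIMAL_TYPE =
            Pattern.compile("decimal.*", Pattern.CASE_INSENSITIVE);

    // Hypothetical stand-in for VectorizationContext.isStringFamily(type);
    // the real method may recognise other names as well.
    static boolean isStringFamily(String type) {
        String t = type.toLowerCase(Locale.ROOT);
        return t.equals("string") || t.startsWith("char") || t.startsWith("varchar");
    }

    // Returns the name of the column vector class the quoted method would
    // allocate, or throws the same Error when no branch matches.
    static String columnVectorFor(String type) {
        if (type.equalsIgnoreCase("double")) {
            return "DoubleColumnVector";
        } else if (isStringFamily(type)) {
            return "BytesColumnVector";
        } else if (DECIMAL_TYPE.matcher(type).matches()) {
            return "DecimalColumnVector";
        } else if (type.equalsIgnoreCase("long")
                || type.equalsIgnoreCase("date")
                || type.equalsIgnoreCase("timestamp")) {
            return "LongColumnVector";
        } else {
            throw new Error("Cannot allocate vector column for " + type);
        }
    }

    public static void main(String[] args) {
        // A recognised type name maps to a column vector class.
        System.out.println(columnVectorFor("double"));
        // "None" is the type name reported in the stack trace; it falls through.
        try {
            columnVectorFor("None");
        } catch (Error e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running main prints DoubleColumnVector followed by the same "Cannot allocate vector column for None" message seen in the failed Tez tasks.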

Thank you very much for your help.

You can close this issue.

Regards

Derck


krishnat2 commented 8 years ago

Hey, we are able to work with spatial data in ORC files.

I ran into the same problem as you. After some research I figured out that the Tez engine uses vectorization, which does not support the binary data type. When we compute ST_Point or ST_Polygon, the result is binary data, so disabling vectorization for just this step solves the problem.

ColeFerrier commented 8 years ago

I don't see the method called out above on master:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatchCtx.java

Do we think this is still a problem on Hive master?

It looks like it was changed in this commit:

https://github.com/apache/hive/commit/30f20e992e05754efc4b984030b01f0184e0359d

The code in

https://github.com/apache/hive/blame/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedBatchUtil.java

was then updated at some point to include binary support, or at least it appears that way.

randallwhitman commented 2 years ago

This may warrant a note in https://github.com/Esri/spatial-framework-for-hadoop/wiki/ST_Geometry-for-Hive-Compatibility-with-Hive-Versions