Nuttymoon / nifi-hive3streaming-fixed

A NiFi bundle containing a stable implementation of the PutHive3Streaming processor
Apache License 2.0
2 stars 1 forks source link

Timestamp and Dates insertion doesn't work when inserting into Hive transactional table stored as ORC #3

Open TheSeptembre opened 3 years ago

TheSeptembre commented 3 years ago

Hello, We're encountering some issues with the Hive3Streaming processor on Hive 3.0 where NiFi goes into OutOfMemory due to the known memory leaks. However, the Hive3StreamingFixed solved memory leaks (yay) but doesn't work with dates and timestamps format. Our tables have the following format :

PARTITIONED BY ( 
  `field1` string, 
  `field2` string, 
  `field3` string)
CLUSTERED BY (field4) 
INTO 32 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://path/to/storage'
TBLPROPERTIES (
  'transactional'='true', 'NO_AUTO_COMPACTION'='true');

The issue corrupts the whole table (not a valid ORC file when querying with SQL client) and it says in the logs:

org.apache.hive.streaming.StreamingIOFailure: Encountered errors while closing (see logs) [2020, 10, 9] writeIds[2,2]
        at org.apache.hive.streaming.AbstractRecordWriterFixed.close(AbstractRecordWriterFixed.java:391)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.closeImpl(HiveStreamingConnectionFixed.java:1006)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.close(HiveStreamingConnectionFixed.java:997)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.markDead(HiveStreamingConnectionFixed.java:860)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.write(HiveStreamingConnectionFixed.java:841)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed.write(HiveStreamingConnectionFixed.java:551)
        at org.apache.nifi.processors.hive.PutHive3StreamingFixed.onTrigger(PutHive3StreamingFixed.java:432)
        at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
        at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1162)
        at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:205)
        at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

We know the issue comes from timestamp and dates because if we remove them from the incoming Flowfile, it works without error and solves the memory leaks too.

Thank you for reading.

piedsvelus commented 3 years ago

We facing exactly the same issue with Hive 3.1.0 and NiFi 1.9.0.3.4

We validated with load tests that the PutHive3StreamingFixed processor fixes the memory leak but we failed to use it when the hive table contains timestamp or date colunms.

We discovered that the deltas files are created on the hdfs folders but SQL queries over the table fails, the partition seems corrupted

2020-10-09 15:03:13,986 INFO [Timer-Driven Process Thread-3] o.a.h.s.AbstractRecordWriterFixed Closing updater for partitions: [2020, 10, 9]
2020-10-09 15:03:13,986 ERROR [Timer-Driven Process Thread-3] o.a.h.s.AbstractRecordWriterFixed Unable to close org.apache.hadoop.hive.ql.io.orc.OrcRecordUpdater[hdfs://xxx/yyy/zzz/data/hive/www.db/ttt/year=2020/month=10/day=9/delta_0000002_0000002/bucket_00000] due to: null
java.lang.NullPointerException: null
        at java.lang.System.arraycopy(Native Method)
        at org.apache.hadoop.io.Text.set(Text.java:225)
        at org.apache.orc.impl.StringRedBlackTree.add(StringRedBlackTree.java:59)
        at org.apache.orc.impl.writer.StringTreeWriter.writeBatch(StringTreeWriter.java:70)
        at org.apache.orc.impl.writer.StructTreeWriter.writeFields(StructTreeWriter.java:64)
        at org.apache.orc.impl.writer.StructTreeWriter.writeBatch(StructTreeWriter.java:78)
        at org.apache.orc.impl.writer.StructTreeWriter.writeRootBatch(StructTreeWriter.java:56)
        at org.apache.orc.impl.WriterImpl.addRowBatch(WriterImpl.java:556)
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushInternalBatch(WriterImpl.java:297)
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:334)
        at org.apache.hadoop.hive.ql.io.orc.OrcRecordUpdater.close(OrcRecordUpdater.java:557)
        at org.apache.hive.streaming.AbstractRecordWriterFixed.close(AbstractRecordWriterFixed.java:368)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.closeImpl(HiveStreamingConnectionFixed.java:1006)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.close(HiveStreamingConnectionFixed.java:997)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.markDead(HiveStreamingConnectionFixed.java:860)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.write(HiveStreamingConnectionFixed.java:841)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed.write(HiveStreamingConnectionFixed.java:551)
        at org.apache.nifi.processors.hive.PutHive3StreamingFixed.onTrigger(PutHive3StreamingFixed.java:432)
        at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
        at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1162)
        at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:205)
        at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2020-10-09 15:03:14,007 ERROR [Timer-Driven Process Thread-3] o.a.h.s.HiveStreamingConnectionFixed Fatal error on TxnId/WriteIds=[1670528/2...1670528/2] on connection = { metaStoreUri: thrift://blabla, database: www, table: ttt };  TxnStatus[A] LastUsed txnid:1670528; cause Encountered errors while closing (see logs) [2020, 10, 9] writeIds[2,2]
org.apache.hive.streaming.StreamingIOFailure: Encountered errors while closing (see logs) [2020, 10, 9] writeIds[2,2]
        at org.apache.hive.streaming.AbstractRecordWriterFixed.close(AbstractRecordWriterFixed.java:391)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.closeImpl(HiveStreamingConnectionFixed.java:1006)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.close(HiveStreamingConnectionFixed.java:997)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.markDead(HiveStreamingConnectionFixed.java:860)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed$TransactionBatch.write(HiveStreamingConnectionFixed.java:841)
        at org.apache.hive.streaming.HiveStreamingConnectionFixed.write(HiveStreamingConnectionFixed.java:551)
        at org.apache.nifi.processors.hive.PutHive3StreamingFixed.onTrigger(PutHive3StreamingFixed.java:432)
        at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
        at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1162)
        at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:205)
        at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2020-10-09 15:03:14,007 INFO [Timer-Driven Process Thread-3] o.a.h.hive.metastore.HiveMetaStoreClient Closed a connection to metastore, current connections: 1
2020-10-09 15:03:14,007 INFO [Timer-Driven Process Thread-3] o.a.h.hive.metastore.HiveMetaStoreClient Closed a connection to metastore, current connections: 0
2020-10-09 15:03:14,007 INFO [Timer-Driven Process Thread-3] o.a.h.s.HiveStreamingConnectionFixed Closed streaming connection. Agent: NiFi PutHive3StreamingFixed [0d4a737f-0175-1000-ffff-ffffb6b38fee] thread 100[Timer-Driven Process Thread-3] Stats: {records-written: 0, records-size: 0, committed-transactions: 0, aborted-transactions: 0, auto-flushes: 0, metastore-calls: 8 }
2020-10-09 15:03:15,971 INFO [NiFi Web Server-17785] o.a.n.c.s.StandardProcessScheduler Stopping PutHive3StreamingFixed[id=0d4a737f-0175-1000-ffff-ffffb6b38fee]
2020-10-09 15:03:15,971 INFO [NiFi Web Server-17785] o.a.n.controller.StandardProcessorNode Stopping processor: class org.apache.nifi.processors.hive.PutHive3StreamingFixed
2020-10-09 15:03:15,971 INFO [Timer-Driven Process Thread-6] o.a.n.c.s.TimerDrivenSchedulingAgent Stopped scheduling PutHive3StreamingFixed[id=0d4a737f-0175-1000-ffff-ffffb6b38fee] to run