RADAR-base / radar-output-restructure

Reads Avro files in HDFS and outputs JSON or CSV per topic per user in the local file system
Apache License 2.0

Records not extracting from specific topics (partitions?) #6

Closed: sboettcher closed this issue 7 years ago

sboettcher commented 7 years ago

I am having a weird case of data loss at the moment concerning the extraction tool. Using one patient as an example: the tool gives me CSV files for the topic android_empatica_e4_acceleration (showing just the last 3):

...
-rw-r--r--  1 root root 9,9M 05.08.2017 12:47 20170803_1600.csv
-rw-r--r--  1 root root 9,9M 05.08.2017 12:57 20170803_1700.csv
-rw-r--r--  1 root root 2,7M 05.08.2017 13:07 20170803_1800.csv

The problem is that data from the following night is missing. The recording lasted until the next morning, which is also reflected in other Empatica topics, e.g. android_empatica_e4_electrodermal_activity:

...
-rw-r--r--  1 root root  1,1M 04.08.2017 19:42 20170804_0400.csv
-rw-r--r--  1 root root  1,1M 04.08.2017 19:42 20170804_0500.csv
-rw-r--r--  1 root root  769K 04.08.2017 19:42 20170804_0600.csv

This is currently the case for the acceleration and BVP topics.

I know that the missing data at least went through Kafka, since the aggregated data for that timeframe is in the hot storage:

[Screenshot from 2017-08-05 13:56:28 showing data from the REST API call get_samples_json for that particular recording. Note: the date/time x-axis is CEST, not UTC.]

Something I noticed in my logfiles from the tool: I have several BlockMissingExceptions from those two topics:

$ grep -Hin blockmissingexception restructure_log*
restructure_log_01-08-2017_10-59-03:853493:2017-08-01 08:59:01 WARN  DFSClient:981 - Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073809137_68313 file=/topicAndroidNew/android_empatica_e4_blood_volume_pulse/partition=2/android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro No live nodes contain current block Block locations: DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK] Dead nodes:  DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK]. Throwing a BlockMissingException
restructure_log_01-08-2017_10-59-03:853494:2017-08-01 08:59:01 WARN  DFSClient:981 - Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073809137_68313 file=/topicAndroidNew/android_empatica_e4_blood_volume_pulse/partition=2/android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro No live nodes contain current block Block locations: DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK] Dead nodes:  DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK]. Throwing a BlockMissingException
restructure_log_01-08-2017_10-59-03:853496:org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073809137_68313 file=/topicAndroidNew/android_empatica_e4_blood_volume_pulse/partition=2/android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro
restructure_log_01-08-2017_10-59-03:853520:Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073809137_68313 file=/topicAndroidNew/android_empatica_e4_blood_volume_pulse/partition=2/android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro
restructure_log_02-08-2017_10-45-24:911759:2017-08-02 08:45:18 WARN  DFSClient:981 - Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073938426_197602 file=/topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0006341124+0006341273.avro No live nodes contain current block Block locations: DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK] DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] Dead nodes:  DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK]. Throwing a BlockMissingException
restructure_log_02-08-2017_10-45-24:911760:2017-08-02 08:45:18 WARN  DFSClient:981 - Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073938426_197602 file=/topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0006341124+0006341273.avro No live nodes contain current block Block locations: DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK] DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] Dead nodes:  DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK]. Throwing a BlockMissingException
restructure_log_02-08-2017_10-45-24:911762:org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073938426_197602 file=/topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0006341124+0006341273.avro
restructure_log_02-08-2017_10-45-24:911786:Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073938426_197602 file=/topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0006341124+0006341273.avro

I have no idea whether that could be the source of the problem, though. Maybe a particular partition, which contains the missing data, is not being extracted?
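Something like the following should show whether HDFS itself reports those blocks as missing or corrupt (standard Hadoop tooling; output details vary by version):

hdfs fsck /topicAndroidNew -list-corruptfileblocks
hdfs fsck /topicAndroidNew/android_empatica_e4_blood_volume_pulse -files -blocks -locations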

I am using @fnobilia's script from RADAR-Docker, and have DEBUG enabled in log4j.properties.

sboettcher commented 7 years ago

I had another look into the HDFS file system on the server and I am pretty sure that the missing data is actually in HDFS. With hdfs dfs -ls I found .avro files in two partitions that were written during that night. I picked one of the files and looked into it with hdfs dfs -cat, and I can see the patient ID from that specific recording where data is missing (the actual data is binary, obviously). Is there a way to restart the extraction from scratch for just one topic/ID? Maybe by manipulating bins.csv and/or offsets.csv? (I suppose just deleting the whole topic's folder would work too, but then everything would have to be extracted again.)
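For reference, roughly the commands used here (the file name placeholder is illustrative):

hdfs dfs -ls /topicAndroidNew/android_empatica_e4_acceleration/partition=2/
hdfs dfs -cat /topicAndroidNew/android_empatica_e4_acceleration/partition=2/<file>.avro | less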

Also, I did not really find any suspicious log entries from that time in the docker-compose logs for hdfs-datanode-1, hdfs-datanode-2, hdfs-namenode, or radar-hdfs-connector.

MaximMoinat commented 7 years ago

To restart the extraction for a portion of the data, you can indeed manipulate offsets.csv. Just remove all lines for the acceleration topic (or just the offsets that gave an exception) and run the script again.
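For example, something along these lines (a sketch, assuming each offsets.csv line names the topic its file belongs to; back up the file first):

cp offsets.csv offsets.csv.bak
grep -v android_empatica_e4_acceleration offsets.csv.bak > offsets.csv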

Could you try to convert one of the files that gave a BlockMissingException to JSON with avro-tools?

hadoop fs -get /topicAndroidNew/android_empatica_e4_blood_volume_pulse/partition=2/android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro

java -jar /usr/share/avro-tools/avro-tools-1.7.7.jar tojson android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro

sboettcher commented 7 years ago

tojson works without an error message on that file. I will now try to just rerun the extraction on the topic.

sboettcher commented 7 years ago

Edit: I think this is a separate issue; I submitted it here: https://github.com/RADAR-CNS/RADAR-HDFS-Sink-Connector/issues/4

Hm, so I saw that radar-hdfs-connector had now stopped with an exception:

radar-hdfs-connector_1     | 2017-08-05T12:54:01.133005441Z [2017-08-05 12:54:01,132] ERROR Task radar-hdfs-sink-android-15000-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133085075Z java.lang.NullPointerException
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133092941Z   at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133100828Z   at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133108550Z   at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133115345Z   at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133121594Z   at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133129616Z   at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133137065Z   at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133143866Z   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133155380Z   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133161875Z   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133169393Z   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133177736Z   at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133184102Z [2017-08-05 12:54:01,132] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)

So I restarted the container, and got a bunch of exceptions, ending again with that same one:


radar-hdfs-connector_1     | 2017-08-08T10:58:17.658855180Z Exception in thread "kafka-coordinator-heartbeat-thread | connect-radar-hdfs-sink-android-15000" Exception in thread "LeaseRenewer:root@hdfs-namenode:8020" Exception in thread "kafka-coordinator-heartbeat-thread | connect-radar-hdfs-sink-android-15000" java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.660211421Z     at java.util.HashMap$EntrySet.iterator(HashMap.java:1013)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.660228131Z     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.trySend(ConsumerNetworkClient.java:386)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.660237971Z     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:223)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.660330193Z     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.660342490Z     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:865)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.661818388Z java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.661917082Z [2017-08-08 10:58:17,659] ERROR Task radar-hdfs-sink-android-15000-3 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.661930189Z java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662053284Z     at java.util.ArrayList.iterator(ArrayList.java:834)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662067477Z     at org.apache.hadoop.hdfs.LeaseRenewer.clientsRunning(LeaseRenewer.java:241)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662076405Z     at org.apache.hadoop.hdfs.LeaseRenewer.access$500(LeaseRenewer.java:71)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662530537Z [2017-08-08 10:58:17,660] ERROR Task radar-hdfs-sink-android-15000-2 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662543689Z java.lang.NullPointerException
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662551642Z     at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662558950Z     at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662566144Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662573083Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662580007Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662587186Z     at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662593930Z     at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662600773Z     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662607643Z     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662614597Z     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662622157Z     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662640133Z     at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663100448Z [2017-08-08 10:58:17,660] WARN Unexpected error reading responses on connection Thread[IPC Client (248329120) connection to hdfs-namenode/172.18.0.3:8020 from root,5,main] (org.apache.hadoop.ipc.Client)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663114053Z java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663122000Z     at java.lang.Integer.valueOf(Integer.java:832)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663129185Z     at sun.nio.ch.EPollSelectorImpl.updateSelectedKeys(EPollSelectorImpl.java:120)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663136544Z     at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:98)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663143873Z     at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663151310Z     at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663158981Z     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663166127Z     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663173755Z     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663181039Z     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663188887Z     at java.io.FilterInputStream.read(FilterInputStream.java:133)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663196501Z     at java.io.FilterInputStream.read(FilterInputStream.java:133)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663203539Z     at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663210703Z     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663217825Z     at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663224860Z     at java.io.DataInputStream.readInt(DataInputStream.java:387)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663232252Z     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663239095Z     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663245740Z [2017-08-08 10:58:17,662] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663252919Z [2017-08-08 10:58:17,661] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.664754302Z     at org.apache.hadoop.hdfs.LeaseRenewer$Factory.remove(LeaseRenewer.java:154)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.664857528Z     at org.apache.hadoop.hdfs.LeaseRenewer$Factory.access$800(LeaseRenewer.java:90)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.664878000Z     at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:312)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.664966845Z     at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.665451043Z java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674685502Z [2017-08-08 10:58:17,672] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674757976Z org.apache.kafka.connect.errors.ConnectException: java.io.IOException: Failed on local exception: java.io.IOException: Error reading responses; Host Details : local host is: "6d3d8843d9b9/172.18.0.5"; destination host is: "hdfs-namenode":8020; 
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674775466Z     at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:131)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674786782Z     at io.confluent.connect.hdfs.TopicPartitionWriter.applyWAL(TopicPartitionWriter.java:484)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674797842Z     at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:212)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674809804Z     at io.confluent.connect.hdfs.DataWriter.recover(DataWriter.java:239)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674818974Z     at io.confluent.connect.hdfs.DataWriter.open(DataWriter.java:281)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674828847Z     at io.confluent.connect.hdfs.HdfsSinkTask.open(HdfsSinkTask.java:116)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674838684Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:431)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674848146Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:55)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674904030Z     at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:467)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674915370Z     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674925666Z     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674935125Z     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674944293Z     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674953437Z     at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674962779Z     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674972112Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:317)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674987616Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:235)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675016059Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675031307Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675040607Z     at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675049323Z     at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675057743Z     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675082478Z     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675091712Z     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675100571Z     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675115445Z     at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675481141Z Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Error reading responses; Host Details : local host is: "6d3d8843d9b9/172.18.0.5"; destination host is: "hdfs-namenode":8020; 
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675522787Z     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675531882Z     at org.apache.hadoop.ipc.Client.call(Client.java:1473)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675540120Z     at org.apache.hadoop.ipc.Client.call(Client.java:1400)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675548463Z     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675557739Z     at com.sun.proxy.$Proxy47.getFileInfo(Unknown Source)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675566079Z     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675574527Z     at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675582693Z     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675590784Z     at java.lang.reflect.Method.invoke(Method.java:498)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675598809Z     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675607116Z     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675616695Z     at com.sun.proxy.$Proxy48.getFileInfo(Unknown Source)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675624462Z     at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675632361Z     at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675640551Z     at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675648690Z     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675669801Z     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675678825Z     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675687171Z     at io.confluent.connect.hdfs.storage.HdfsStorage.exists(HdfsStorage.java:66)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675695703Z     at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:120)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675703914Z     ... 25 more
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675733556Z Caused by: java.io.IOException: Error reading responses
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675742644Z     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:974)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675750558Z Caused by: java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675758148Z     at java.lang.Integer.valueOf(Integer.java:832)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675766979Z     at sun.nio.ch.EPollSelectorImpl.updateSelectedKeys(EPollSelectorImpl.java:120)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675826511Z     at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:98)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675846692Z     at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675854309Z     at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675861362Z     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675868688Z     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675875810Z     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675882983Z     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675890033Z     at java.io.FilterInputStream.read(FilterInputStream.java:133)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675896759Z     at java.io.FilterInputStream.read(FilterInputStream.java:133)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675912188Z     at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675919586Z     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675926531Z     at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675933227Z     at java.io.DataInputStream.readInt(DataInputStream.java:387)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675940313Z     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675947123Z     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675953636Z [2017-08-08 10:58:17,674] INFO Started recovery for topic partition application_uptime-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.759939024Z [2017-08-08 10:58:17,759] INFO Discovered coordinator kafka-2:9092 (id: 2147483645 rack: null) for group connect-radar-hdfs-sink-android-15000. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
radar-hdfs-connector_1     | 2017-08-08T10:58:18.790960667Z [2017-08-08 10:58:18,790] INFO Successfully acquired lease for hdfs://hdfs-namenode:8020/logs/application_uptime/0/log (io.confluent.connect.hdfs.wal.FSWAL)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222620804Z [2017-08-08 10:58:21,222] WARN DataStreamer Exception (org.apache.hadoop.hdfs.DFSClient)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222711059Z java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.18.0.4:50010, 172.18.0.2:50010], original=[172.18.0.4:50010, 172.18.0.2:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222761069Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222770081Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222778601Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222786062Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.223887738Z [2017-08-08 10:58:21,222] ERROR Recovery failed at state WAL_APPLIED (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224044328Z org.apache.kafka.connect.errors.ConnectException: Error closing hdfs://hdfs-namenode:8020/logs/application_uptime/0/log
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224213471Z     at io.confluent.connect.hdfs.wal.FSWAL.close(FSWAL.java:161)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224240225Z     at io.confluent.connect.hdfs.wal.FSWAL.truncate(FSWAL.java:143)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224315470Z     at io.confluent.connect.hdfs.TopicPartitionWriter.truncateWAL(TopicPartitionWriter.java:490)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224339287Z     at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:215)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224405665Z     at io.confluent.connect.hdfs.DataWriter.recover(DataWriter.java:239)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224429416Z     at io.confluent.connect.hdfs.DataWriter.open(DataWriter.java:281)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224496283Z     at io.confluent.connect.hdfs.HdfsSinkTask.open(HdfsSinkTask.java:116)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224517864Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:431)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224668545Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:55)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224702481Z     at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:467)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224825771Z     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224863913Z     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224968463Z     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225007628Z     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225115954Z     at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225144781Z     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225238780Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:317)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225334044Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:235)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225385164Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225481965Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225539996Z     at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225635045Z     at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225662942Z     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225752911Z     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225781015Z     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225889033Z     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225918844Z     at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226067886Z Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.18.0.4:50010, 172.18.0.2:50010], original=[172.18.0.4:50010, 172.18.0.2:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226090089Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226098765Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226105605Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226112853Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226119549Z [2017-08-08 10:58:21,223] INFO Started recovery for topic partition android_biovotion_battery_state-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.227944280Z [2017-08-08 10:58:21,227] INFO Finished recovery for topic partition android_biovotion_battery_state-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.227960824Z [2017-08-08 10:58:21,227] INFO Started recovery for topic partition android_biovotion_spo2-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.242479443Z [2017-08-08 10:58:21,242] INFO Finished recovery for topic partition android_biovotion_spo2-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.242519511Z [2017-08-08 10:58:21,242] INFO Started recovery for topic partition android_empatica_e4_blood_volume_pulse-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594051287Z [2017-08-08 10:58:43,593] ERROR Task radar-hdfs-sink-android-15000-1 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594178139Z java.lang.NullPointerException
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594321199Z     at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594369256Z     at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594435121Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594458350Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594512221Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594533581Z     at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594585291Z     at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594606048Z     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594667725Z     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594689178Z     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594744402Z     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594765654Z     at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594815858Z [2017-08-08 10:58:43,593] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.680815652Z Exception in thread "LeaseRenewer:root@hdfs-namenode:8020" java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684064012Z [2017-08-08 10:59:09,683] WARN DataStreamer Exception (org.apache.hadoop.hdfs.DFSClient)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684121728Z java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.18.0.2:50010, 172.18.0.4:50010], original=[172.18.0.2:50010, 172.18.0.4:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684134622Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684142125Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684149445Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684156588Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685040766Z [2017-08-08 10:59:09,684] ERROR Error closing hdfs://hdfs-namenode:8020/logs/android_empatica_e4_acceleration/0/log. (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685060542Z org.apache.kafka.connect.errors.ConnectException: Error closing hdfs://hdfs-namenode:8020/logs/android_empatica_e4_acceleration/0/log
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685069086Z     at io.confluent.connect.hdfs.wal.FSWAL.close(FSWAL.java:161)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685096253Z     at io.confluent.connect.hdfs.TopicPartitionWriter.close(TopicPartitionWriter.java:372)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685104128Z     at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685114217Z     at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685121639Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685128947Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685135976Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685143131Z     at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685150382Z     at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685156655Z     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685163099Z     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685169246Z     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685175843Z     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685182428Z     at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685188941Z Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.18.0.2:50010, 172.18.0.4:50010], original=[172.18.0.2:50010, 172.18.0.4:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685197143Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685203794Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685211285Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685218128Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685886612Z [2017-08-08 10:59:09,685] ERROR Error closing writer for android_empatica_e4_acceleration-0. Error: Error closing writer: Error closing hdfs://hdfs-namenode:8020/logs/android_empatica_e4_acceleration/0/log
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685900880Z  (io.confluent.connect.hdfs.DataWriter)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686436108Z [2017-08-08 10:59:09,685] ERROR Error closing hdfs://hdfs-namenode:8020/logs/application_uptime/0/log. (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686449148Z org.apache.kafka.connect.errors.ConnectException: Error closing hdfs://hdfs-namenode:8020/logs/application_uptime/0/log
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686457044Z     at io.confluent.connect.hdfs.wal.FSWAL.close(FSWAL.java:161)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686470374Z     at io.confluent.connect.hdfs.TopicPartitionWriter.close(TopicPartitionWriter.java:372)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686477672Z     at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686484204Z     at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686490698Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686497253Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686503670Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686510252Z     at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686516660Z     at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686523122Z     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686529473Z     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686535836Z     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686542115Z     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686548634Z     at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686554892Z Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.18.0.4:50010, 172.18.0.2:50010], original=[172.18.0.4:50010, 172.18.0.2:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686563037Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686569484Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686577152Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686584046Z     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686590655Z [2017-08-08 10:59:09,686] ERROR Error closing writer for application_uptime-0. Error: Error closing writer: Error closing hdfs://hdfs-namenode:8020/logs/application_uptime/0/log
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686597558Z  (io.confluent.connect.hdfs.DataWriter)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687585226Z [2017-08-08 10:59:09,687] ERROR Task radar-hdfs-sink-android-15000-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687599056Z java.lang.NullPointerException
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687606712Z     at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687613465Z     at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687626278Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687633619Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687640233Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687646783Z     at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687653293Z     at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687659713Z     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687666231Z     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687672577Z     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687679296Z     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687685641Z     at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687691950Z [2017-08-08 10:59:09,687] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:59:10.190611936Z [2017-08-08 10:59:10,190] INFO Reflections took 103976 ms to scan 265 urls, producing 12687 keys and 83909 values  (org.reflections.Reflections)

Looks to me like a memory issue? Could this be related, or is this a different issue?
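If the Connect worker is simply running out of heap, one thing to try is raising the JVM heap for the connector container, e.g. in docker-compose.yml (a sketch; whether the image honors KAFKA_HEAP_OPTS depends on its base image, and the values are illustrative):

    environment:
      KAFKA_HEAP_OPTS: "-Xms512m -Xmx4g"

Separately, the repeated "Failed to replace a bad datanode" errors point at the client-side replacement policy quoted in the log itself; with only two datanodes it can be relaxed in the client's hdfs-site.xml (standard Hadoop property):

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>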

sboettcher commented 7 years ago

OK, I am now thoroughly confused. I had another, closer look at the files in HDFS, specifically the newest ones for the acceleration topic (since there are new ones now, after I restarted the container), and converted them to JSON as above. I noticed that the timestamps are in the past, actually from later that night, after the supposedly missing data starts.

See below: a newly created Avro file from today, the last JSON record in that file, and the timestamp of that record:

hdfs: -rw-r--r--   3 root supergroup       9500 2017-08-08 16:06 /topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0021130772+0021130921.avro
json: {"key":{"userId":"UKLFRB2","sourceId":"00:07:80:F2:1A:43"},"value":{"time":1.5017942087017004E9,"timeReceived":1.501794206716E9,"x":-0.8125,"y":-0.59375,"z":-0.03125}}
timeReceived: 08/03/2017 @ 9:03pm (UTC)
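
For reference, the human-readable timestamp comes from converting the epoch seconds, e.g. with GNU date:

date -u -d @1501794206    # timeReceived 1.501794206716E9 -> Thu Aug  3 21:03:26 UTC 2017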

Same thing for one of the files I mentioned in an earlier comment (https://github.com/RADAR-CNS/Restructure-HDFS-topic/issues/6#issuecomment-320730406):

hdfs: -rw-r--r--   3 root supergroup       9500 2017-08-04 01:25 /topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0014425524+0014425673.avro
json: {"key":{"userId":"UKLFRB2","sourceId":"00:07:80:F2:1A:43"},"value":{"time":1.501706715380344E9,"timeReceived":1.501706712389E9,"x":-0.015625,"y":-0.328125,"z":0.921875}}
timeReceived: 08/02/2017 @ 8:45pm (UTC)

So to me it currently looks like the connector is only slowly catching up on the data now, almost as if the data were being collected right now. Is it normal for the connector to take this long to go through Kafka and store the data? I have now disconnected all Empaticas again and the connector is still committing data. Will see tomorrow if it finishes. My working theory is that somewhere along the line there was a problem with one of the partitions, which was now "fixed" by restarting the connector, and it is now trying to catch up...
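One way to check this catch-up theory is to look at the consumer lag of the sink's group (the group name and broker address are taken from the connector log above; kafka-consumer-groups.sh ships with Kafka):

kafka-consumer-groups.sh --bootstrap-server kafka-2:9092 --describe --group connect-radar-hdfs-sink-android-15000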

blootsvoets commented 7 years ago

Yes, Kafka will leave all data in its log for a while, to allow consumers to catch up if they get behind. If they fall too far behind, though, Kafka will remove the data altogether. The default time for this is 168 hours = 1 week (see the Apache Kafka configuration docs, log.retention.<x>). So if a connector falls structurally behind (by more than a week), you may lose data.
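For reference, a sketch of where this is configured: the broker-wide default lives in server.properties, and there is a per-topic override that can also be raised after the fact (the ZooKeeper host name and values here are illustrative):

# server.properties (broker default): 168 h = 1 week
log.retention.hours=168

# per-topic override, e.g. extend retention to two weeks (1209600000 ms):
kafka-configs.sh --zookeeper zookeeper-1:2181 --alter --entity-type topics --entity-name android_empatica_e4_acceleration --add-config retention.ms=1209600000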

sboettcher commented 7 years ago

Since the retention is set to 1 week, I need to get this data very soon then, because the missing section of data is from almost a week ago. Or is there a way to change the retention time after the fact?

So this morning a bit more data had been written, but still not everything that is missing. Instead, the connector now gives exceptions like this, even after restarting the container:

radar-hdfs-connector_1     | 2017-08-09T08:05:06.103581674Z [2017-08-09 08:05:06,102] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103679462Z org.apache.kafka.connect.errors.ConnectException: Cannot acquire lease after timeout, will retry.
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103697513Z     at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:95)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103707956Z     at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:105)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103716755Z     at io.confluent.connect.hdfs.TopicPartitionWriter.applyWAL(TopicPartitionWriter.java:484)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103725341Z     at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:212)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103734395Z     at io.confluent.connect.hdfs.DataWriter.recover(DataWriter.java:239)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103743716Z     at io.confluent.connect.hdfs.DataWriter.open(DataWriter.java:281)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103753421Z     at io.confluent.connect.hdfs.HdfsSinkTask.open(HdfsSinkTask.java:116)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103762031Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:431)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103770955Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:55)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103779925Z     at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:467)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103789981Z     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103799897Z     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104201839Z     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104237809Z     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104316058Z     at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104330866Z     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104341432Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:317)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104350723Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:235)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104359762Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104369218Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104381668Z     at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104391196Z     at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104400040Z     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104409157Z     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104417605Z     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104434756Z     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104444413Z     at java.lang.Thread.run(Thread.java:745)
and others like this:

radar-hdfs-connector_1     | 2017-08-09T08:05:39.180107105Z [2017-08-09 08:05:39,179] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180164855Z org.apache.kafka.connect.errors.ConnectException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.RecoveryInProgressException): Failed to APPEND_FILE /logs/application_record_counts/1/log for DFSClient_NONMAPREDUCE_-1505132056_27 on 172.18.0.5 because lease recovery is in progress. Try again later.
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180202772Z     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2918)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180213965Z     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2683)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180223592Z     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2982)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180253706Z     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2950)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180263546Z     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:654)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180272330Z     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:421)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180281153Z     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180299276Z     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180309263Z     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180328643Z     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180338174Z     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180346589Z     at java.security.AccessController.doPrivileged(Native Method)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180355016Z     at javax.security.auth.Subject.doAs(Subject.java:422)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180363329Z     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180371870Z     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180380057Z 
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180387965Z     at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:88)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180420971Z     at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:105)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180430898Z     at io.confluent.connect.hdfs.TopicPartitionWriter.applyWAL(TopicPartitionWriter.java:484)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180443378Z     at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:212)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180453374Z     at io.confluent.connect.hdfs.DataWriter.recover(DataWriter.java:239)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180462566Z     at io.confluent.connect.hdfs.DataWriter.open(DataWriter.java:281)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180482966Z     at io.confluent.connect.hdfs.HdfsSinkTask.open(HdfsSinkTask.java:116)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180492953Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:431)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180501740Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:55)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180510326Z     at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:467)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180526073Z     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180555500Z     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180566039Z     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180574784Z     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180583469Z     at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180592160Z     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180600623Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:317)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180609158Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:235)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180617246Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180635710Z     at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180644812Z     at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180653678Z     at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180661850Z     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180670274Z     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180678266Z     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180689594Z     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180707444Z     at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180718316Z Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.RecoveryInProgressException): Failed to APPEND_FILE /logs/application_record_counts/1/log for DFSClient_NONMAPREDUCE_-1505132056_27 on 172.18.0.5 because lease recovery is in progress. Try again later.
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180728378Z     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2918)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180736994Z     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2683)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180745568Z     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2982)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180754778Z     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2950)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180763072Z     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:654)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180771494Z     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:421)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180796772Z     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180806612Z     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180815206Z     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180823400Z     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180831590Z     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180839768Z     at java.security.AccessController.doPrivileged(Native Method)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180847622Z     at javax.security.auth.Subject.doAs(Subject.java:422)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180877268Z     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180886388Z     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180894530Z 
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180902300Z     at org.apache.hadoop.ipc.Client.call(Client.java:1469)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180910624Z     at org.apache.hadoop.ipc.Client.call(Client.java:1400)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180918632Z     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180937099Z     at com.sun.proxy.$Proxy47.append(Unknown Source)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180946318Z     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.append(ClientNamenodeProtocolTranslatorPB.java:313)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180954961Z     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180964615Z     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180973801Z     at java.lang.reflect.Method.invoke(Method.java:498)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180982270Z     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180991015Z     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180999277Z     at com.sun.proxy.$Proxy48.append(Unknown Source)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181018375Z     at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1756)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181026889Z     at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1792)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181034905Z     at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1785)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181042849Z     at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:323)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181051198Z     at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:319)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181059184Z     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181067251Z     at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:319)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181097242Z     at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1173)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181107244Z     at io.confluent.connect.hdfs.wal.WALFile$Writer.<init>(WALFile.java:221)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181115797Z     at io.confluent.connect.hdfs.wal.WALFile.createWriter(WALFile.java:67)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181124107Z     at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:73)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181132291Z     ... 26 more
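
In case anyone else hits these lease errors: one way to manually force lease recovery on the stuck WAL file might be the following (just a sketch; hdfs debug recoverLease is available from Hadoop 2.7 onwards, and the path is copied from the RecoveryInProgressException above):

$ hdfs debug recoverLease -path /logs/application_record_counts/1/log -retries 3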
blootsvoets commented 7 years ago

It looks like this is the same thing that gave errors before: empty files. See confluentinc/kafka-connect-hdfs#53. To get it running again, they removed the empty files that were causing the errors.
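
To track them down, something like this might work for listing zero-length files under the connector's WAL directory (untested sketch; $5 is the size column in hadoop fs -ls output, and the rm path is a placeholder):

$ hadoop fs -ls -R /logs | awk '$1 !~ /^d/ && $5 == 0 {print $8}'
# after double-checking each path, remove it so the connector can recover:
$ hadoop fs -rm /logs/<topic>/<partition>/log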

sboettcher commented 7 years ago

ah yes I remember...

sboettcher commented 7 years ago

I didn't do anything except run hadoop fs -ls /logs/*/*, which showed that no log files existed at all for the relevant topics (not even empty ones), and now it is committing again... maybe the container restart helped anyway.
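
For reference, the restart was just the usual compose restart of the connector service (assuming the service is named radar-hdfs-connector, as the log prefix above suggests):

$ docker-compose restart radar-hdfs-connector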

sboettcher commented 7 years ago

So, being optimistic and assuming that the missing data is now being completely recovered, what options are there to make sure the connector will not lag behind like this again? It is really quite slow at the moment, processing maybe 1-2 HDFS files per second, while not being load intensive at all.

root@bd4fef958418:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  4.7 27.8 14484272 6877400 ?    Ssl  08:02  16:02 java -Xms256m -Xmx8g -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Djava.awt.headless=true -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/connect-log4j.properties -cp /etc/kafka-connect/jars/*:/usr/share/java/confluent-common/*:/usr/share/java/kafka-serde-tools/*:/usr/share/java/monitoring-interceptors/*:/usr/share/java/kafka-connect-elasticsearch/*:/usr/share/java/kafka-connect-hdfs/*:/usr/share/java/kafka-connect-jdbc/*:/usr/bin/../share/java/kafka/*:/usr/bin/../share/java/confluent-support-metrics/*:/usr/share/java/confluent-support-metrics/* org.apache.kafka.connect.cli.ConnectStandalone /etc/kafka-connect/kafka-connect.properties /etc/kafka-connect/sink-hdfs.properties
root     19060  0.0  0.0  21964  3604 pts/0    Ss   13:39   0:00 /bin/bash
root     19357  0.0  0.0  19184  2308 pts/0    R+   13:41   0:00 ps aux

And being more than one day behind after only two days of recording from 3+ sources will not be feasible in the long run, IMO. Might the bottleneck be somewhere else? I can't see any other containers under heavy load in cAdvisor, though.
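
One way to quantify how far behind the connector actually is would be to check its consumer group lag (a sketch; for a sink connector the group is named connect-<connector name>, so substitute the name configured in sink-hdfs.properties and your broker address):

$ kafka-consumer-groups --bootstrap-server <broker>:9092 --describe --group connect-<connector-name>
# the LAG column shows, per partition, how many records the sink has not yet consumed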

blootsvoets commented 7 years ago

Solved by increasing the Kafka HDFS connector flush size.
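
For reference, the setting lives in the sink connector properties file (path as in the ps output earlier); the values below are only an illustration and should be tuned to the ingest rate:

$ grep -E 'flush.size|rotate' /etc/kafka-connect/sink-hdfs.properties
flush.size=80000
rotate.interval.ms=900000

With a larger flush.size the connector commits fewer, larger avro files to HDFS, which also reduces the number of files the restructure tool has to scan.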