awslabs / amazon-kinesis-producer

Amazon Kinesis Producer Library
Apache License 2.0

addUserRecord call throws DaemonException #39

Closed heikkiv closed 5 years ago

heikkiv commented 8 years ago

Sometimes calling addUserRecord starts to throw:

com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
    at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:171) ~[amazon-kinesis-producer-0.10.2.jar:na]
    at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:467) ~[amazon-kinesis-producer-0.10.2.jar:na]
    at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:338) ~[amazon-kinesis-producer-0.10.2.jar:na]

The KPL does not seem to recover from this. All further calls to addUserRecord also fail. Restarting the KPL Java process fixes the situation.

This seems to happen when the Kinesis stream is throttling requests, so my guess is that the native process can't write to the stream quickly enough and runs out of memory. If that's the case, my expectation would be that the native process should start discarding older data, and of course that if the native process dies, the KPL should recover to a working state.
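
For reference, a minimal sketch of applying backpressure from the producing side, so the KPL's internal buffer cannot grow without bound while Kinesis is throttling (the threshold and helper method are illustrative, not part of the KPL):

private static final int MAX_OUTSTANDING_RECORDS = 10_000; // illustrative threshold

void putWithBackpressure(KinesisProducer producer, String stream, String key, ByteBuffer data)
        throws InterruptedException {
    // Block the caller while too many records are still buffered in the native process.
    while (producer.getOutstandingRecordsCount() > MAX_OUTSTANDING_RECORDS) {
        Thread.sleep(10); // simple backoff; production code might use metrics or a bounded queue instead
    }
    producer.addUserRecord(stream, key, data);
}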

cgpassante commented 8 years ago

I received the same error in production yesterday on a service that has been running smoothly for months...

2016-01-27T00:10:10,299 ERROR [com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask] ShardId shardId-000000000000: Application processRecords() threw an exception when processing shard com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.

MusikPolice commented 8 years ago

I'm running a Lambda function that uses the KPL, and it appears to crash with this exception when the Lambda runs out of memory.

quiqua commented 7 years ago

We keep running into similar issues when using the KPL library on AWS Lambda. At the moment we are working around them by aggressively calling flushSync and reducing RecordMaxBufferedTime, along the lines of the sketch below.
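
A minimal sketch of that kind of setup (the region, stream name, payload, and the 50 ms buffering time are illustrative, not from the original comment):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;

public class LowLatencyProducerSketch {
    public static void main(String[] args) {
        // Keep records in the native buffer only briefly so an invocation
        // does not end with data still queued in the child process.
        KinesisProducerConfiguration config = new KinesisProducerConfiguration()
                .setRegion("eu-west-1")
                .setRecordMaxBufferedTime(50); // milliseconds

        KinesisProducer producer = new KinesisProducer(config);

        producer.addUserRecord("my-stream", "partition-key",
                ByteBuffer.wrap("payload".getBytes(StandardCharsets.UTF_8)));

        // Block until everything buffered so far has been sent (or has failed).
        producer.flushSync();
        producer.destroy();
    }
}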

pfifer commented 7 years ago

Thanks for reporting this. We are investigating this, but could use some additional information. It appears that people are still seeing this with the 0.12.x versions of the KPL. How commonly does this occur?

We will investigate adding memory usage tracking for the KPL native process to help determine how much memory it's consuming.

Can everyone who is affected by this please respond or add a reaction to help us prioritize this issue.

xujiaxj commented 7 years ago

We upgraded to 0.12.3 in our production environment two days ago, but since then this issue has occurred much more frequently than with the previous 0.10.2 version. (Please note that as part of the upgrade, we also upgraded the AWS SDK from 1.10.49 to 1.11.45.)

We have 10 m4.2xlarge instances in eu-west-1 that produce a high amount of traffic to our Kinesis stream in us-west-2, and they are the ones that frequently run into this issue. Other producer instances in other regions seem to be fine, but they have less traffic.

Our motivation for upgrading was the hope that it would solve the memory leak issue, but it looks like stability has degraded instead.

I will open a separate support ticket to give more information.

yeshodhan commented 7 years ago

Facing a similar issue on our side:

com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
    at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:171)
    at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:467)
    at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:338)

we are using amazon-kinesis-producer - v0.10.2 and amazon-kinesis-client v1.7.0, aws java sdk v1.11.77.

The issue is intermittent and I'm unable to figure out the root cause.

quiqua commented 7 years ago

Well, we moved away from the AWS KPL and are now using the AWS SDK for Java to stream data to Kinesis. The KPL didn't work out for us in the end, as we saw those errors quite frequently.

xujiaxj commented 7 years ago

On a few occasions when we ran into the crashed KPL, we also observed that our JVM could not create more threads, even though the thread count in the JVM was very stable.

Exception in thread "qtp213045289-24084" java.lang.OutOfMemoryError: unable to create new native thread

I wonder whether that's because the KPL process has somehow prevented its parent JVM process from getting more native threads.

ramachpa commented 7 years ago

We see this when calling the flushSync method as well. We initially thought the host did not have enough memory for the KPL to do its job, but even after increasing the memory, we continue to see this.

com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
    at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:175)
    at com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:716)
    at com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:735)
    at com.amazonaws.services.kinesis.producer.KinesisProducer.flushSync(KinesisProducer.java:760)

sshrivastava-incontact commented 7 years ago

I want to point out one observation, not sure how helpful this will be... I tried (on Windows) with DATA_SIZE = 10; SECONDS_TO_RUN = 1; RECORDS_PER_SECOND = 1;

and the following is the log (which points to the file/pipe \\.\pipe\amz-aws-kpl-in-pipe- not being found):

[main] INFO com.amazonaws.services.kinesis.producer.KinesisProducer - Extracting binaries to C:\Users\xxxx~1\AppData\Local\Temp\amazon-kinesis-producer-native-binaries
[main] INFO com.amazonaws.samples.SampleProducer - Starting puts... will run for 1 seconds at 1 records per second
[pool-1-thread-1] INFO com.amazonaws.samples.SampleProducer - Put 1 of 1 so far (100.00 %), 0 have completed (0.00 %)
[main] INFO com.amazonaws.samples.SampleProducer - Waiting for remaining puts to finish...
Exception in thread "main" com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
    at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:171)
    at com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:708)
    at com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:727)
    at com.amazonaws.services.kinesis.producer.KinesisProducer.flushSync(KinesisProducer.java:752)
    at com.amazonaws.samples.SampleProducer.main(SampleProducer.java:262)
[pool-3-thread-2] ERROR com.amazonaws.services.kinesis.producer.KinesisProducer - Error in child process
com.amazonaws.services.kinesis.producer.IrrecoverableError: Unexpected error connecting to child process
    at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:502)
    at com.amazonaws.services.kinesis.producer.Daemon.access$1200(Daemon.java:61)
    at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:447)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.nio.file.NoSuchFileException: \\.\pipe\amz-aws-kpl-in-pipe-878f15f7
    at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
    at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
    at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
    at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown Source)
    at java.nio.channels.FileChannel.open(Unknown Source)
    at java.nio.channels.FileChannel.open(Unknown Source)
    at com.amazonaws.services.kinesis.producer.Daemon.connectToChild(Daemon.java:329)
    at com.amazonaws.services.kinesis.producer.Daemon.access$1000(Daemon.java:61)
    at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:444)
    ... 5 more
[kpl-callback-pool-0-thread-0] ERROR com.amazonaws.samples.SampleProducer - Exception during put
com.amazonaws.services.kinesis.producer.IrrecoverableError: Unexpected error connecting to child process
    at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:502)
    at com.amazonaws.services.kinesis.producer.Daemon.access$1200(Daemon.java:61)
    at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:447)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.nio.file.NoSuchFileException: \\.\pipe\amz-aws-kpl-in-pipe-878f15f7
    at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
    at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
    at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
    at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown Source)
    at java.nio.channels.FileChannel.open(Unknown Source)
    at java.nio.channels.FileChannel.open(Unknown Source)
    at com.amazonaws.services.kinesis.producer.Daemon.connectToChild(Daemon.java:329)
    at com.amazonaws.services.kinesis.producer.Daemon.access$1000(Daemon.java:61)
    at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:444)
    ... 5 more
buremba commented 7 years ago

We plan to use this library for high workloads, but it looks like it doesn't prevent the native process from crashing under high load. When the process dies, our application stops working; it would be great if you could focus on this issue.

pfifer commented 7 years ago

@sshrivastava-incontact what you're seeing is related to running the KPL 0.12.x on Windows, which isn't currently supported.

For those running on Linux, and Mac OS X: The newest version of the KPL includes some additional logging about how busy the sending process is. See the Release Notes for 0.12.4 on the meaning of the log messages. Under certain circumstances the native component can actually run itself out of threads, which will trigger a failure of the native process.

JasonQi-swe commented 7 years ago

When will the KPL work on Windows? I am OK with version 0.12.x not working on Windows. However, version 0.10.x does not work either, since I always get this exception:

ERROR KinesisProducer:148 - Error in child process
com.amazonaws.services.kinesis.producer.IrrecoverableError: Unexpected error connecting to child process
    at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:502)
    at com.amazonaws.services.kinesis.producer.Daemon.access$1200(Daemon.java:61)
    at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:447)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: \\.\pipe\amz-aws-kpl-in-pipe-6cd6f933
    at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:79)
    at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
    at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
    at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(WindowsFileSystemProvider.java:115)
    at java.nio.channels.FileChannel.open(FileChannel.java:287)
    at java.nio.channels.FileChannel.open(FileChannel.java:335)
    at com.amazonaws.services.kinesis.producer.Daemon.connectToChild(Daemon.java:329)
    at com.amazonaws.services.kinesis.producer.Daemon.access$1000(Daemon.java:61)
    at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:444)
    ... 5 more

And it is not only me; this person ran into the same problem:
https://stackoverflow.com/questions/43113791/getting-error-while-running-amazon-kinesis-producer-sample

Looking forward to your update and thank you so much!

Sincerely

oletap commented 7 years ago

I am using Kinesis Producer 0.10.2. It was running fine on Windows 7, but when I try to set it up on CentOS 6.5, I get the error below:

Error in child process
com.amazonaws.services.kinesis.producer.IrrecoverableError: Error starting child process
    at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:502)
    at com.amazonaws.services.kinesis.producer.Daemon.startChildProcess(Daemon.java:455)
    at com.amazonaws.services.kinesis.producer.Daemon.access$100(Daemon.java:61)
    at com.amazonaws.services.kinesis.producer.Daemon$1.run(Daemon.java:128)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Cannot run program "/tmp/amazon-kinesis-producer-native-binaries/kinesis_producer_544f70b9f23702394d1c2f983b07d4c0242aff01": error=13, Permission denied
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at com.amazonaws.services.kinesis.producer.Daemon.startChildProcess(Daemon.java:453)
    ... 7 more
Caused by: java.io.IOException: error=13, Permission denied
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 8 more

Can anybody help me on this? Thanks.

bijith commented 7 years ago

Getting the same exception while trying https://github.com/awslabs/amazon-kinesis-producer/blob/master/java/amazon-kinesis-producer-sample/src/com/amazonaws/services/kinesis/producer/sample/SampleProducer.java

17/09/01 16:53:58 INFO SampleProducer: Put 1998 of 10000 so far (19.98 %), 0 have completed (0.00 %)
17/09/01 16:53:58 ERROR SampleProducer: Error running task
com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
    at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:176)
    at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:477)
udayravuri commented 7 years ago

It seems like switching to root on Linux and running the example SampleProducer.java lets some puts go through before it fails.

Environment: KPL v 0.12.5 on Linux (Linux els-d93322 2.6.34.7-66.fc13.i686.PAE #1 SMP Wed Dec 15 07:21:49 UTC 2010 i686 i686 i386 GNU/Linux)

Looking at this log line suggests that it is trying to extract binaries to /tmp: [com.amazonaws.services.kinesis.producer.sample.SampleProducer.main()] INFO com.amazonaws.services.kinesis.producer.KinesisProducer - Extracting binaries to /tmp/amazon-kinesis-producer-native-binaries

So I switched to root and now it at least does 2 puts and then fails again with the same exception. But at least it did 2 puts:

[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ amazon-kinesis-producer-sample ---
[com.amazonaws.services.kinesis.producer.sample.SampleProducer.main()] INFO com.amazonaws.services.kinesis.producer.KinesisProducer - Extracting binaries to /tmp/amazon-kinesis-producer-native-binaries
[com.amazonaws.services.kinesis.producer.sample.SampleProducer.main()] INFO com.amazonaws.services.kinesis.producer.sample.SampleProducer - Starting puts... will run for 5 seconds at 2000 records per second
[kpl-daemon-0000] INFO com.amazonaws.services.kinesis.producer.Daemon - Asking for trace
[pool-1-thread-1] INFO com.amazonaws.services.kinesis.producer.sample.SampleProducer - Put 1999 of 10000 so far (19.99 %), 0 have completed (0.00 %)
[pool-1-thread-1] INFO com.amazonaws.services.kinesis.producer.sample.SampleProducer - Put 3999 of 10000 so far (39.99 %), 0 have completed (0.00 %)
[pool-1-thread-1] ERROR com.amazonaws.services.kinesis.producer.sample.SampleProducer - Error running task
com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
    at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:176)
    at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:477)
    at com.amazonaws.services.kinesis.producer.sample.SampleProducer$2.run(SampleProducer.java:216)
    at com.amazonaws.services.kinesis.producer.sample.SampleProducer$4.run(SampleProducer.java:298)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

udayravuri commented 7 years ago

Regarding the exception above, it was my mistake. I was attempting to write to a Kinesis Stream while the actual endpoint was a Kinesis Firehose. After making that correction, I was able to successfully write to Kinesis Firehose using this example: https://github.com/awslabs/aws-big-data-blog/blob/master/aws-blog-firehose-lambda/kinesisFirehose/src/main/java/com/amazonaws/proserv/lambda/KinesisToFirehose.java

ppearcy commented 7 years ago

Any chance anyone has any workarounds? Has anyone had any version of the KPL working on any version of Windows with any version of Java?

The issue looks similar: https://bugs.openjdk.java.net/browse/JDK-8152922

Is the only known workaround to use the native Java producer?

It might be a good idea to update the README, as it is a pretty nasty surprise given the claim that 0.10.2 works on Windows.

bk-mz commented 6 years ago

From what we have observed, it's not a bug, but by-design behaviour.

The thing is, the message to Kinesis is stored (a put operation) on a blocking queue, whose put method declares InterruptedException as a checked exception in its signature.

This put is performed on the calling thread in the KPL code, so if there is any possibility of the calling thread being interrupted, the catch logic will be invoked:

/**
 * Enqueue a message to be sent to the child process.
 * 
 * @param m
 */
public void add(Message m) {
    if (shutdown.get()) {
        throw new DaemonException(
                "The child process has been shutdown and can no longer accept messages.");
    }

    try {
        outgoingMessages.put(m); //<-- HERE 
    } catch (InterruptedException e) {
        fatalError("Unexpected error", e);
    }
}

The fatalError call will actually terminate the daemon:

private synchronized void fatalError(String message, Throwable t, boolean retryable) {
    if (!shutdown.getAndSet(true)) {
        if (process != null) {
            process.destroy(); //<-- HERE PROCESS IS DESTROYED
        }
        try {
            executor.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) { }
        executor.shutdownNow();
        // other code
    }
}

So, to work around this issue, make sure you are not invoking addUserRecord on a thread which might be interrupted, e.g. any of your server's request-handling threads.

This piece of code did the trick for us; make sure something similar is happening in your code:

private void sendToKinesis(ByteBuffer buffer) {
    // Hand the put off to another thread pool (runAsync uses the common
    // ForkJoinPool unless an executor is supplied), so the calling thread
    // can never interrupt the blocking queue put inside the KPL.
    CompletableFuture.runAsync(() -> {
        try {
            streamProducer.addUserRecord(...);
        } catch (Throwable e) {
            log.error("Failed to send to Kinesis: {}", e.getMessage());
        }
    });
}
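
A variant of the same idea (a sketch reusing the streamProducer and log fields from the snippet above; the dedicated single-thread executor is illustrative) keeps all puts on one worker thread that is never interrupted:

private final ExecutorService kinesisExecutor = Executors.newSingleThreadExecutor();

private void sendToKinesis(String stream, String partitionKey, ByteBuffer buffer) {
    // Daemon.add() now always runs on kinesisExecutor's worker thread,
    // so it cannot hit the InterruptedException -> fatalError path described above.
    kinesisExecutor.submit(() -> {
        try {
            streamProducer.addUserRecord(stream, partitionKey, buffer);
        } catch (Throwable e) {
            log.error("Failed to send to Kinesis: {}", e.getMessage());
        }
    });
}
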
spjegan commented 6 years ago

We are seeing this same issue with version 0.10.2 on Linux. I do not really have a root cause. Is there any work-around for this issue? Has anyone tried the above proposed work-around?

Thanks.

pfifer commented 6 years ago

@spjegan 0.10.x is no longer supported, and there is a required upgrade to 0.12.x. In 0.12.x automatic restarts of a failed Daemon process were added.

pfifer commented 6 years ago

@head-thrash In 0.12.x a higher-level automatic restart was added that should restart the native process if it exits. You're correct that during the process exit you will receive the exception, but after the process is restarted, publishing should be available again.

pfifer commented 6 years ago

@ppearcy Windows support was added in Release 0.12.6.

pfifer commented 6 years ago

@udayravuri During startup the library attempts to extract the native component to the temp directory. The user your application is running as must have write access to that directory. You can use setTempDirectory on the configuration to control where it extracts to; a sketch is below. There is a bug (#161) right now with the certificates that we're looking into.
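
A minimal sketch of that configuration (the path and region are illustrative):

// Extract the native binary to a directory the application user can write to,
// instead of the default temp directory.
KinesisProducerConfiguration config = new KinesisProducerConfiguration()
        .setTempDirectory("/var/lib/my-app/kpl-binaries")
        .setRegion("us-west-2");

KinesisProducer producer = new KinesisProducer(config);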

pfifer commented 6 years ago

@xujiaxj The threading issue was fixed/mitigated in #100, with more information to assist in determining configuration added in #102. These were released in 0.12.4

xujiaxj commented 6 years ago

@pfifer thanks

bk-mz commented 6 years ago

@pfifer Still, the method returns a ListenableFuture, which leads to the false assumption that all exceptions will be surfaced through the future.

Developers write onFailure handlers, only to discover that the exception is thrown synchronously somewhere in production :(

lasseNedergaard commented 6 years ago

Hi, I am running Flink on an EMR cluster and sending data through the Flink Kinesis producer. Flink uses 0.12.5 and I get many of the errors below. Any idea how to work around this problem, or how I can help debug it?

org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:176)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:477)
    at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer.invoke(FlinkKinesisProducer.java:248)

pfifer commented 6 years ago

@head-thrash It's somewhat analogous to an Executor, which will throw RejectedExecutionException once it has been shut down. The change to automatically restart the KPL does make it a bit different, though. Unfortunately, changing this behavior would be a breaking change, which is something we won't include in minor versions.
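
A tiny illustration of that analogy with a plain ExecutorService (not KPL code):

ExecutorService executor = Executors.newSingleThreadExecutor();
executor.shutdown();

// Analogous to calling addUserRecord after the KPL daemon has shut down:
executor.submit(() -> System.out.println("too late")); // throws RejectedExecutionException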

pfifer commented 6 years ago

@lnetrackunit Are there any logs from LogInputStreamReader from around the time you see that message? It indicates that the native process has either exited or failed.

lasseNedergaard commented 6 years ago

Sorry, I couldn't find any. When I encountered this problem I was recovering from a failure, so I was starting up, writing as much as possible, and writing to 3 streams from the same job. In between the errors I also got rate-exceeded exceptions, but I tried to work around those by increasing the number of shards.

yjc801 commented 6 years ago

Also ran into this issue, but in our case it was simply because the AWS credentials were missing, so the KPL couldn't start the child process. The exception is caught in startChildProcess, which then calls fatalError and sets the shutdown flag to true; therefore, when add is called, it checks the flag and throws DaemonException: "The child process has been shutdown and can no longer accept messages.".
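
Based on that, a sketch of failing fast on missing credentials before constructing the producer (the explicit provider and region are illustrative, not from the original setup):

// AWSCredentialsProvider and DefaultAWSCredentialsProviderChain come from com.amazonaws.auth.
AWSCredentialsProvider credentials = new DefaultAWSCredentialsProviderChain();
credentials.getCredentials(); // throws here immediately if no credentials can be resolved

KinesisProducerConfiguration config = new KinesisProducerConfiguration()
        .setCredentialsProvider(credentials)
        .setRegion("us-east-1");

KinesisProducer producer = new KinesisProducer(config);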

Hope it's helpful.

vikrantch-hk commented 6 years ago

@yjc801 did you find any solution? We are also facing the same issue.

lasseNedergaard commented 6 years ago

No. I had to restart and accept losing data, and after that I didn't see the exception.

pfifer commented 6 years ago

I'll see if I can make the KPL log more at startup than it does today. It might help catch these cases where the KPL is missing something it needs at startup.

lasseNedergaard commented 6 years ago

News ;) I was in the process of migrating to a new Flink cluster. I had deployed our upgraded pipeline that reads from Kinesis and writes results back again. It began to fail, and I found the first error below in the log. Both Kinesis and our EMR cluster are running in eu-west-1. I figured out that I was missing a stream.

As a side note, another job had a Kinesis read timeout during the night, while at the same time other jobs ran without any problems. See the second error below. Perhaps the KPL doesn't handle network outages well.

@pfifer Can you see a valid solution, or is the best way out of this to install Kafka?

2018-03-21 07:54:36,545 WARN org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer - An exception occurred while processing a record
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.UserRecordFailedException
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.onPutRecordResult(KinesisProducer.java:197)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.access$000(KinesisProducer.java:131)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler$1.run(KinesisProducer.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-03-21 07:54:36,545 WARN org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer - An exception occurred while processing a record
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.UserRecordFailedException
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.onPutRecordResult(KinesisProducer.java:197)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.access$000(KinesisProducer.java:131)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler$1.run(KinesisProducer.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-03-21 07:54:36,562 WARN org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.LogInputStreamReader - [2018-03-21 07:54:36.561726] [0x00000210][0x00007f90e523c700] [warning] AWS Log: WARNEncountered AWSError ResourceNotFoundException Stream mtecparsed under account 24........... Not found

org.apache.flink.kinesis.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to kinesis.eu-west-1.amazonaws.com:443 [kinesis.eu-west-1.amazonaws.com/52.94.221.240] failed: Read timed out
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1068)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1034)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980)
    at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getRecords(KinesisProxy.java:201)
    at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.getRecords(ShardConsumer.java:293)
    at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:201)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.kinesis.shaded.org.apache.http.conn.ConnectTimeoutException: Connect to kinesis.eu-west-1.amazonaws.com:443 [kinesis.eu-west-1.amazonaws.com/52.94.221.240] failed: Read timed out
    at org.apache.flink.kinesis.shaded.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
    at org.apache.flink.kinesis.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
    at sun.reflect.GeneratedMethodAccessor221.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.conn.$Proxy137.connect(Unknown Source)
    at org.apache.flink.kinesis.shaded.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
    at org.apache.flink.kinesis.shaded.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
    at org.apache.flink.kinesis.shaded.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
    at org.apache.flink.kinesis.shaded.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.flink.kinesis.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.flink.kinesis.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1189)
    at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029)
    ... 18 more
Caused by: java.net.SocketTimeoutException: Read timed out

lasseNedergaard commented 6 years ago

I have to add that my job is writing to 4 different Kinesis streams, and on startup I process 10K messages/sec. And I see this in my log:

2018-03-21 12:31:25,155 WARN org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer - An exception occurred while processing a record
java.lang.RuntimeException: Unexpected error
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:521)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:501)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:183)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:718)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:737)
    at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer.flushSync(FlinkKinesisProducer.java:333)
    at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer.close(FlinkKinesisProducer.java:260)
    at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
    at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:446)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:351)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
    at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
    at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:181)
    ... 10 more

lasseNedergaard commented 6 years ago

I have tried to throttle our read input, and by doing that I can get it to work, so I guess it has something to do with high load and perhaps memory pressure.

fmthoma commented 6 years ago

I have tried to use the Java KPL twice, once inside a Lambda script that reads from one Kinesis stream and writes to another (KPL 0.12.8), and once in a Flink job using flink-connector-kinesis (KPL 0.12.5). In both cases I ran almost immediately into this exception (within the first second of the application running). For the Lambda, I could get it to work by switching to the bare AWS Kinesis client (using the kinesis aggregator manually), but unfortunately this is not an option for the Flink job.

I've tried rate limiting down to 1% of the shard capacity, and the exception still occurs within the first second.

What am I doing wrong? Is there something I can do to make the KPL work and not throw this exception?

fmthoma commented 6 years ago

What am I doing wrong?

I was trying to run the KPL on an Alpine Docker base image, which is apparently a bad idea. Switching to Debian slim did the trick. Unfortunately, the exception does not convey the reason, and Flink does not show logs other than the caught exceptions.

pfifer commented 6 years ago

Unfortunately the KPL is built against glibc, while Alpine Linux uses musl libc. This causes the native component to fail runtime linking and crash. There appear to be some Docker images that include glibc, but I can't vouch for whether they would work or not.

linksach3 commented 6 years ago

Hi, I am using the KPL (0.12.8) in an AWS Lambda function coded in Java 8. Here is my code snippet:

// put records and save the Futures
for (MyRecord myRecord : MyRecordsList) {
    putFutures.add(kinesisProducer.addUserRecord(kinesis_stream_name, randomPartitionKey(),
            ByteBuffer.wrap(convertToJson(myRecord).getBytes("UTF-8"))));
}

// Wait for puts to finish and check the results.
LOGGER.info("Waiting for KinesisProducer puts to finish...");
for (Future<UserRecordResult> f : putFutures) {
    UserRecordResult result = f.get();
    if (result.isSuccessful()) {
        // Increment the success count for this run
        recordsPut.getAndIncrement();
        long totalTime = result.getAttempts().stream().mapToLong(a -> a.getDelay() + a.getDuration()).sum();
        LOGGER.debug(
                String.format("Successfully put record #%d, sequence_no %s took %d attempts, totalling %d ms.",
                        recordsPut.get(), result.getSequenceNumber(), result.getAttempts().size(), totalTime));
    } else {
        Attempt last = Iterables.getLast(result.getAttempts());
        LOGGER.error(
                String.format("Record failed to put - %s : %s", last.getErrorCode(), last.getErrorMessage()));
        throw new UserRecordFailedException(result);
    }
}
kinesisProducer.flushSync();

This code works fine most of the time. I just ran a load of millions of records and I see that hundreds of these failed with the error below:

com.amazonaws.services.kinesis.producer.UserRecordFailedException
java.util.concurrent.ExecutionException: com.amazonaws.services.kinesis.producer.UserRecordFailedException
    at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[task/:?]
    at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[task/:?]
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[task/:?]

at the code line f.get()

Can you please suggest something as a workaround? I am using all default configurations for the KPL other than the following: config.setAggregationMaxSize(512000);

linksach3 commented 6 years ago

Also, in addition to the above, I am seeing the errors below:

java.lang.RuntimeException: EOF reached during read
java.util.concurrent.ExecutionException: java.lang.RuntimeException: EOF reached during read
    at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[task/:?]
    at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[task/:?]
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[task/:?]

linksach3 commented 6 years ago

Also: com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.

cgpassante commented 6 years ago

I have been able to cause this problem by calling the KPL from a catch block for an InterruptedException in a child thread. I was using the KPL for logging and attempted to log when a child thread was interrupted from the catch block. Unfortunately the KPL has a shutdown hook that runs when the thread it was called from exits, not the thread where the KPL was instantiated. The hook kills the daemon process that writes events to the shards. I would say this is a KPL bug and that the daemon should only be killed when the parent thread of the KPL object is killed.

rakhu commented 6 years ago

Getting these failures frequently. 0.12.9 jar, Windows 2012 R2 Standard, VMware, Intel 2 GHz (4 processors), 6 GB RAM. During this time the CPU goes to 100% and brings down the Windows server. As a workaround, in onFailure I call destroy (flushSync and destroy didn't help), but in this case I lose all the outstanding records. Can you help me find a way to solve this? At the very least I need a way to get the messages back and reprocess them.

Note: All Kinesis default config used.

vijaychd commented 6 years ago

Worked fine locally, but I am getting heap size issues running in a Linux Docker container. How much heap does this need?

indraneelr commented 5 years ago

Any updates or workarounds for this issue? Still facing this with 0.12.9.

org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:176)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:784)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:804)
        at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer.flushSync(FlinkKinesisProducer.java:404)
        at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer.close(FlinkKinesisProducer.java:305)
        at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
        at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:477)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:378)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
        at java.lang.Thread.run(Thread.java:748)
1513132 [kpl-daemon-0000] ERROR org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer - Error in child process
java.lang.RuntimeException: Child process exited with code 137
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:533)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:509)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.startChildProcess(Daemon.java:487)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.access$100(Daemon.java:63)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon$1.run(Daemon.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
1513132 [kpl-daemon-0000] INFO org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer - Restarting native producer process.
1513243 [kpl-daemon-0004] ERROR org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer - Error in child process
java.lang.RuntimeException: Error writing message to daemon
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:533)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:513)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.sendMessage(Daemon.java:234)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.access$400(Daemon.java:63)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon$2.run(Daemon.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.sendMessage(Daemon.java:230)
        ... 5 more
1513245 [kpl-daemon-0004] INFO org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer - Restarting native producer process.
Asamkhata071 commented 5 years ago

@pfifer Using version 12.11 of amazon-kinesis-producer, which I downloaded from https://jar-download.com/artifacts/com.amazonaws/amazon-kinesis-producer. Running on Windows 10. Trying to run the sample code from https://docs.aws.amazon.com/streams/latest/dev/kinesis-kpl-writing.html (Responding to Results Synchronously).

I get the following error:

[main] INFO com.amazonaws.services.kinesis.producer.KinesisProducer - Extracting binaries to C:\Users\AppData\Local\Temp\amazon-kinesis-producer-native-binaries
[main] INFO com.amazonaws.services.kinesis.producer.HashedFileCopier - 'C:\Users\AppData\Local\Temp\amazon-kinesis-producer-native-binaries\kinesis_producer_36877407482F2EE24DFC2F3B20F02F0404BFA4EC.exe' already exists, and matches. Not overwriting.
[kpl-daemon-0001] ERROR com.amazonaws.services.kinesis.producer.KinesisProducer - Error in child process
com.amazonaws.services.kinesis.producer.IrrecoverableError: Unexpected error connecting to child process
    at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:537)
    at com.amazonaws.services.kinesis.producer.Daemon.access$1200(Daemon.java:63)
    at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:460)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.nio.file.NoSuchFileException: \\.\pipe\amz-aws-kpl-in-pipe-f035bcfc
    at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
    at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
    at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
    at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown Source)
    at java.nio.channels.FileChannel.open(Unknown Source)
    at java.nio.channels.FileChannel.open(Unknown Source)
    at com.amazonaws.services.kinesis.producer.Daemon.connectToChild(Daemon.java:347)
    at com.amazonaws.services.kinesis.producer.Daemon.access$1000(Daemon.java:63)
    at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:457)
    ... 3 more
[kpl-daemon-0001] INFO com.amazonaws.services.kinesis.producer.KinesisProducer - Restarting native producer process.
java.util.concurrent.ExecutionException: com.amazonaws.services.kinesis.producer.IrrecoverableError: Unexpected error connecting to child process
    at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
    at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
    at KPLSimple.main(KPLSimple.java:58)

Which version of the amazon-kinesis-producer jar (aka the KPL jar) should I be using for Windows 10?

Asamkhata071 commented 5 years ago

@pfifer

Also tried using version 0.10.2, with the same program and configuration as above. I get the following error:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
starts.
[2019-04-23 10:54:38.002344] [0x00005134] [info] [metrics_manager.h:148] Uploading metrics to monitoring.us-east-2.amazonaws.com:443
[2019-04-23 10:54:38.060220] [0x00005134] [info] [kinesis_producer.cc:79] Created pipeline for stream "KinesisStream"
[2019-04-23 10:54:38.060220] [0x00005134] [info] [shard_map.cc:83] Updating shard map for stream "KinesisStream"
[2019-04-23 10:54:38.158691] [0x00005240] [error] [http_client.cc:148] Failed to open connection to monitoring.us-east-2.amazonaws.com:443 : certificate verify failed
[2019-04-23 10:54:38.249706] [0x00005240] [error] [http_client.cc:148] Failed to open connection to kinesis.us-east-2.amazonaws.com:443 : certificate verify failed

I don't encounter this "certificate verify failed" error with higher versions of the producer (e.g. 0.12.x). Why is that?