shubhamb12 opened this issue 2 days ago
Hi,

This is the relevant part of the traceback, which also explains how to fix this:

java.io.IOException: Unable to write jffi binary stub to /tmp. Set TMPDIR or Java property java.io.tmpdir to a read/write path that is not mounted "noexec".

If you don't mind me asking, I'd like to know a bit more about the environment where it happens: is /tmp mounted noexec? The answer wouldn't change anything in the advice offered by the error message, but it would help us understand when the issue happens.
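As a concrete illustration of that advice, here is a minimal sketch of redirecting the temp directory before anything loads jffi. The path /opt/flink/tmp is a made-up example, and setting the property from code only works if nothing has cached java.io.tmpdir yet; passing -Djava.io.tmpdir=... on the JVM command line (e.g. via Flink's env.java.opts) or exporting TMPDIR is the more reliable route.

```java
// Sketch only: point the JVM temp dir at a directory that is writable and
// not mounted "noexec", so jffi can unpack its native stub there.
// The path below is a placeholder, not a recommendation.
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class TmpDirWorkaround {
    public static void main(String[] args) throws Exception {
        Path tmp = Paths.get("/opt/flink/tmp");
        Files.createDirectories(tmp);

        // Must happen before any library reads and caches java.io.tmpdir;
        // a -Djava.io.tmpdir=/opt/flink/tmp JVM flag avoids the ordering issue.
        System.setProperty("java.io.tmpdir", tmp.toString());

        // ... initialize the StatsD client / submit the Flink job afterwards ...
    }
}
```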
Hi, thanks for the response. The problem is we have around 400 pipelines and this issue has never cropped up for any of them. In fact, for this particular pipeline the issue is transient: it occurs sometimes when deploying or when new pods come up, and many times it doesn't occur at all, without any configuration change. So I wonder if there is a race condition that creates the need for the /tmp folder. We don't have /tmp configured for any of the other pipelines. We have both quite complex and quite simple Flink pipelines that have never had this issue whatsoever, which makes this one a little special for us. The only way this pipeline differs is that it has a very high message throughput.

Any guidance to permanently fix it would be appreciated.

OS: Ubuntu 22.04.4
Our base build Dockerfile:
FROM docker.io/library/flink:1.18.1-scala_2.12-java11
Thanks
Btw, if I disable the statsd client by swapping in the no-op StatsD client, there is no error. So this error specifically originates from this lib.
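Roughly what that toggle looks like, as a sketch: the flag, prefix, and host/port values below are placeholders rather than our real configuration, and the class names are the standard ones from java-dogstatsd-client.

```java
import com.timgroup.statsd.NoOpStatsDClient;
import com.timgroup.statsd.NonBlockingStatsDClientBuilder;
import com.timgroup.statsd.StatsDClient;

public final class MetricsFactory {
    // With the no-op client nothing touches jffi/jnr, and the IOException
    // never shows up; with the real client it appears intermittently.
    public static StatsDClient create(boolean useNoOp) {
        if (useNoOp) {
            return new NoOpStatsDClient();
        }
        return new NonBlockingStatsDClientBuilder()
                .prefix("pipeline")      // placeholder prefix
                .hostname("localhost")   // placeholder agent host
                .port(8125)              // default DogStatsD UDP port
                .build();
    }
}
```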
I looked a bit into what JFFI is doing, and it appears the library tries to unpack the native library into two different locations: first in /tmp, and then in the current working directory if that fails. It glues the errors from the two attempts together in reverse order. The error about the non-writable directory would come from the attempt in the current directory, and the other exception in the traceback is this:
Caused by: java.nio.channels.ClosedByInterruptException
at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(Unknown Source)
at java.base/sun.nio.ch.FileChannelImpl.endBlocking(Unknown Source)
at java.base/sun.nio.ch.FileChannelImpl.size(Unknown Source)
at java.base/sun.nio.ch.FileChannelImpl.transferFrom(Unknown Source)
at com.kenai.jffi.internal.StubLoader.loadFromJar(StubLoader.java:392)
It appears that either Flink or something else is trying to interrupt the thread, and the JFFI library does not handle this gracefully. That would explain the transient nature of the error and why it happens on some deployments but not others. I'm not familiar enough with Flink to offer more targeted advice, but perhaps the statsd client initialization could be moved to a different phase in the task lifecycle, one that is not interrupted.
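One way to experiment with that idea, offered only as a sketch and not a verified fix: build the client on a short-lived helper thread, so that an interrupt delivered to the Flink task thread cannot land on the FileChannel copy inside jffi's StubLoader. The builder API below is the real java-dogstatsd-client one; the isolation pattern itself is an assumption on my part.

```java
import com.timgroup.statsd.NonBlockingStatsDClientBuilder;
import com.timgroup.statsd.StatsDClient;
import java.util.concurrent.atomic.AtomicReference;

public final class StatsDInit {
    // Construct the client on a dedicated thread. Interrupting the calling
    // (task) thread does not interrupt this helper, so the native-stub copy
    // cannot fail with ClosedByInterruptException because of that interrupt.
    public static StatsDClient buildIsolated(String host, int port) throws Exception {
        AtomicReference<StatsDClient> client = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();

        Thread init = new Thread(() -> {
            try {
                client.set(new NonBlockingStatsDClientBuilder()
                        .prefix("pipeline")   // placeholder prefix
                        .hostname(host)
                        .port(port)
                        .build());
            } catch (Throwable t) {
                failure.set(t);
            }
        }, "statsd-init");
        init.start();
        init.join();   // join() itself is interruptible; see the caveat below

        if (failure.get() != null) {
            throw new Exception("StatsD client initialization failed", failure.get());
        }
        return client.get();
    }
}
```

Caveat: join() can still be interrupted, in which case the caller gives up waiting, but the helper thread would finish unpacking the stub regardless; whether this actually avoids the error in your deployments is something you would have to verify.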
The statsD client is already initialized as an init step during job graph creation. I am not sure where else we could init it.
Hi!

We are using "com.datadoghq" % "java-dogstatsd-client" % "4.2.0" as an SBT dependency for our Flink application. Suddenly, during HPA rescaling or any general redeployment, we are seeing the jnr-related error quoted above from StatsDClientBuilder.

Flink version: 1.18
DatadogClient: 3.33.0

We have already tried upgrading to the latest 4.4.3, but no luck.

Thanks for any help