adoptium / adoptium-support

For end-user problems reported with our binary distributions
Apache License 2.0
45 stars 15 forks source link

Multithreaded Java program using LinkedBlockingQueue terminates unexpectedly on 32bit arm #609

Open ghost opened 1 year ago

ghost commented 1 year ago

Please provide a brief summary of the bug

On 32-bit ARM architectures (armhf and arm7l), OpenJDK Runtime Environment Temurin-17.0.4.1+1 can terminate unexpectedly. This is likely the same issue as #544 .

Please provide steps to reproduce where possible

  1. Save the attached file ReceivingService.java.txt.
  2. Rename the attachment to ReceivingService.java
  3. Run: javac ReceivingService.java
  4. Copy all *.class files to a 32-bit ARM machine.
  5. Connect to that machine.
  6. Run: nohup java ReceivingService > pi.log 2>&1 &
  7. Log out of the machine.
  8. Log back into the machine.
  9. Run: tail -f pi.log.
  10. Wait 24 hours.

ReceivingService.java.txt

Expected Results

The program runs indefinitely.

Actual Results

The Java Virtual Machine terminated in less than 24 hours. The last value of pi in the log file was:

3.1415808563240995673428105860538292970725403157835590268022700455132887564

What Java Version are you using?

openjdk version "17.0.4.1" 2022-08-12
OpenJDK Runtime Environment Temurin-17.0.4.1+1 (build 17.0.4.1+1)
OpenJDK Server VM Temurin-17.0.4.1+1 (build 17.0.4.1+1, mixed mode, sharing)

What is your operating system and platform?

$ uname -a
Linux HOSTNAME 4.14.170-2.0.1+ #1 SMP PREEMPT Tue Sep 27 18:33:20 UTC 2022 armv7l GNU/Linux

How did you install Java?

Downloaded the .tar.gz file and extracted it into /opt:

$ ls -la /opt/
total 16
drwxr-xr-x  4 root root 4096 Oct 21 22:59 .
drwxr-xr-x 22 root root 4096 Oct 21 22:59 ..
drwxr-xr-x  6 root root 4096 Oct 21 22:58 jdk-17.0.4.1+1-jre
lrwxrwxrwx  1 root root   18 Oct 21 22:59 jre -> jdk-17.0.4.1+1-jre
$ which java
/opt/jre/bin/java

Additional information

We're re-running the test on two different machines (32-bit ARM and AMD64) to see if the problem happens again.

Did it work before?

We may have seen what may be a related issue in JDK 11, as well, when using a D-Bus Java library.

Did you test with other Java versions?

See issue #544 for details about our attempts at testing with JDK 11. We could not reliably reproduce the issue with the code in that ticket. The code attached to this ticket, however, terminated unexpectedly on 32-bit ARM using JDK 17.

Relevant log output

No logs. No core dump. No heap dump. Java simply terminates.

karianna commented 1 year ago

@DaveJarvis We're just in release week and so 17.0.5 will be avaialable soon. Can you test that when it comes out and send the log again then?

ghost commented 1 year ago

We're just in release week

We're entering regression testing soon, so it may not be feasible for us to update to 17.0.5 for a while.

ghost commented 1 year ago

We've encountered the bug on JDK 17.0.4 and caught the stack trace this time.

2022-11-21T23:57:46.804595+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] Exception in thread "DBus-Signal-Receiver-1"
2022-11-21T23:57:46.810339+00:00 APP-val app: ERROR [DBus-Signal-Receiver-2, AppRepeater] Exception in thread "DBus-Signal-Receiver-2"
2022-11-21T23:57:46.811329+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] java.lang.NullPointerException: Cannot read field "item" because "<local2>" is null
2022-11-21T23:57:46.823741+00:00 APP-val app: ERROR [DBus-Signal-Receiver-3, AppRepeater] Exception in thread "DBus-Signal-Receiver-3"
2022-11-21T23:57:46.825536+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(Unknown Source)
2022-11-21T23:57:46.827412+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
2022-11-21T23:57:46.827865+00:00 APP-val app: ERROR [DBus-Signal-Receiver-4, AppRepeater] Exception in thread "DBus-Signal-Receiver-4"
2022-11-21T23:57:46.829711+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
2022-11-21T23:57:46.834601+00:00 APP-val app: ERROR [DBus-Signal-Receiver-5, AppRepeater] Exception in thread "DBus-Signal-Receiver-5"
2022-11-21T23:57:46.835439+00:00 APP-val app: ERROR [DBus-Signal-Receiver-6, AppRepeater] Exception in thread "DBus-Signal-Receiver-6"
2022-11-21T23:57:46.835971+00:00 APP-val app: ERROR [DBus-Signal-Receiver-7, AppRepeater] Exception in thread "DBus-Signal-Receiver-7"
2022-11-21T23:57:46.840623+00:00 APP-val app: ERROR [DBus-Signal-Receiver-8, AppRepeater] Exception in thread "DBus-Signal-Receiver-8"
2022-11-21T23:57:46.859099+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] Exception in thread "DBus-Signal-Receiver-9"
2022-11-21T23:57:46.860048+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2022-11-21T23:57:46.860373+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2022-11-21T23:57:46.860884+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] #011at java.base/java.lang.Thread.run(Unknown Source)
2022-11-21T23:57:46.861349+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] java.lang.NullPointerException: Cannot read field "next" because "<local1>" is null
2022-11-21T23:57:46.861793+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.util.concurrent.LinkedBlockingQueue.dequeue(Unknown Source)
2022-11-21T23:57:46.862291+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
2022-11-21T23:57:46.862715+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
2022-11-21T23:57:46.863181+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2022-11-21T23:57:46.863657+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2022-11-21T23:57:46.864193+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] #011at java.base/java.lang.Thread.run(Unknown Source)
ghost commented 1 year ago

From the log, it looks like there may be some concurrency issues with LinkedBlockingQueue.java on 32-bit ARMv7:

private E dequeue() {
  // assert takeLock.isHeldByCurrentThread();
  // assert head.item == null;
  Node<E> h = head;
  Node<E> first = h.next;
  h.next = h; // help GC
  head = first;
  E x = first.item;
  first.item = null;
  return x;
}

From:

2022-11-21T23:57:46.811329+00:00 APP-val app: ERROR [DBus-Signal-Receiver-1, AppRepeater] java.lang.NullPointerException: Cannot read field "item" because "<local2>" is null

That looks like dereferencing first.item is failing because first is null, meaning the following lines are problematic:

  Node<E> first = h.next;
  // ...
  first.item = null;

In what situations can h.next become null?

From:

2022-11-21T23:57:46.861349+00:00 APP-val app: ERROR [DBus-Signal-Receiver-9, AppRepeater] java.lang.NullPointerException: Cannot read field "next" because "<local1>" is null

In what situations can head become null? It looks like the following line is failing because head, and therefore h, is null:

  Node<E> first = h.next;

Maybe there's a difference between the concurrency locking mechanism on 32-bit ARM versus AMD/Intel (and/or 64-bit CPUs) that allows dequeue() and take() to be called when the head pointer is being manipulated?

See: https://github.com/hypfvieh/dbus-java/issues/172

jerboaa commented 1 year ago

Reported upstream as: https://bugs.openjdk.org/browse/JDK-8299086

ghost commented 1 year ago

@jerboaa , note that we don't have reliable reproduction steps and the steps shown in the attached file won't necessarily cause the issue to occur. All we know is that when using D-Bus Java in our application, at some point this NPE is triggered.

With JDK 11, we didn't really know what line of code was triggering the issue. With JDK 17, we have now pinpointed the problem.

Further, if you trace through https://github.com/hypfvieh/dbus-java/issues/172 you'll see that the D-Bus Java code tries to capture this particular NPE so that we can shut down our application:

.withRetryHandler((t, ex) -> {
  // A NullPointerException occurred; the JVM has entered an irrecoverable state.
  new ShutdownEvent(ex).publish();

  // Terminate the D-Bus thread's retry handler.
  return false;
})

The retry handler is where the NPE is encountered, but never caught. You can see the try/catch statement here:

https://github.com/hypfvieh/dbus-java/blob/7b4a311c99c8503f10a19a57a7c0470a9befe0e7/dbus-java-core/src/main/java/org/freedesktop/dbus/connections/ReceivingService.java#L115

The retry handler should be called when the NPE happens, but isn't:

https://github.com/hypfvieh/dbus-java/blob/7b4a311c99c8503f10a19a57a7c0470a9befe0e7/dbus-java-core/src/main/java/org/freedesktop/dbus/connections/ReceivingService.java#L133

It's quite puzzling. The JVM encounters an NPE in the queue, the code that called the queue has a try/catch block, but the corresponding catch is never entered. This leads to runaway logging in our application that can only be mitigated by monitoring the log file for NullPointerException using an external process.

If possible, would you mind updating https://bugs.openjdk.org/browse/JDK-8299086 to clarify that the attached steps haven't been used to reproduce the problem, but demonstrate what our code appears to be doing to cause the issue? (We've run the attached steps for weeks on end without hitting the snag. Our application, on the other hand, has encountered the problem a few times now.)

github-actions[bot] commented 1 year ago

We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable. It will be closed soon unless the stale label is removed by a committer, or a new comment is made.

github-actions[bot] commented 1 year ago

We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable. It will be closed soon unless the stale label is removed by a committer, or a new comment is made.

ghost commented 1 year ago

It's still an issue, but I suspect that by the time we get around to being able to replicate it reliably, we'll have upgraded from 32-bit to 64-bit hardware.

ghost commented 1 year ago

FYI, in two days I will be locked out of my account once GitHub begins enforcing MFA. Sorry I won't be able to help any further.