Open Ostrzyciel opened 1 month ago
Do you have a reproducible - in java! - example?
The change at 8654c07 is for a reason - see GH-2675. It is supposed to hold up concurrent initialization because that causes an NPE.
For scalatest, it would be better to explicitly call JenaSystem.init
before things go parallel.
Anecdotally I've definitely noticed similar issues with upgrading to 5.2.0, generally this is fixed by adding an explicit JenaSystem.init() call into each test class/entry point to control when initialisation happens
@afs I'll see what I can do to make a reproducible example. I think it would have to be just a few threads spamming JenaSystem.init() until it jams...
I'm already running JenaSystem.init() in beforeAll of each test case, but I think scalatest runs these in parallel. I'll see if I can run something before any test case is executed.
Ok, got it.
@afs This reliably gets stuck every time I run it, with a very similar thread dump to what I posted above: https://github.com/Ostrzyciel/jena-default-graphs/blob/issue-2787/jena-default-graphs/src/main/java/org/example/Main.java
I'm not sure if this qualifies as a "bug", but it's surely pretty confusing to me.
P.S.: the reproduction code does not get stuck with Jena 5.1.0. Only with Jena 5.2.0.
Ok, got it.
@afs This reliably gets stuck every time I run it, with a very similar thread dump to what I posted above: https://github.com/Ostrzyciel/jena-default-graphs/blob/issue-2787/jena-default-graphs/src/main/java/org/example/Main.java
I'm not sure if this qualifies as a "bug", but it's surely pretty confusing to me.
We are encountering a chicken-and-egg problem with the ModelFactory
and JenaSystem.init()
.
The ModelFactory
has a static constructor call to JenaSystem.init();
. This means that ModelFactory
assumes the JenaSystem
has already been initialized and requires JenaSystem.init()
to be called first.
However, during the JenaSystem.init()
process, some code is executed that calls ModelFactory.createDefaultModel()
, which in turn assumes that ModelFactory
is already available, even though JenaSystem.init()
has not yet completed.
This circular dependency between JenaSystem
initialization and the ModelFactory
is causing the issue.
Unfortunately, I don’t yet have a clue how to fix it.
The code used before Jena 5.2 allowed only the first thread to start the initialization, but all subsequent threads did not wait because the JenaSystem
pretended to have already been initialized, even though that might not have been the case yet:
With Jena 5.2, we can now be sure that everything has been properly initialized.
Anecdotally I've definitely noticed similar issues with upgrading to 5.2.0, generally this is fixed by adding an explicit JenaSystem.init() call into each test class/entry point to control when initialisation happens
Test can always problematic because they mayt jump into the middle of the system.
Java synchronized blocks in Java are reentrant.
So it's two parallel threads, not a same-thread deadly embrace.
Ok, got it.
@afs This reliably gets stuck every time I run it, with a very similar thread dump to what I posted above: https://github.com/Ostrzyciel/jena-default-graphs/blob/issue-2787/jena-default-graphs/src/main/java/org/example/Main.java
I'm not sure if this qualifies as a "bug", but it's surely pretty confusing to me.
Thanks for the example - it can dealock here as well, and with less threads.
P.S.: the reproduction code does not get stuck with Jena 5.1.0. Only with Jena 5.2.0 It really is unsafe in 5.1.0 because the second thread jumps over initialization - that's why the NPE in #2675 happened.
Jena.System.init
needs to be cheap - otherwise it would be easier to make safe!
In an attempt to make the possible blockage and cause visible to the user, I may have inadvertently fixed it:
private static volatile boolean initialized = false ;
private static Semaphore initializationSemaphore = new Semaphore(1);
private static ThreadLocal<Boolean> isInitializing = ThreadLocal.withInitial(() -> false);
public static void init() {
// Once jena is initialized, all calls are an immediate return.
if ( initialized )
return ;
// if the same thread is calling init(), while the initialization is in progress, return immediately
if(isInitializing.get())
return ;
isInitializing.set(true); // set the flag to true to indicate that the initialization is in progress
try {
// Only one thread can initialize Jena.
if(initializationSemaphore.tryAcquire(5, TimeUnit.SECONDS)) {
if ( initialized )
return ;
setup();
if ( DEBUG_INIT )
singleton.debug(DEBUG_INIT);
singleton.initialize();
singleton.debug(false);
// Last so overlapping initialization waits on the synchronized
initialized = true;
}
} catch (InterruptedException e) {
throw new RuntimeException("Timeout while waiting for semaphore to initialize JenaSystem. Please call JenaSystem.init() before working with multiple threads.", e);
} finally {
initializationSemaphore.release();
isInitializing.remove();
}
}
@arne-bdt Throw up a PR for us to review then
@arne-bdt Throw up a PR for us to review then
Hmpf... I've started a new branch, but it's not working the way I saw it before... :-( --> No solution in sight yet.
One thing to consider (as well) is to remove the hidden model in all vocabularies and change schemagen.
Replace m_model.
calls with ResourceFactory.
and maybe class initializer for JenaSystem.init
.
A collection of constants don't need to be created off of a model. Even if want keep the hidden model it can be a minimal one, not ModelFactory.createDefaultModel
.
This shouldn't require external vocabularies to make the same change.
One thing to consider (as well) is to remove the hidden model in all vocabularies and change schemagen. Replace
m_model.
calls withResourceFactory.
and maybe class initializer forJenaSystem.init
.A collection of constants don't need to be created off of a model. Even if want keep the hidden model it can be a minimal one, not
ModelFactory.createDefaultModel
.This shouldn't require external vocabularies to make the same change.
It is not only the ModelFactory. This is the complete list of classes, where static { JenaSystem.init(); }
must be removed, to make the code sample work:
org.apache.jena.graph.NodeFactory
org.apache.jena.rdf.model.ModelFactory
org.apache.jena.rdf.model.impl.ModelCom
org.apache.jena.shared.PrefixMapping.Factory
This circular dependency between JenaSystem initialization and the ModelFactory is causing the issue.
yes - specifically, it is the class initialization lock being held by ModelFactory and any locking in JenaSystem.
With two threads, the deadlock is:
Thread1 is the thread calling ModelFactory.createDefaultModel Thread2 is the thread calling JenaSystem.init directly.
Thread1: lock order is take the ModelFactory class init lock, then take the JenaSystem sync lock. Thread2: lock order is take the JenaSystem sync then ModelFactory class init lock.
Prior to the 8654c07 change, thread2 would not take the ModelFactory class init lock inside synchonized but continued executing on partial initialization which works mostly .. except when it doesn't and it causes an NPE #2675.
It's a problem when using Jena as a library because of the multiple entry points doing class initlaization.
FWIW JenaSystem.init came in part because of class initialization circularity seeing infrequent uninitialized data even on one thread due to different class initialization orders.
Due to the complexity of the initialization process and the presence of circular dependencies, resolving the deadlock issue is not straightforward. If we can't clean up the initialization process and eliminate these circular dependencies, we're essentially left with two problematic variants:
NullPointerException (NPE) (prior to the https://github.com/apache/jena/commit/8654c07f23a892c03a7367981d814c3c8b094f62 change): If the user forgets to call JenaSystem.init() from a single-threaded context before using the library in a multithreaded environment, the application may encounter a NullPointerException. This can occur because certain components of the library remain uninitialized, leading to unexpected null references when accessed by multiple threads.
Deadlock (since Jena 5.2.0): Alternatively, if the library is used in a multithreaded context without proper initialization, a deadlock may occur due to threads waiting on class initialization locks and synchronized methods. This is particularly difficult to diagnose because the application may simply hang without any clear indication of the cause.
For the Deadlock variant, I implemented a fail-fast approach that throws an informative exception. (See update in PR #2792.)
I think the solution is to go back to #2675 and look to fix the class-initializer problem. That probably needs investigating anyway.
Then there is no occurence of possible alternative orders of taking locks which is the deadlock.
The #2675 / 8654c07 is a non-concurrency ordering problem in a single thread - i.e. a cyclic class initialization.
Requiring usage to only have single threaded initialization could be quite a problem in environments that have multi-threading and then use Jena as a library. It is not always so easy to have application code in the single-threaded startup phase.
I think requiring single threaded initialization will break existing usage. Until we agree that it is the only option, I don't think we should go that way.
Basically every usage, that would trigger an exeption with PR https://github.com/apache/jena/pull/2792, will end up in a deadlock with Jena 5.2.
What do we do with Jena 5.2 then? Do we withdraw the publication or do we issue a clear warning?
Yes, but the usage is unusual (scalatest).
We have workaround for users that need 5.2 and can put code in the single threaded stage and it is something that's been encouraged for quite awhile.
If fixing the single-threaded #2675 another way can be done, we do not need to have users' change.
If, in investigating #2675, we discover it can not be done, then we may have to require user change but that's a big step.
The practical difficulty is that any change is one we can't test completely because there is too large a variety of uses. (And also that people don't test pre-releases.)
What do we do with Jena 5.2 then? Do we withdraw the publication or do we issue a clear warning?
ASF, and open source generally, doesn't remove releases. We can put out a new version.
I am sorry that PR 8654c07 broke some usages.
Tentatively, GH-2675 seems to be fixed by just adding the RDFLinkHTTP without the JenaSystem.init change moving initlialized = true;
That is not full confirmation of a fix yet but it does change the behaviour of the thread undertaking the initialization because it no longer calls makes a nested call to singleton.initialize()
.
Addendum: that leaves the possibility of a thread seeing "initialized = true" and skipping the synchronized block, proceeding without initialization having completed.
None of the reports have been about this situation and it may be that it is hidden (initialized = true not having been written back even though volatile until the exit from synchronized which is a memory happens-before point) or impossible because of class initialization caused by singleton.initialize()
will hold the class initialization lock. It does suggest a robust solution with 2 volatiles (or AtomicBoolean
for better control) "before" and "after" the initialization work step.
... that leaves the possibility of a thread seeing "initialized = true" and skipping the synchronized block, proceeding without initialization having completed.
I thought that was the important part, that needed fixing by moving the initialized = true
.
--> So I updated my branch by moving the initialized = true
back to its old position and removed the exception:
https://github.com/apache/jena/compare/main...arne-bdt:jena:GH-2787-Deadlock_in_Jena_System_init
The test code does not cause deadlocks any more.
None of the reports have been about this situation...
Sounds great. I also tried to reproduce such a situation with the code that reproduces the deadlock. I could not convince a static field from the vocabulary to return null
.
Unfortunately I have no idea on how to test if issue https://github.com/apache/jena/issues/2675 still works, the PR https://github.com/apache/jena/pull/2677 does not seem to contain a new dedicated test.
It is difficult to test for because Jena is likely initlialized and any test will pass for that reason -- it isn't a test of class initialization order and we can not rely on the test.
If this is run from public static void main
, and there is no JenaSystem.ini
t call in RDFLinkHTTPBuilder
, it should go wrong.
public static void mainNPE2675() {
System.out.println("GH-2675: RDFLinkHTTP NPE");
try (RDFConnection conn = RDFConnectionFuseki.service("http://localhost:3030/ds").build()) {
}
System.out.println("DONE");
}
It is difficult to test for because Jena is likely initlialized and any test will pass for that reason -- it isn't a test of class initialization order and we can not rely on the test.
If this is run from
public static void main
, and there is noJenaSystem.ini
t call inRDFLinkHTTPBuilder
, it should go wrong.public static void mainNPE2675() { System.out.println("GH-2675: RDFLinkHTTP NPE"); try (RDFConnection conn = RDFConnectionFuseki.service("http://localhost:3030/ds").build()) { } System.out.println("DONE"); }
Great: It runs perfectly. But when static { JenaSystem.init(); }
is removed in RDFLinkHTTPBuilder, it fails with:
java.lang.ExceptionInInitializerError ... Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.jena.rdf.model.Resource.asNode()" because "org.apache.jena.vocabulary.RDF.Alt" is null
In my branch https://github.com/apache/jena/compare/main...arne-bdt:jena:GH-2787-Deadlock_in_Jena_System_init, I observerd something strage:
I thought, I could make the solution more robust by using static AtomicBoolean initialized = new AtomicBoolean(false);
instead of static boolean initialized = false;
But instead, this change reliably takes me straight back to the reproducible deadlock situation.
So I reverted that experment.
Is the LazyHolder
any different to the 5.1.0 approach in SystemJena.init
? A second thread coming is not waiting for initlialization to complete and can touch classes in a different order. (we don't have a reproducible example.)
Is the
LazyHolder
approach any different from the 5.1.0 approach inSystemJena.init
? A second thread coming in is not waiting for initialization to complete and may access classes in a different order. (We don't have a reproducible example.)
The LazyHolder
replaces only the synchronized(JenaSystem.class)
mechanism. Unlike the synchronized block, the JVM ensures that the LazyHolder
is only accessed once, so there’s no need for double-checking with if (initialized) return;
. This makes it both efficient and more robust.
When running org.apache.jena.sys.TestJenaSystem#testFirstInitParallel
directly from the IDE (without prior initialization of JenaSystem
), it’s quite straightforward to observe the exact behavior, and the test consistently produces reliable results. So far, I haven’t encountered any non-reproducible behavior (e.g., due to timing issues).
Which kind of reproducible example are you missing?
However, I have not yet figured out how to integrate this test to run it with Maven in a separate JVM instance to ensure that the JenaSystem
has not been initialised.
For more information on the pattern, you can refer to the documentation here and here.
However, I have not yet figured out how to integrate this test to run it with Maven in a separate JVM instance to ensure that the JenaSystem has not been initialised.
Update: Using the JMH framework, it was easy to implement a test that always starts with a fresh JVM.
On that basis, I could run many iterations and stress it even more with threads. --> The solution in https://github.com/apache/jena/compare/main...arne-bdt:jena:GH-2787-Deadlock_in_Jena_System_init seems to be stable.
The
LazyHolder
replaces only thesynchronized(JenaSystem.class)
mechanism. Unlike the synchronized block, the JVM ensures that theLazyHolder
is only accessed once, so there’s no need for double-checking withif (initialized) return;
. This makes it both efficient and more robust.
(Because this is critical in Jena, forgive me for wanting to get into the details.)
I don't quite understand "more robust" here. The second thread is allowed to continue once initialized is set true. LazyHolder is (I don't have a solution for that and it may be a non-issue because the second thread bumps into an initializing class.)
Efficiency - in our case, there isn't much different because the volatile is still there.
(Because this is critical in Jena, forgive me for wanting to get into the details.)
I don't quite understand "more robust" here. The second thread is allowed to continue once initialized is set true. LazyHolder is (I don't have a solution for that and it may be a non-issue because the second thread bumps into an initializing class.)
@afs This pattern is a "cleaner" approach to static initialization, as it delegates synchronization responsibility to the JVM. It also guarantees the code executes only once. To replicate what the JVM does for us, we currently use synchronized
and a secondary check within the synchronized block for any thread that reaches that point and is waiting:
synchronized(JenaSystem.class) {
if ( initialized )
return ;
...
}
I understand that introducing this pattern may have caused some confusion, as it does not directly contribute to solving the problem. The actual fix was simply moving initialized = true;
back to its original position.
Since I prefer to clean up code whenever I need to modify it, I introduced the pattern. We are early in the Jena 5.3 cycle, and we have code to reproduce the threading issues along with solid tests. I ran numerous test iterations that demonstrated stable behavior. Given this, I valued the benefits of the initialization pattern over the potential risk of a different runtime behavior.
I have my experience with threading and I like challenges. So I jumped on this bug, which could also be relevant for some of my work projects. But now I am at a loss.
Restoring the early return from JenaSystem.init() as implemented prior to commit 8654c07 does not solve the deadlock bug.
It only reduces the time-window and thus the likelihood of an occurrence. The developed tests clearly show that the problem persists. So here we are again with two options:
Class initlializtion spec: https://docs.oracle.com/javase/specs/jls/se21/html/jls-12.html#jls-12.4.2
The LazyHolder
has an advantage here because it is part of the overall class initialization process, the usual way applications using Jena as a library will cause JenaSystem.init()
to be called.
Following @arne-bdt's proposal, PR #2807 puts in the lazy holder pattern and moves the setting initialization to the start of the process. It passes the deadlock test case at he top of this issue.
If we all agree, putting that in unlocks further testing of scalatest in case more issues turn up.
The PR is an improvement - testing (in the build and in usage) can follow.
@afs Thank you all for putting all the effort in solving this! If the code passes the deadlock case, then I'm pretty sure it's alright. I do not fully understand how it works, so I can't comment on the details of the solution.
Version
5.2.0
What happened?
After upgrading to Jena 5.2.0, I've started noticing that some of my CI test runs were randomly just getting stuck. After a debugging session with sbt on my machine, I think this might be a deadlock in the initialization code. See the attached thread dump.
The first thread in the dump is within JenaSystem.init(), so it's doing something to initialize Jena. It's stuck on this line: https://github.com/apache/jena/blob/6443abda6e2717b95b05c45515817584e93ef244/jena-core/src/main/java/org/apache/jena/vocabulary/DC_11.java#L29
To run this line, JVM has to run the static initializer of ModelFactory, which includes a call to JenaSystem.init(): https://github.com/apache/jena/blob/6443abda6e2717b95b05c45515817584e93ef244/jena-core/src/main/java/org/apache/jena/rdf/model/ModelFactory.java#L47-L49
Oops?
The other thread in the dump is just waiting on the first one.
I'm not sure how the code got in this state, but I'm using sbt and scalatest, which absolutely love making lots of parallel calls to code that maybe shouldn't be called in parallel. Maybe that is the issue, but I had no problems with Jena 5.1.0.
I think this issue may be related to this change: https://github.com/apache/jena/commit/8654c07f23a892c03a7367981d814c3c8b094f62
Relevant output and stacktrace
Are you interested in making a pull request?
This seems to be above my pay grade. No idea what's happening here.