apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0
1.1k stars 650 forks source link

Deadlock in JenaSystem.init() ? #2787

Open Ostrzyciel opened 3 days ago

Ostrzyciel commented 3 days ago

Version

5.2.0

What happened?

After upgrading to Jena 5.2.0, I've started noticing that some of my CI test runs were randomly just getting stuck. After a debugging session with sbt on my machine, I think this might be a deadlock in the initialization code. See the attached thread dump.

The first thread in the dump is within JenaSystem.init(), so it's doing something to initialize Jena. It's stuck on this line: https://github.com/apache/jena/blob/6443abda6e2717b95b05c45515817584e93ef244/jena-core/src/main/java/org/apache/jena/vocabulary/DC_11.java#L29

To run this line, JVM has to run the static initializer of ModelFactory, which includes a call to JenaSystem.init(): https://github.com/apache/jena/blob/6443abda6e2717b95b05c45515817584e93ef244/jena-core/src/main/java/org/apache/jena/rdf/model/ModelFactory.java#L47-L49

Oops?

The other thread in the dump is just waiting on the first one.

I'm not sure how the code got in this state, but I'm using sbt and scalatest, which absolutely love making lots of parallel calls to code that maybe shouldn't be called in parallel. Maybe that is the issue, but I had no problems with Jena 5.1.0.

I think this issue may be related to this change: https://github.com/apache/jena/commit/8654c07f23a892c03a7367981d814c3c8b094f62

Relevant output and stacktrace

"pool-18-thread-17@31723" prio=5 tid=0x161 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
     blocks pool-18-thread-5@31711
     blocks pool-18-thread-8@31714
     blocks pool-18-thread-9@31715
      at org.apache.jena.vocabulary.DC_11.<clinit>(DC_11.java:29)
      at org.apache.jena.shared.PrefixMapping.<clinit>(PrefixMapping.java:230)
      at org.apache.jena.sparql.ARQConstants.<clinit>(ARQConstants.java:120)
      at org.apache.jena.query.ARQ.init(ARQ.java:652)
      - locked <0x7bff> (a java.lang.Object)
      at org.apache.jena.sparql.system.InitARQ.start(InitARQ.java:29)
      at org.apache.jena.base.module.Subsystem.lambda$initialize$1(Subsystem.java:117)
      at org.apache.jena.base.module.Subsystem$$Lambda$12363/0x00007fe8d9c5a6f0.accept(Unknown Source:-1)
      at java.util.ArrayList.forEach(ArrayList.java:1511)
      at org.apache.jena.base.module.Subsystem.forEach(Subsystem.java:193)
      - locked <0x7bf9> (a org.apache.jena.base.module.Subsystem)
      - locked <0x72e> (a java.lang.Class)
      at org.apache.jena.base.module.Subsystem.forEach(Subsystem.java:169)
      at org.apache.jena.base.module.Subsystem.initialize(Subsystem.java:115)
      at org.apache.jena.sys.JenaSystem.init(JenaSystem.java:90)
      - locked <0x73c> (a java.lang.Class)
      at org.apache.jena.riot.RDFDataMgr.<clinit>(RDFDataMgr.java:107)

"pool-18-thread-8@31714" prio=5 tid=0x158 nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
     waiting for pool-18-thread-17@31723 to release lock on <0x73c> (a java.lang.Class)
      at org.apache.jena.sys.JenaSystem.init(JenaSystem.java:85)
      at org.apache.jena.rdf.model.ModelFactory.<clinit>(ModelFactory.java:49)

Are you interested in making a pull request?

This seems to be above my pay grade. No idea what's happening here.

afs commented 3 days ago

Do you have a reproducible - in java! - example?

The change at 8654c07 is for a reason - see GH-2675. It is supposed to hold up concurrent initialization because that causes an NPE.

For scalatest, it would be better to explicitly call JenaSystem.init before things go parallel.

rvesse commented 3 days ago

Anecdotally I've definitely noticed similar issues with upgrading to 5.2.0, generally this is fixed by adding an explicit JenaSystem.init() call into each test class/entry point to control when initialisation happens

Ostrzyciel commented 3 days ago

@afs I'll see what I can do to make a reproducible example. I think it would have to be just a few threads spamming JenaSystem.init() until it jams...

I'm already running JenaSystem.init() in beforeAll of each test case, but I think scalatest runs these in parallel. I'll see if I can run something before any test case is executed.

Ostrzyciel commented 3 days ago

Ok, got it.

@afs This reliably gets stuck every time I run it, with a very similar thread dump to what I posted above: https://github.com/Ostrzyciel/jena-default-graphs/blob/issue-2787/jena-default-graphs/src/main/java/org/example/Main.java

I'm not sure if this qualifies as a "bug", but it's surely pretty confusing to me.

Ostrzyciel commented 3 days ago

P.S.: the reproduction code does not get stuck with Jena 5.1.0. Only with Jena 5.2.0.

arne-bdt commented 3 days ago

Ok, got it.

@afs This reliably gets stuck every time I run it, with a very similar thread dump to what I posted above: https://github.com/Ostrzyciel/jena-default-graphs/blob/issue-2787/jena-default-graphs/src/main/java/org/example/Main.java

I'm not sure if this qualifies as a "bug", but it's surely pretty confusing to me.

We are encountering a chicken-and-egg problem with the ModelFactory and JenaSystem.init().

The ModelFactory has a static constructor call to JenaSystem.init();. This means that ModelFactory assumes the JenaSystem has already been initialized and requires JenaSystem.init() to be called first. image

However, during the JenaSystem.init() process, some code is executed that calls ModelFactory.createDefaultModel(), which in turn assumes that ModelFactory is already available, even though JenaSystem.init() has not yet completed. image

This circular dependency between JenaSystem initialization and the ModelFactory is causing the issue.

Unfortunately, I don’t yet have a clue how to fix it.

arne-bdt commented 3 days ago

The code used before Jena 5.2 allowed only the first thread to start the initialization, but all subsequent threads did not wait because the JenaSystem pretended to have already been initialized, even though that might not have been the case yet:

image

With Jena 5.2, we can now be sure that everything has been properly initialized.

afs commented 3 days ago

Anecdotally I've definitely noticed similar issues with upgrading to 5.2.0, generally this is fixed by adding an explicit JenaSystem.init() call into each test class/entry point to control when initialisation happens

Test can always problematic because they mayt jump into the middle of the system.

afs commented 3 days ago

Java synchronized blocks in Java are reentrant.

So it's two parallel threads, not a same-thread deadly embrace.

afs commented 3 days ago

Ok, got it.

@afs This reliably gets stuck every time I run it, with a very similar thread dump to what I posted above: https://github.com/Ostrzyciel/jena-default-graphs/blob/issue-2787/jena-default-graphs/src/main/java/org/example/Main.java

I'm not sure if this qualifies as a "bug", but it's surely pretty confusing to me.

Thanks for the example - it can dealock here as well, and with less threads.

P.S.: the reproduction code does not get stuck with Jena 5.1.0. Only with Jena 5.2.0 It really is unsafe in 5.1.0 because the second thread jumps over initialization - that's why the NPE in #2675 happened.

Jena.System.init needs to be cheap - otherwise it would be easier to make safe!

arne-bdt commented 3 days ago

In an attempt to make the possible blockage and cause visible to the user, I may have inadvertently fixed it:

private static volatile boolean initialized = false ;
private static Semaphore initializationSemaphore = new Semaphore(1);
private static ThreadLocal<Boolean> isInitializing = ThreadLocal.withInitial(() -> false);

public static void init() {
    // Once jena is initialized, all calls are an immediate return.
    if ( initialized )
        return ;

    // if the same thread is calling init(), while the initialization is in progress, return immediately
    if(isInitializing.get())
        return ;

    isInitializing.set(true); // set the flag to true to indicate that the initialization is in progress

    try {
        // Only one thread can initialize Jena.
        if(initializationSemaphore.tryAcquire(5, TimeUnit.SECONDS)) {
            if ( initialized )
                return ;
            setup();
            if ( DEBUG_INIT )
                singleton.debug(DEBUG_INIT);
            singleton.initialize();
            singleton.debug(false);
            // Last so overlapping initialization waits on the synchronized
            initialized = true;
        }
    } catch (InterruptedException e) {
        throw new RuntimeException("Timeout while waiting for semaphore to initialize JenaSystem. Please call JenaSystem.init() before working with multiple threads.", e);
    } finally {
        initializationSemaphore.release();
        isInitializing.remove();
    }
}
rvesse commented 3 days ago

@arne-bdt Throw up a PR for us to review then

arne-bdt commented 3 days ago

@arne-bdt Throw up a PR for us to review then

Hmpf... I've started a new branch, but it's not working the way I saw it before... :-( --> No solution in sight yet.

afs commented 3 days ago

One thing to consider (as well) is to remove the hidden model in all vocabularies and change schemagen. Replace m_model. calls with ResourceFactory. and maybe class initializer for JenaSystem.init.

A collection of constants don't need to be created off of a model. Even if want keep the hidden model it can be a minimal one, not ModelFactory.createDefaultModel.

This shouldn't require external vocabularies to make the same change.

arne-bdt commented 3 days ago

One thing to consider (as well) is to remove the hidden model in all vocabularies and change schemagen. Replace m_model. calls with ResourceFactory. and maybe class initializer for JenaSystem.init.

A collection of constants don't need to be created off of a model. Even if want keep the hidden model it can be a minimal one, not ModelFactory.createDefaultModel.

This shouldn't require external vocabularies to make the same change.

It is not only the ModelFactory. This is the complete list of classes, where static { JenaSystem.init(); } must be removed, to make the code sample work:

afs commented 2 days ago

This circular dependency between JenaSystem initialization and the ModelFactory is causing the issue.

yes - specifically, it is the class initialization lock being held by ModelFactory and any locking in JenaSystem.

With two threads, the deadlock is:

Thread1 is the thread calling ModelFactory.createDefaultModel Thread2 is the thread calling JenaSystem.init directly.

Thread1: lock order is take the ModelFactory class init lock, then take the JenaSystem sync lock. Thread2: lock order is take the JenaSystem sync then ModelFactory class init lock.

Prior to the 8654c07 change, thread2 would not take the ModelFactory class init lock inside synchonized but continued executing on partial initialization which works mostly .. except when it doesn't and it causes an NPE #2675.

It's a problem when using Jena as a library because of the multiple entry points doing class initlaization.

FWIW JenaSystem.init came in part because of class initialization circularity seeing infrequent uninitialized data even on one thread due to different class initialization orders.

arne-bdt commented 2 days ago

Due to the complexity of the initialization process and the presence of circular dependencies, resolving the deadlock issue is not straightforward. If we can't clean up the initialization process and eliminate these circular dependencies, we're essentially left with two problematic variants:

  1. NullPointerException (NPE) (prior to the https://github.com/apache/jena/commit/8654c07f23a892c03a7367981d814c3c8b094f62 change): If the user forgets to call JenaSystem.init() from a single-threaded context before using the library in a multithreaded environment, the application may encounter a NullPointerException. This can occur because certain components of the library remain uninitialized, leading to unexpected null references when accessed by multiple threads.

  2. Deadlock (since Jena 5.2.0): Alternatively, if the library is used in a multithreaded context without proper initialization, a deadlock may occur due to threads waiting on class initialization locks and synchronized methods. This is particularly difficult to diagnose because the application may simply hang without any clear indication of the cause.

For the Deadlock variant, I implemented a fail-fast approach that throws an informative exception. (See update in PR #2792.)

afs commented 2 days ago

I think the solution is to go back to #2675 and look to fix the class-initializer problem. That probably needs investigating anyway.

Then there is no occurence of possible alternative orders of taking locks which is the deadlock.

The #2675 / 8654c07 is a non-concurrency ordering problem in a single thread - i.e. a cyclic class initialization.

Requiring usage to only have single threaded initialization could be quite a problem in environments that have multi-threading and then use Jena as a library. It is not always so easy to have application code in the single-threaded startup phase.

arne-bdt commented 2 days ago

I think requiring single threaded initialization will break existing usage. Until we agree that it is the only option, I don't think we should go that way.

Basically every usage, that would trigger an exeption with PR https://github.com/apache/jena/pull/2792, will end up in a deadlock with Jena 5.2.

What do we do with Jena 5.2 then? Do we withdraw the publication or do we issue a clear warning?

afs commented 2 days ago

Yes, but the usage is unusual (scalatest).

We have workaround for users that need 5.2 and can put code in the single threaded stage and it is something that's been encouraged for quite awhile.

If fixing the single-threaded #2675 another way can be done, we do not need to have users' change.

If, in investigating #2675, we discover it can not be done, then we may have to require user change but that's a big step.

The practical difficulty is that any change is one we can't test completely because there is too large a variety of uses. (And also that people don't test pre-releases.)

What do we do with Jena 5.2 then? Do we withdraw the publication or do we issue a clear warning?

ASF, and open source generally, doesn't remove releases. We can put out a new version.

I am sorry that PR 8654c07 broke some usages.

afs commented 2 days ago

Tentatively, GH-2675 seems to be fixed by just adding the RDFLinkHTTP without the JenaSystem.init change moving initlialized = true;

That is not full confirmation of a fix yet but it does change the behaviour of the thread undertaking the initialization because it no longer calls makes a nested call to singleton.initialize().

Addendum: that leaves the possibility of a thread seeing "initialized = true" and skipping the synchronized block, proceeding without initialization having completed.

None of the reports have been about this situation and it may be that it is hidden (initialized = true not having been written back even though volatile until the exit from synchronized which is a memory happens-before point) or impossible because of class initialization caused by singleton.initialize() will hold the class initialization lock. It does suggest a robust solution with 2 volatiles (or AtomicBoolean for better control) "before" and "after" the initialization work step.

arne-bdt commented 1 day ago

... that leaves the possibility of a thread seeing "initialized = true" and skipping the synchronized block, proceeding without initialization having completed.

I thought that was the important part, that needed fixing by moving the initialized = true.

--> So I updated my branch by moving the initialized = true back to its old position and removed the exception: https://github.com/apache/jena/compare/main...arne-bdt:jena:GH-2787-Deadlock_in_Jena_System_init The test code does not cause deadlocks any more.

None of the reports have been about this situation...

Sounds great. I also tried to reproduce such a situation with the code that reproduces the deadlock. I could not convince a static field from the vocabulary to return null .

Unfortunately I have no idea on how to test if issue https://github.com/apache/jena/issues/2675 still works, the PR https://github.com/apache/jena/pull/2677 does not seem to contain a new dedicated test.

afs commented 1 day ago

It is difficult to test for because Jena is likely initlialized and any test will pass for that reason -- it isn't a test of class initialization order and we can not rely on the test.

If this is run from public static void main, and there is no JenaSystem.init call in RDFLinkHTTPBuilder, it should go wrong.

   public static void mainNPE2675() {
        System.out.println("GH-2675: RDFLinkHTTP NPE");
        try (RDFConnection conn = RDFConnectionFuseki.service("http://localhost:3030/ds").build()) {

        }
        System.out.println("DONE");
    }
arne-bdt commented 1 day ago

It is difficult to test for because Jena is likely initlialized and any test will pass for that reason -- it isn't a test of class initialization order and we can not rely on the test.

If this is run from public static void main, and there is no JenaSystem.init call in RDFLinkHTTPBuilder, it should go wrong.

   public static void mainNPE2675() {
        System.out.println("GH-2675: RDFLinkHTTP NPE");
        try (RDFConnection conn = RDFConnectionFuseki.service("http://localhost:3030/ds").build()) {

        }
        System.out.println("DONE");
    }

Great: It runs perfectly. But when static { JenaSystem.init(); } is removed in RDFLinkHTTPBuilder, it fails with: java.lang.ExceptionInInitializerError ... Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.jena.rdf.model.Resource.asNode()" because "org.apache.jena.vocabulary.RDF.Alt" is null

arne-bdt commented 1 day ago

In my branch https://github.com/apache/jena/compare/main...arne-bdt:jena:GH-2787-Deadlock_in_Jena_System_init, I observerd something strage: I thought, I could make the solution more robust by using static AtomicBoolean initialized = new AtomicBoolean(false); instead of static boolean initialized = false;

But instead, this change reliably takes me straight back to the reproducible deadlock situation.

So I reverted that experment.

afs commented 1 day ago

Is the LazyHolder any different to the 5.1.0 approach in SystemJena.init? A second thread coming is not waiting for initlialization to complete and can touch classes in a different order. (we don't have a reproducible example.)

arne-bdt commented 1 day ago

Is the LazyHolder approach any different from the 5.1.0 approach in SystemJena.init? A second thread coming in is not waiting for initialization to complete and may access classes in a different order. (We don't have a reproducible example.)

The LazyHolder replaces only the synchronized(JenaSystem.class) mechanism. Unlike the synchronized block, the JVM ensures that the LazyHolder is only accessed once, so there’s no need for double-checking with if (initialized) return;. This makes it both efficient and more robust.

When running org.apache.jena.sys.TestJenaSystem#testFirstInitParallel directly from the IDE (without prior initialization of JenaSystem), it’s quite straightforward to observe the exact behavior, and the test consistently produces reliable results. So far, I haven’t encountered any non-reproducible behavior (e.g., due to timing issues). Which kind of reproducible example are you missing?

However, I have not yet figured out how to integrate this test to run it with Maven in a separate JVM instance to ensure that the JenaSystem has not been initialised.

For more information on the pattern, you can refer to the documentation here and here.

arne-bdt commented 1 day ago

However, I have not yet figured out how to integrate this test to run it with Maven in a separate JVM instance to ensure that the JenaSystem has not been initialised.

Update: Using the JMH framework, it was easy to implement a test that always starts with a fresh JVM.

On that basis, I could run many iterations and stress it even more with threads. --> The solution in https://github.com/apache/jena/compare/main...arne-bdt:jena:GH-2787-Deadlock_in_Jena_System_init seems to be stable.