TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug [LUCENE-3235]

asfimport commented 13 years ago

Not sure what's going on yet... but under Java 1.6 it seems not to hang bug under Java 1.5 hangs fairly easily, on Linux. Java is 1.5.0_22.

I suspect this is relevant: http://stackoverflow.com/questions/3292577/is-it-possible-for-concurrenthashmap-to-deadlock which refers to this JVM bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6865591 which then refers to this one http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6822370

It looks like that last bug was fixed in Java 1.6 but not 1.5.

Migrated from LUCENE-3235 by Michael McCandless (@mikemccand), updated May 09 2016 Attachments: LUCENE-3235.patch (versions: 3) Linked issues:

4648

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1 to drop java 5

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

LOL, no comment.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

i ran the test with the same version as mike (1.5.0_22) in two ways on windows:

-Dtests.iter=100
in a loop from a script, 100 times with its own ant run.

i can't reproduce it on windows.

in my eyes, there isn't even an argument about whether or not we should support java5: its not possible, if bugs are not getting fixed.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Still hangs if I run -client; but it looks like -Xint prevents the hang (235 iterations so far on beast).

3.2 also hangs.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mike, i installed 1.5.0_22 (amd64) on my linux machine, and i can't reproduce there either (i ran like 500 iterations).

Maybe my hardware isn't concurrent enough? or maybe you should un-overclock? :)

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

VERY interesting! Is anyone able to repro this hang besides me...?

asfimport commented 13 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I don't think you can force ~~client if it's a 64 bit release and you have tons of memory, can you? You can check by running java -client -version -~~ this is what it tells me, for example:

dweiss@dweiss-linux:\~/work/lucene/lucene-trunk$ java -client -version
java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)

Can you do a remote stack of all the VM (or run it from the console and send it a signal to dump all threads)?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Yes the stack looks just like the stack overflow link I posted – several threads stuck in sun.misc.Unsafe.park ;)

java -Xint definitely does not hang... ran for like 4200 iterations.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Indeed java ~~client -version shows it's still using server VM -~~ you're right!

asfimport commented 13 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I'm same as Robert: +1 to drop 1.5...

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

+1 to drop 1.5...

+1.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

we moved to 1.6 on trunk seems we can't do much about it on 3.x - folks should run their stuff on 1.6 jvms or newer

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

wait, this statement makes no sense.

if 1.5 is no longer supported, then 1.5 should no longer be supported, and we should be free to use 1.6 code everywhere.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I agree with Robert. This issue is still existent in 3.x as we officially support Java 5.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

well then we should fix it - I will mark it as 3.5

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

An easy fix would be to use Collections.synchronizedMap(new HashMap()) in the ctor to initializer cache1 and cache2 (if Java 5 is detected)? If people are using Java 5 they get not-the best-performance.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I like Uwe's idea: not-the-best-performance is far preferable to a hang/deadlock!!!!!

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I am currently preparing a patch.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Patch.

We should forward port the deprecation/removal of useless Constants.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

An easy fix would be to use Collections.synchronizedMap(new HashMap()) in the ctor to initializer cache1 and cache2 (if Java 5 is detected)? If people are using Java 5 they get not-the best-performance.

I like that too...

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Updated patch after #4648 was committed. I also added a System.out.println to the test (VERBOSE only).

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I wait until tomorrow before I commit this "safe-but-slow" fix.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 for the safe-but-slow Java 5 only workaround....

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

+0

if anyone else suggested that we should add a "slow hack" to work around a Sun JRE bug i would argue that we were being unfair to people using other JRE (ie: does IBM's JRE have this bug? – do IBM java 1.5 users deserve slower performance because Sun's JRE has a bug?) but since rmuir is the biggest proponent I know of not assuming everyone on the planet uses Sun JREs, and he's signed off on this, I'll defer.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We are using ConcurrentHashMap also at other places, should we replace all of them or where is the bug that this happens only here?

It also appears to happen on Mike's machine, so maybe its hardware-related (Solaris?) as the Sun bugreport seems to tell us.

I am also +0 to apply the patch. I just showed one possibility how to fix this.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hoss you are right: we should also check Constants.SUN ?

otherwise lets not do the hack...

But i'm for the change because there is nothing slower than a hang/deadlock...

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

...and Solaris? The JVM BUG seems to only affect Solaris (according to the sun reports).

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

and intel cpu

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I modified the patch, the thing is that Mike was seeing this on Linux I think too...

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

here is a very detailed writeup for this: http://blogs.oracle.com/dave/entry/a_race_in_locksupport_park

some interesting facts:

The problem would only manifest when we were using the -UseMembar optimization that lets us remove fences from certain hot thread state transitions paths that need to coordinate safepoints between mutator threads and the JVM. This feature is enabled by default, but we can turn it off with the -XX:+UseMembar switch, which causes the JVM to emit normal fence instructions in the state transitions paths.
The bug is a "day-one" bug and present in all versions of HotSpot.
Parker::park() and unpark() reside in os_linux.cpp, os_solaris.cpp and os_windows.cpp for Linux, Solaris and Windows, respectively.
The built-in synchronized implementation uses a different park mechanism (PlatformPark::) whereas the java.util.concurrent infrastructure uses Parker::. Only Parker:: is vulnerable.
The bug will not manifest on uniprocessors or environments where threads are otherwise constrained to just a single processor.

I think the only reasonable fix for this is to recommend people to use -XX:+UseMembar if they are running on a vulnerable JVM

simon

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

That blog is as cool as the generics policeman ones... :-)

Thanks Simon, I think we should list this bug and its workaround in the wiki page and close this report.

Mike can you try -XX:+UseMembar ?

Uwe

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Thanks Simon, I think we should list this bug and its workaround in the wiki page and close this report.

+1 this is not our problem. if we go and fix all java.util.concurrent uses in solr & lucene we gonna end up in a big mess. According to the oracle blog this is also in 1.6 jvms and you will be vulnderable if you use any CHM like classes in your own code...

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I updated the wiki...

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

That's a nice blog post! What a scary platform-specific JVM bug...

I still hit hit this hang reasonably often when running 3.x tests. It's always the DBLRU cache, so far anyway.

Because this is our most intense use of a CHM... I still think the workaround (scoped down to 1.5, Sun JVM, little endian arch) makes sense? I agree it won't fully work around the JVM bug, since in theory other uses of java.util.concurrent.* could hit it, but it can prevent the most common occurrence? The patch seems minimal and worth it... a hang is a truly awful.

asfimport commented 12 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I still think the workaround (scoped down to 1.5, Sun JVM, little endian arch) makes sense?

+1 Doesn't hurt other JVMs, improves things on the Sun JVM (something that hangs a lot and then does not hang is a big improvement in my book), and putting a workaround command to use in the wiki just seems a whole lot less user friendly to me. It doesn't mean we have to try and address every use of java.util.concurrent to work around this specific issue, does it?

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This testcase fails, but we are using concurrent also in ParallelMultiSearcher (die, die, die) and other places (even the indexer was partly upgraded to use ConcurrentLock). In my opinion we should not change our code to work around that issue. Just because one test case hangs its not guaranteed that other uses will work correctly. It brings a false security and slows down VMs that work correctly. And it only affects very modern processors.

If we would have a logging framework in Lucene (maybe Solr could do this): It could parse the args of Java (from system property) and look for -XX:+UseMembar, if its Java 1.5 it should print a warning to Solr/Lucene log file.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

In my opinion we should not change our code to work around that issue.

In general, I think we should change our code to work around awful JVM bugs, as long as 1) it's not so much effort for us to to do so (and as always a volunteer steps up to the task), and 2) the change has negligible cost to "lucky" users (who use a JVM / the right flags that would not have hit the JVM bug).

I think the last patch fits these criteria, since it's a tiny change and it scopes the workaround?

We've done this many times in the past; if the cost to "lucky" users is negligible and the benefit to "unlucky" users (unknowingly using the affected JVMs) is immense (not hitting horrific bug), I think the tradeoff is worthwhile? Otherwise users will conclude Lucene (or whatever software is embedding it) is buggy.

This testcase fails, but we are using concurrent also in ParallelMultiSearcher (die, die, die) and other places (even the indexer was partly upgraded to use ConcurrentLock).

Right, we use concurrent* elsewhere, but terms dict is the big user... very few apps actually use PMS.

It brings a false security and slows down VMs that work correctly.

Well, we already have "false security" that Lucene won't hang on any JVM... we don't claim this patch will fully work around the bug, but at least it should reduce it.

How are we slowing down other VMs...? We scope the workaround?

I'm not saying we should go crazy here, making a big patch to avoid concurrent* everywhere, but the current patch is minimal, addresses the big usage of concurrent* in 3.x, is scoped down well.

It will avoid hangs for some number unlucky users out there... so why not commit it?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

How are we slowing down other VMs...? We scope the workaround?

I agree, according to the bug report it seems all 1.5's are affected and some 1.6's? Doesn't seem like solaris is related either, especially since Mike hit it on linux

So the current patch is actually way under-scoped.

Sure, some 1.6's are affected, and if we want it to be even better, we should likely improve Constants a little bit to make the minor version more easily accessible, from the bug report it seems we should at least consider doing something for < 1.6.0u21 ? And we should remove the 'Solaris' check, but keep it little-endian because the bug report mentions its way more likely to happen on those cpus.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Yes, also 1.6.0_17 is affected. As always, 1.6.0_18 is the last and only good JVM :-)

My problem with the patch is that it only affects very few users, most users will have a working environment also with broken JVMs. The fix in the patch is very heavy, as, if we apply it correctly, will also slowdown <1.6.0_18.

As I said before, we should at least instruct Solr to print a WARN in the log if a JVM < 1.6.0_18 is started and the JVM parameter -XX:+UseMembar is missing. In Lucene we have no way to tell this the user as we have no logging framework, alternatively we could throw an Error is one of the central classes in Lucene is loaded by classloader and the JVM parameter is not given (static initializer e.g. in Constants.java). The same way we could tell the user: Dont use Java 7 GA.

As far as I know, the JVM command can be checked with a System-property and a simple regex should help.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This ois the way to get the runtime args: http://download.oracle.com/javase/1.5.0/docs/api/java/lang/management/RuntimeMXBean.html#getInputArguments():

RuntimeMXBean m = ManagementFactory.getRuntimeMXBean();
List<String> vmargs = m.getInputArguments();

This prints all args for me:

[junit] vmargs=[-Xmx512M, -Dtests.verbose=false, -Dtests.infostream=false, -Dtests.lockdir=C:\Users\Uwe, Schindler\Projects\lucene\trunk-lusolr1\lucene\build, -Dtests.codec=random, -Dtests.postingsformat=random, -Dtests.locale=random, -Dtests.timezone=random, -Dtests.directory=random, -Dtests.linedocsfile=europarl.lines.txt.gz, -Dtests.iter=1, -Dtests.iter.min=1, -Dtests.seed=random, -Dtests.luceneMatchVersion=4.0, -Dtests.cleanthreads=perMethod, -Djava.util.logging.config.file=/dev/null, -Dtests.nightly=false, -Dtests.asserts.gracious=false, -Dtests.multiplier=1, -DtempDir=C:\Users\Uwe, Schindler\Projects\lucene\trunk-lusolr1\lucene\build\test\1, -Dlucene.version=4.0-SNAPSHOT, -Dtestmethod=, -Djetty.testMode=1, -Djetty.insecurerandom=1, -Dsolr.directoryFactory=org.apache.solr.core.MockDirectoryFactory, -ea:org.apache.lucene..., -ea:org.apache.solr...]

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Maybe we should remove this global cache completely?

If the whole point is to prevent double-seeks for queries doing IDF then getting termdocs, we can just put a tiny LRU cache in the already-existing threadlocal...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

actually i think even a tiny queue would work.

this would prevent the double-seeks. I dont think there is a point to worrying about anything else.

the LRU-ness seems stupid: for 'common' terms that appear over and over like stopwords, those are gonna be slow anyway.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Robert, I agree, a simple LRUCache based on LinkedHashMap with

`@Override`
protected boolean removeEldestEntry(Map.Entry<K,V> eldest) {
  return size() > MAX_ENTRIES;
}

would be fine. Because of this bugs and heavy use in Solr of other concurrent classes, we should maybe add a warning to the startup based on my findings before.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

If the whole point is to prevent double-seeks for queries doing IDF then getting termdocs, we can just put a tiny LRU cache in the already-existing threadlocal...

In fact this is what we used to do (LinkedHashMap), and then in #3151 we moved to a cache shareable across threads.

But I agree – risk of hangs is not worth it. Let's just move back to thread-private cache, but let's make it tiny (8? 16? 13?) in size?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

we can also offer an option... fasterButMoreHangs...

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Bulk changing fixVersion 3.6 to 4.0 for any open issues that are unassigned and have not been updated since March 19.

Email spam suppressed for this bulk edit; search for hoss20120323nofix36 to identify all issues edited

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.

apache / lucene

TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug [LUCENE-3235] #4308

4648