Closed pshirshov closed 6 years ago
As far as I can tell, this can't be a bug in ClassGraph, because ScanResult.allResources
(which is what the offending line is iterating through) is only initialized once, in the ScanResult
constructor. Is it possible you have some code in another thread that calls getAllResources()
, then modifies the returned ResourceList
?
Nope. This code is always being executed in a single thread and we never modify the list.
Though ConcurrentModificationException doesn't neccessary mean threading issues, it may be thrown if you modify a collection while iterating on it. But we don't modify, just read.
What am I missing? Please take a look at the code:
The code never modifies allResources
.
Are you using the latest version? Maybe I'm looking at the wrong line number.
Yup, latest. Actually it may be a sideffect, so I've tried to update my code to evaluate target collection in strict manner. In case doesn't help I'll report here.
A side effect of?
Though ConcurrentModificationException doesn't neccessary mean threading issues, it may be thrown if you modify a collection while iterating on it. But we don't modify, just read.
ClassGraph code doesn't modify the collection either, and based on the code example you gave, the only explanation would be interference by another thread. Or maybe asScala
does something funky? That wouldn't be expected though...
So, asScala
itself isn't strict but I can't see how may it break things.
Is ClassGraph API itself threadsafe? We don't use the collection across several threads, though we may invoke several scan()
s concurrently.
So,
asScala
itself isn't strict but I can't see how may it break things.
How so? What exactly does it do under the hood?
Is ClassGraph API itself threadsafe? We don't use the collection across several threads, though we may invoke several
scan()
s concurrently.
You should create one ClassGraph
instance per call to scan()
, and the scan should be launched from a single thread, since it is itself multithreaded. Technically you could reuse a ClassGraph
instance to launch another scan, but most of the configuration options are designed for one-off configuration, and the ClassGraph
class itself was not designed with multithreaded reuse in mind.
The ScanResult
itself should be threadsafe.
But in particular, just take a look at the source code for ScanResult
that I linked, based on the stacktrace you posted. There is simply no modification of the collection in question in the ScanResult
code, after the constructor has run, or in particular in the method in the stack trace (getResourcesMatchingPattern(ScanResult.java:383)
). This is why I said that I don't see any way that this bug could be in ClassGraph itself. I have never seen a ConcurrentModificationException
in any runs of any unit tests either, and there's relatively good coverage of the API among the unit tests.
How so? What exactly does it do under the hood?
In case you feed it with an iterable it would return you iterable, so asScala
doesn't trigger immediate evaluation. But it should be a problem in our case.
You should create one ClassGraph instance per call to scan()
That's exactly what we do.
just take a look at the source code for ScanResult
I did. Seems to be safe. Okay, I will try to dig deeper and check if my idea with enforcing evaluation helped or not.
Ah! I think I see one way this may be happening. I thought that ScanResult.allResources
was never modified after the constructor was run, but I missed one reference.
ScanResult#close()
contains the following code:
if (!isClosed) {
isClosed = true;
if (allResources != null) {
for (final Resource classpathResource : allResources) {
classpathResource.close();
}
allResources.clear(); // (*)
pathToResourceList.clear();
}
// [...]
}
The line marked (*)
is probably what is causing the ConcurrentModificationException
, due to a race condition in your code, with one thread calling ScanResult#close()
, and another thread continuing to use the ScanResult
instance after it is closed. It would have to be a race condition between two threads, because ScanResult#getResourcesMatchingPattern(final Pattern pattern)
checks if (isClosed)
on entry, and throws an exception if #close()
has already been called:
public ResourceList getResourcesMatchingPattern(final Pattern pattern) {
if (isClosed) {
throw new IllegalArgumentException("Cannot use a ScanResult after it has been closed");
}
if (allResources == null || allResources.isEmpty()) {
return new ResourceList(1);
} else {
final ResourceList filteredResources = new ResourceList();
for (final Resource classpathResource : allResources) {
final String relativePath = classpathResource.getPath();
if (pattern.matcher(relativePath).matches()) {
filteredResources.add(classpathResource);
}
}
return filteredResources;
}
}
Therefore, the thread that closes the ScanResult
would have to close it while the second thread is iterating through allResources
(i.e. after the isClosed
check).
The close()
method is the only place where ClassGraph modifies allResources
after the ScanResult
constructor exits. So there are only two possible explanations: (1) the race condition described above, or (2) your own code calling ScanResult#getAllResources()
and then subsequently modifying the resulting list in one thread while the thread you describe is trying to iterate through that same list by calling getResourcesMatchingPattern(Pattern)
.
Let me know if it looks like the race condition is at fault. If so, you should fix the code so that the ScanResult
is never accessed after calling ScanResult#close()
, if you're manually closing it.
It would have to be a race condition between two threads,
Shouldn't be. There is strictly one thread operating on the result. Moreover we don't even manage the threads, this is happening within our macro.
And the result of computation never leaves local scope, so it physically can't be shared across threads.
This is the old chunk of our code:
def smth() = {
...
val configUrls = new ClassGraph().scan()
.getResourcesMatchingPattern(configRegex.r.pattern)
.asScala
.map(_.getURL)
}
val referenceConfig = configUrls.foldLeft(ConfigFactory.empty())(_ withFallback ConfigFactory.parseURL(_)).resolve()
Some(new ConfigModule(AppConfig(referenceConfig)))
}
So, the method may be invoked by several threads but nothing is shared and the object, leaving the method is strict.
This chunk of code isn't correct - the resource was never closed. Though it shouldn't lead to CME by any chance in case you don't call .close()
from finalize
Though as you can see the object is not being closed here so I'm wondering if it may be some finalization logic+escape analysis. Though from what I can remember in case you didn't invoke close
- noone would do that for you.
OK, there's one more possible scenario then -- it is a threading issue, but not your threads. The other thread is the garbage collector.
The documentation says several times that you should allocate the ScanResult
(the result of ClassGraph#scan()
in a try-with-resources block, or to close the result manually in a finally
block in Scala (since Scala doesn't yet have try-with-resources). You are not doing that in your code. The reasons given for using ScanResult
in try-with-resources are so that there aren't any unpredictably large GC pauses when the garbage collector calls ScanResult#close()
(since close()
is also called by ScanResult#finalize()
).
What must be happening is that Scala must take the idea of "tail recursion" / "tail calls" to the limit, and after new ClassGraph().scan()
is called, returning the ScanResult
, as soon as .getResourcesMatchingPattern(configRegex.r.pattern)
is called, Scala realizes there are no more things that reference the ScanResult
, and the ScanResult
is scheduled for garbage collection. The garbage collection takes long enough to start that the if (isClosed)
check in getResourcesMatchingPattern(Pattern)
never fails. But then the ScanResult
is garbage collected, triggering the finalizer to run, and ScanResult#close()
gets called. This causes the allResources.clear()
line to be run while the for (final Resource classpathResource : allResources) { }
loop is still running. That's a pretty complicated chain of events!
So the fix for your code should be to simply do this, as recommended in the docs (albeit for a different reason):
val scanResult = new ClassGraph().scan()
try {
val configUrls = scanResult.getResourcesMatchingPattern(configRegex.r.pattern)
.asScala
.map(_.getURL)
// ... use configURLs
} finally {
scanResult.close()
}
Please confirm if this fixes the problem for you.
However, since I suspect that this is not the last time this will come up, I will change the close()
method so that it does not clear the internal collections. This creates a bit more work for the garbage collector, but it's better to solve this for cases where users forget to use try-with-resources or manually close the ScanResult
in a finally
block.
And actually, if manually closing the scanResult
in a finally
block does solve the issue for you, this is a very significant bug in Scala, since Scala should not be calling a finalizer
for an object when methods of the object are still running! This would be a very interesting finding. (I know finalizers are considered "broken by design", but Scala still needs to honor their API contract.)
I pushed out a fix as version 4.2.9. Please test for the following two scenarios (both of these should fix the problem):
scanResult.close()
in the finally
blockfinally
block (i.e. your original code)Ahhh, so you call close()
from finalizer. In that case it explains the issue, it's escape analysis+this.
I've rewritten that chunk of code yesterday with try/finally, so it should fix the issue.
is a very significant bug in Scala, since Scala should not be calling a finalizer for an object when methods of the object are still running!
Scala has nothing to deal with it. I bet it's escape analysis.
I will check how would it work with try/finally
. Till the moment there were no failures, so I'm almost sure it's fixed.
Ahhh, so you call close() from finalizer.
Yes, by design. It is also called by a shutdown hook, to be extra sure it is cleaned up. ClassGraph can create lots of temporary files in cases where you scan lots of inner jars (jars within jars), since they have to be extracted to be scanned.
See here:
https://github.com/classgraph/classgraph/wiki/API:-ScanResult#scanresult-lifecycle
Scala has nothing to deal with it. I bet it's escape analysis.
That's what I'm saying -- that Scala's escape analysis must be broken (or incomplete). Scala should have held a reference to the ScanResult
against garbage collection when it knows that it just barely invoked a method on the very object that was scheduled for garbage collection.
In other words, there should be one reference to the invocation target held on the stack or in the heap during a method call, to guarantee that the invocation target cannot be garbage collected while the object's own methods are being executed. When the method returns, the invocation target reference can be released.
There is no way that the invocation target of a running method should ever be allowed to be garbage collected! If Scala is allowing this, it is broken.
That's what I'm saying -- that Scala's escape analysis must be broken (or incomplete).
This is JVM itself. Scala has to do nothing with GC policies/etc. -XX:+DoEscapeAnalysis
. We always use this option and I think this is the first time we have any issues with it (though we've recently switched to jdk11, maybe they have more agressive policies).
Yes, the reference is still there, within asScala
adaper. So this looks like an EA problem.
Scala has to do nothing with GC policies/etc.
But Scala could fix this behavior -- regardless of whether it is caused by a JVM runtime option or not -- by keeping around a reference to the invocation target, right? Or not? I'm pretty sure Java survives this, due to holding stack frames as long as possible (Scala is smarter about dropping references immediately after last usage).
I believe you would find that if you translated your code snippet directly into Java, you would not see the problem on 4.2.8.
-XX:+DoEscapeAnalysis
I see, I didn't know this was a JVM option. Usually compilers do their own escape analysis, not the VM. Why do you enable this? Is it for memory conservation or something?
Please see the two numbered testing requests a few comments back, in case you missed that. Thanks!
But Scala could fix this behavior -- regardless of whether it is caused by a JVM runtime option or not -- by keeping around a reference to the invocation target, right?
There is a reference, within adapter object.
trait IterableWrapperTrait[A] extends ju.AbstractCollection[A] {
val underlying: Iterable[A]
def size = underlying.size
override def iterator = IteratorWrapper(underlying.iterator)
override def isEmpty = underlying.isEmpty
}
Don't know why jvm misses it.
Why do you enable this? Is it for memory conservation or something?
This is a standard option for situations where you produce a lot of temporary local objects. It reduces GC time bcs these objects may be freed just in time.
Anyway, thanks for the help.
Please see the two numbered testing requests a few comments back, in case you missed that. Thanks!
Yup, will check.
OK, then I agree with you now that this is a JVM problem :-D Sorry!
It would be interesting to try to replicate this behavior with a simple case, e.g.:
public static class ClassWithFinalizer {
private boolean finalized = false;
public void longRunningMethod() {
for (int i = 0; i < 10; i++) {
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
}
System.out.println(i + " : " + finalized);
}
}
@Override
protected void finalize() throws Throwable {
finalized = true;
}
}
Try calling new ClassWithFinalizer().longRunningMethod()
and see if you can replicate the brokenness in your same JVM environment with the same commandline options. (The code should never output "true".)
This is a standard option for situations where you produce a lot of temporary local objects. It reduces GC time bcs these objects may be freed just in time.
Ah, I was wondering how the GC was getting notified so quickly that the invocation target was not used in the rest of the block. Usually (at least in Java), the GC doesn't even check if a reference is no longer needed until the containing scope is popped off the stack.
Don't know why jvm misses it.
Aleksey Shipilëv would be very interested to know about this situation, probably, if it turns out that enabling escape analysis does in fact cause finalizers to run while methods in those objects are still running. That's very serious!
I mentioned:
Usually compilers do their own escape analysis, not the VM.
From this article by Aleksey: https://shipilev.net/jvm-anatomy-park/18-scalar-replacement/
But what really happens is that the compiler performs the so called Escape Analysis (EA), which can identify which newly created objects are not escaping into the heap, and then it can do a few interesting optimizations. Note that EA itself is not the optimization, it is the analysis phase that gives important pieces of data for the optimizer.
However, I did find one interesting fact: any class that overrides finalize()
is supposed to be marked as a GlobalEscape
class, which is forced to be allocated on the heap, so post-escape-analysis optimizations cannot be applied to that class. This is the case with ScanResult
, so using the -XX:+DoEscapeAnalysis
should have no effect whatsoever on ScanResult
. Whatever is supposed to mark ScanResult
as a GlobalEscape
class (because it overrides finalize()
) is failing.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.3797&rep=rep1&type=pdf (top of page 11)
As far as I can tell, it is the compiler's responsibility to determine if ScanResult
overrides finalize()
when a new ScanResult
is created. But I haven't yet figured out how the compiler is supposed to notify the JVM as to whether a given object reference is supposed to be marked as heap-only, since the classfile format spec doesn't have any mention of escape analysis.
I tried replicating this behavior in Scala and Java, and could not trigger it using .getResourcesMatchingPattern(Pattern)
using -XX:+DoEscapeAnalysis
in JDK 11.
I did manage to replicate the behavior in ClassGraph-4.2.8 using .getAllResources()
, after that method had returned, but that ConcurrentModificationException
actually makes sense, because the code in .getAllResources()
had already exited when the finalizer was run. (In your case, the .getResourcesMatchingPattern(Pattern)
method is still running when the finalizer is run.)
ClassGraph-4.2.9 fixes the ConcurrentModificationException
that I was able to trigger, so it should fix your case too, but I would still be interested in figuring out how your runtime environment ended up running the finalizer early.
This was a very rare case. I will try to get back with more info.
So, this snippet breaks with 100% probability:
import io.github.classgraph.{ClassGraph, ClassInfo}
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._
println(ClassGraph.getVersion)
println(System.getProperties.asScala.filterKeys(_.startsWith("java.")).mkString("\n"))
println(ManagementFactory.getRuntimeMXBean.getInputArguments)
while (true) {
new ClassGraph().scan()
.getResourcesMatchingPattern(".*\\.conf".r.pattern)
.asScala
.map(_.getURL)
}
I'm just executing it in IDEA repl.
The output:
So, no threading, no even escape analysis flat, nothing.
So, modified snippet to get the full trace:
import io.github.classgraph.{ClassGraph, ClassInfo}
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._
import scala.compat.Platform.ConcurrentModificationException
println(ClassGraph.getVersion)
println(System.getProperties.asScala.filterKeys(_.startsWith("java.")).mkString("\n"))
println(ManagementFactory.getRuntimeMXBean.getInputArguments)
while (true) {
try {
new ClassGraph().scan()
.getResourcesMatchingPattern(".*\\.conf".r.pattern)
.asScala
.map(_.getURL)
} catch {
case cme: ConcurrentModificationException =>
cme.printStackTrace()
throw cme
}
}
Tried to switch on to Oracle JDK 8.
Tried exactly the same snippet on ClassGraph 4.2.9 - it works.
asScala
does the following:
implicit def asScalaBufferConverter[A](l : ju.List[A]): AsScala[mutable.Buffer[A]] =
new AsScala(asScalaBuffer(l))
def asScalaBuffer[A](l: ju.List[A]): mutable.Buffer[A] = l match {
case null => null
case MutableBufferWrapper(wrapped) => wrapped
case _ => new JListWrapper(l)
}
and JListWrapper
is
case class JListWrapper[A](underlying: ju.List[A]) extends mutable.AbstractBuffer[A] with mutable.Buffer[A] {
I'm not sure what exactly is happening there. From all I can understand, finalizers shouldn't be triggered ahead of time.
@pshirshov Thanks for running these tests. I'm glad 4.2.9 seems to have fixed this for now.
However, this bug report made me realize that the change I made in 4.2.9 is insufficient to prevent all issues of this form. Even if the garbage collector does correctly wait until the last reference to a ScanResult
to be dropped before it runs the finalizer, the finalizer removes temporary files, and this can cause classloading to fail. So in Scala, if you don't hold a reference to the ScanResult
and call finalize
in a finally
block, then if you call ClassInfoList#loadClasses()
or ClassInfo#loadClass()
in a method chain like you do, it is possible for classloading to fail.
Therefore, the correct fix for this is not just to upgrade to 4.2.9, but you really must close the ScanResult
in a finally
block in Scala. I will update the documentation to strengthen the wording around this.
I'll take up the escape analysis issue with the JVM performance team to see if they can isolate why the class is not being marked as GlobalEscape
(i.e. why the finalizer is running early).
The updated documentation:
https://github.com/classgraph/classgraph/wiki/API:-ScanResult#scanresult-lifecycle
@pshirshov So I raised this on the JDK mailing lists, and though I knew lots of people claim finalizers are broken, it turns out that finalizers are more or less "broken by design".
http://mail.openjdk.java.net/pipermail/jdk-dev/2018-October/002067.html
http://mail.openjdk.java.net/pipermail/hotspot-dev/2018-October/034625.html
The Hotspot team made a decision to not include the object instances of currently-running methods in liveness analysis, which feeds into reachability analysis. So if a method is running, it is not considered live (and therefore reachable) unless it can be definitely determined that one of its fields will be accessed in the future -- and therefore, it may be finalized while the method is running.
This is a very bad design decision in my opinion, and something that can be fixed with very low overhead to Hotspot. But they made the decision in JDK 9 to add Reference::reachabilityFence
instead, which you have to wrap all methods of finalizable objects with.
This is only available in JDK 9+, and doesn't even solve all cases of finalizers being run early. For example, wrapping the methods of ScanResult
with reachabilityFence
will still fail to prevent the finalizer being run early in the following example, where ClassInfo
objects have a back-ref to the ScanResult
:
new ClassGraph().enableClassInfo().scan().getAllClassInfo().loadClasses()
So this is unfortunately a hopeless situation to try to fix, until or unless the Hotspot team change their policy on choosing a tiny performance margin over arguable correctness.
What I will do instead is simply remove the finalizer method, and users of ClassGraph will simply have to deal with resource leaks if they don't close the ScanResult
. This is reasonable, as it is the expectation for any other class that extends AutoCloseable
or Closeable
.
I will release 4.2.10 without the finalizer method momentarily. You should update your code to close ScanResult
in a finally
block, as shown in the link in the previous comment on this issue.
Thanks for your report, and for your patience -- this bug was very hard to get to the bottom of!
I will release 4.2.10 without the finalizer method momentarily. You should update your code to close ScanResult in a finally block, as shown in the link in the previous comment on this issue.
Yup, I've already did that. And yes, I support the drop of finalization logic.
Thanks a lot for your investigation, it's something I was completely unaware of.
@pshirshov Here are my release notes summarizing what I learned:
https://github.com/classgraph/classgraph/releases/tag/classgraph-4.2.10
Thanks a lot for your investigation, it's something I was completely unaware of.
You're welcome, and I had no idea of this either -- I was really, really surprised by this. And I was even more surprised that this is not only a well-known issue to the Hotspot team, but that they opted not to fix it for quite insignificant performance reasons. Everybody knows that some things about finalizers are broken, but they had a chance to fix some things about how they worked, to make them a little less broken, and they chose not to fix them.
This really sucks... it's one of a growing, long list of things that I keep discovering as I dive even deeper into the Java ecosystem that are frankly very disappointing :-/
@pshirshov Oh, one more thing that came up during the discussions on the JDK mailing lists is that -XX:+DoEscapeAnalysis
is now default (adding that option to the JVM invocation has no effect).
I know this is closed but... Wow... That's really messed up.
Correctness should be the core thing. Never give up correctness for performance, ffs.
I mean, you expect that what you write is going to be done, not undefined behavior!
@andrebrait that was my reaction exactly. Correctness should never be sacrificed for performance. If this meant that you lose function inlining when you override finalize
, then so be it, that's the price you pay. The fact that in over a decade of debating this issue, the Java community process was unable to fix this -- explicitly because they didn't want to give up what is probably a tiny performance margin, which it seems nobody even actually benchmarked (it was all assumption) -- shows just how broken the "design by committee" of the JCP is.
Running finalizers late (or not at all) is not necessarily a big deal. Running finalizers early very likely breaks everything that uses finalizers.
Some good commentary from David Lloyd of Red Hat:
http://mail.openjdk.java.net/pipermail/jdk-dev/2018-October/002085.html
Another interesting email, from Hans Boehm (author of the Boehm garbage collector):
I don't think inlining is the real problem. The general problem is that just because a field of an object is still in use, does not mean that the containing object is still reachable. The register holding the reference may have been reused. e.g. for the field value itself. The receiver reference inside a method is really just like any other parameter. And it's common to pass parameters in registers that can be reused once we no longer need the parameter values, e.g. because we've retrieved the fields we need.
One could solve this by preventing elimination of dead references. But, it's hard to do this only where it matters. Doing it everywhere is an option, but hasn't gotten a lot of traction. There is a more elaborate proposal to make this easier than reachabilityFence at https://docs.google.com/document/d/1yMC4VzZobMQTrVAraMy7xBdWPCJ0p7G5r_43yaqsj-s/edit?usp=sharing
Currently reachabilityFence is by far the best solution. (Passing the reference to an existing native method, e.g. by removing "static" is more subtle, but otherwise as good when it works.) I believe the cost of reachabilityFence in implementations that support it is near zero; all it has to do is keep a value around on the stack or in a register. In my experience, it's quite hard to portably, efficiently, and robustly implement reachabilityFence() where it's not already supported. Passing it to another Java method may work, but is not portable.
ReachabilityFence still has the substantial disadvantage that it often needs to be used in many methods after referring to a field that is cleaned up by a Cleaner or finalizer, etc. The above proposal allows the field declaration to be modified, instead of every method that touches it. But the programmer still needs to know that there is an issue. (@ReachabilitySensitive is a placeholder name. @Cleaned is the current front runner in my vicinity.)
Eh, I've uncovered a pandora's box. My world will never be the same.
Though the conclusion is simple: we must never use finalizers except of situations when our app is completely independent of when they may be executed and if they may be executed. Crap-crap-crap.
More on the problem:
https://shipilev.net/jvm-anatomy-park/8-local-var-reachability/
that is a good read
Sometimes this happens:
My call looks innocent enough:
Is it possible to fix it?