memory leak resulting in hung threads

btwilk commented 1 year ago

Version: 2.15

I'm model checking a fairly complex piece of code and found that several iterations complete successfully, but eventually lincheck detects hung threads.

Increasing heap size allows more iterations to succeed. This made me suspect a memory leak across iterations. I analyzed some heap dumps and found that most memory was consumed by ModelCheckingStrategy instances which contain huge (100s of MB) trees of potential context switch choices. It seems that each iteration uses a fresh ModelCheckingStrategy instance but something is holding on to references to the instances from completed iterations so that they are not GC'd.

E.g. when I take a snapshot during iteration 3, I see 3 instances of ModelCheckingStrategy in the dump despite there having been full GCs.
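For anyone wanting to reproduce this kind of check, here is a minimal sketch of taking a live-objects heap dump from inside the JVM. It assumes a HotSpot JVM and uses the JDK's `HotSpotDiagnosticMXBean`; the helper name `dumpLiveHeap` and the file name are made up for illustration, and the thread does not say how the original dumps were actually taken.

```kotlin
import com.sun.management.HotSpotDiagnosticMXBean
import java.lang.management.ManagementFactory
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical helper: writes a heap dump containing only reachable objects.
// Instances that still show up in such a dump are held by live references.
fun dumpLiveHeap(target: Path) {
    Files.deleteIfExists(target) // dumpHeap fails if the file already exists
    val bean = ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean::class.java)
    bean.dumpHeap(target.toString(), true) // true = dump live objects only
}

fun main() {
    val target = Files.createTempDirectory("lincheck-leak").resolve("iteration.hprof")
    dumpLiveHeap(target)
    println("Heap dump written to $target (${Files.size(target)} bytes)")
}
```

The resulting `.hprof` file can be opened in a heap analyzer to count ModelCheckingStrategy instances and follow their paths to GC roots.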

btwilk commented 1 year ago

I know I inundated you recently with a lot of bug reports, but I wanted to call out that this one is the most debilitating. I'm running with a 48GB heap and still can only do 10-25 iterations depending on the test, and I need to fork JVMs for each test class as the memory leaks pile up across test classes too.

Wondering if you've found the cause of the leak.

alefedor commented 1 year ago

Hi @btwilk !

Thank you for your valuable bug reports :)
Yeah, a memory leak can result in a hung threads report.

I've checked tests for java.util.concurrent.ConcurrentHashMap/ConcurrentDeque/ConcurrentSkipList and jctools.NonBlockingHashMapLong but was not able to observe more than one ModelCheckingStrategy instance.

I know there is something like a memory leak in LinearizabilityVerifier because it caches scenario results and keeps this cache between iterations. However, from your description, this is not your case. Without this leak, the memory consumption graphs seem healthy to me.
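As a rough illustration of the caching pattern described above (not lincheck's actual code), the sketch below shows how a cache that outlives iterations keeps every cached result object reachable. `ResultCache` and `retainedEntries` are hypothetical names invented for this example.

```kotlin
// Hypothetical stand-in for a verifier-side cache; not lincheck's actual class.
class ResultCache {
    private val verified = HashMap<List<Any?>, Boolean>()

    fun remember(results: List<Any?>, valid: Boolean) {
        verified[results] = valid
    }

    fun clear() = verified.clear()

    val size: Int get() = verified.size
}

// Runs `iterations` rounds that each cache `resultsPerIteration` distinct results
// and returns how many entries the cache still holds at the end.
fun retainedEntries(iterations: Int, resultsPerIteration: Int, clearBetween: Boolean): Int {
    val cache = ResultCache()
    for (iteration in 0 until iterations) {
        if (clearBetween) cache.clear()
        for (i in 0 until resultsPerIteration) {
            // A non-primitive result (here a List) stays reachable while it is a cache key.
            cache.remember(listOf(listOf(iteration, i)), true)
        }
    }
    return cache.size
}

fun main() {
    println(retainedEntries(iterations = 10, resultsPerIteration = 100, clearBetween = false))
    println(retainedEntries(iterations = 10, resultsPerIteration = 100, clearBetween = true))
}
```

When the cache is kept across iterations, the number of retained entries grows with the iteration count; clearing it between iterations bounds it by the size of a single iteration.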

So, can you please provide some insight into what kind of tests you are running?
In particular, I am interested in the following:

Do operations in your tests have parameters or results of non-primitive types (i.e., actual objects, not Int, Long, Bool, etc)?
How big are the scenarios you generate? The number of interleavings and valid results grows exponentially with the number of threads and operations, so usually no more than 3 threads and 3 operations in each thread are used in tests (and we do not know of a real example where a bug requires more threads/operations).
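To give a feel for that growth, here is a small sketch that counts interleavings under a simplifying assumption: each operation is treated as atomic, so only the order of whole operations across threads is counted. Model checking switches contexts at a finer granularity, so real numbers are larger. The function names are made up for this example.

```kotlin
// Binomial coefficient C(n, k); each intermediate value is itself a binomial, so division is exact.
fun binomial(n: Int, k: Int): Long {
    var result = 1L
    for (i in 1..k) {
        result = result * (n - k + i) / i
    }
    return result
}

// Number of ways to interleave `threads` sequences of `opsPerThread` atomic operations
// while preserving per-thread order: (threads * ops)! / (ops!)^threads.
fun interleavings(threads: Int, opsPerThread: Int): Long {
    var remaining = threads * opsPerThread
    var count = 1L
    repeat(threads) {
        // Choose which of the remaining slots belong to this thread.
        count *= binomial(remaining, opsPerThread)
        remaining -= opsPerThread
    }
    return count
}

fun main() {
    for (n in 2..4) println("$n threads x $n ops: ${interleavings(n, n)}")
}
```

Under this assumption, 2 threads with 2 operations give 6 orderings, 3 with 3 give 1680, and 4 with 4 already give 63,063,000.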

btwilk commented 1 year ago

Answering your questions:

I currently have non-primitive parameters but looking back at my commit history, I had already reduced the iteration count to 25 with only primitive parameters.
I happen to have also landed on 3 threads with 3 operations as being a sweet spot.

I too see that simple tests for e.g. ConcurrentHashMap don't leak ModelCheckingStrategies. I'll see if I can figure out what it is about my use case that seems to trigger a leak.

btwilk commented 1 year ago

Here is an example that runs out of memory on iteration 27 with -Xmx256m:

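import org.jetbrains.kotlinx.lincheck.LoggingLevel
import org.jetbrains.kotlinx.lincheck.annotations.Operation
import org.jetbrains.kotlinx.lincheck.check
import org.jetbrains.kotlinx.lincheck.strategy.managed.modelchecking.ModelCheckingOptions
import org.junit.Test

class LeakTest {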

    private var x = 0
    private val mutex = Object()

    @Operation
    fun getAndInc(): List<Int> {
        synchronized(mutex) {
            return listOf(x++)
        }
    }

    @Test
    fun test() = ModelCheckingOptions()
        .iterations(Integer.MAX_VALUE)
        .logLevel(LoggingLevel.INFO)
        .check(this::class)
}

alefedor commented 1 year ago

@btwilk

Yeah, that's exactly why I asked about primitive types. Here the result type is not primitive (List<Int>).

This problem is currently being fixed. If your original tests used only primitive types, then the cause of that leak is still unknown.

btwilk commented 1 year ago

I seem to be able to get my tests running stably after eliminating non-primitive params / return types in the lincheck test as well as eliminating use of log4j in the system under test. I don't have a nice minimal example to share re: log4j.

btwilk commented 1 year ago

Any idea when your fix might be released? Eager to try it out!

ndkoval commented 1 year ago

Fixed under #128

JetBrains / lincheck

memory leak resulting in hung threads #124