aehrc / snorocket

The Snorocket Description Logic classifier for EL++ with concrete domains support
Apache License 2.0
22 stars 6 forks source link

interrupts and NIO and the Thread Pool Executor #9

Closed darmbrust closed 6 years ago

darmbrust commented 6 years ago

Hi, Have run into an odd issue, where the ThreadPool pattern of the classifier breaks other code.

This pattern: `` ExecutorService executor = Executors.newFixedThreadPool(numThreads); for (int j = 0; j < numThreads; j++) { Runnable worker = new TaxonomyWorker1(contextIndex, equiv, direc, factory, todo); executor.execute(worker); }

    executor.shutdown();
    while (!executor.isTerminated()) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    assert (todo.isEmpty());

Leads to a situation where the "shutdown()" call fires an interrupt at the idle threads in the pool. However, I'm guessing there are timing issues internally, about what is actually idle... in practice, I'm seeing interrupts hit threads that are or were involved with NIO on my side at one point or another, and this breaks the world, if NIO is involved.
https://docs.oracle.com/javase/7/docs/api/java/nio/channels/ClosedByInterruptException.html

Basically, my datastore, which uses NIO, gets killed by the call to shutdown() in the classifier, because the interrupt call kills the NIO file handle.

I think, that if you waited until todo.isEmpty() before calling shutdown() - that may solve the issue, but a simpler / more reliable solution may be to simply add an API where you allow us to pass in the executor to use - this way, I can hand in a thread pool I already have, and it doesn't have to be closed at all.

Thoughts?

lawley commented 6 years ago

Hi Dan, We're not sure how the interaction you're describing can happen. Since these are all local ThreadPools I would not expect the ExecutorService to be sharing anything with your code. The only thing where I see there might be an interaction would be that the Threads created by the ExecutorService use the defaultThreadFactory and thus all belong to the same ThreadGroup; this might then provide a link between your threads and Snorocket's.

Do you perhaps have some sample code that we can use to reproduce the problem so that we can test a prospective fix?

darmbrust commented 6 years ago

I'm not entirely sure on the full interaction either.... as much as I traced it in the debugger, anyway, the only interrupts being called during my run were by the classifier, and that corresponded roughly to when my DB would die with its FileChannel closed out from under it. I'll do some more debugging / tracing, and see if I can put together a simple test case.

darmbrust commented 6 years ago

When the issue first started happening, I had just assumed we were passing you an extended Axiom or something along those lines, that had a tie back to our data store that was doing a read.

After much more digging, I finally got to the bottom of it.

When I ran with the debugger attached, the only interrupts that were happening were coming from the classifier shutdown, and then I couldn't reproduce the problem.

When I would run without the debugger attached, the problem would happen right when the classifier was just finished shutting down.

What was actually happening, was the act of the classifier finishing was (sometimes) having a timing overlap with a tree update in the GUI, and the tree update code was using Task.cancel(), which fires interrupts by default.

So the classifier finishing at a particular time, would sometimes overlap with a tree build, and cause a new tree build to be kicked off, with a cancel being fired to the old one... which then broke the DB, because that interrupt would hit a thread that was doing reads from the DB.

A long way of saying, ignore this :)