Closed kevinrr888 closed 4 weeks ago
Another thing I looked at was the death of the WorkFinder and how that affects the Manager. Upon looking at AccumuloUncaughtExceptionHandler and testing this, only Errors halt the JVM so the Manager continues to run normally if an Exception is thrown and uncaught in the WorkFinder. Should this be the case? It seems that the Manager should either halt or create a new WorkFinder and not continue to run normally if an Exception is uncaught. I'm wondering if this is a problem elsewhere as well. Or maybe this isn't a problem at all. Some insight into this would be appreciated.
Are there some Exception's which are not handled by WorkFinder?
Some context:
Here is the code for the WorkFinder
thread:
public void run() {
while (keepRunning.get()) {
try {
store.runnable(keepRunning, fateId -> {
while (keepRunning.get()) {
try {
// The reason for calling transfer instead of queueing is avoid rescanning the
// storage layer and adding the same thing over and over. For example if all threads
// were busy, the queue size was 100, and there are three runnable things in the
// store. Do not want to keep scanning the store adding those same 3 runnable things
// until the queue is full.
if (workQueue.tryTransfer(fateId, 100, MILLISECONDS)) {
break;
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new IllegalStateException(e);
}
}
});
} catch (Exception e) {
if (keepRunning.get()) {
log.warn("Failure while attempting to find work for fate", e);
} else {
log.debug("Failure while attempting to find work for fate", e);
}
workQueue.clear();
}
}
}
@dlmarion and I briefly talked about this earlier... I mentioned that there is an exception not handled by the WorkFinder
seeing that an IllegalStateException
is thrown. He pointed out that this is caught, which I missed. But, I'm still not sure if this will work as expected if an InterruptedException
is caught. I tested manually throwing an InterruptedException
(only once) in the nested try/catch. This causes the interrupt flag to be set true and also to throw an IllegalStateException
, which is caught and handled in the outer try/catch. However, I think setting the interrupt flag may cause issues. For example, one method this thread ends up calling is ThriftScanner.scan()
. This has a check:
if (Thread.currentThread().isInterrupted()) {
throw new AccumuloException("Thread interrupted");
}
This is caught in the outer catch of run()
indefinitely.
To summarize: if tryTransfer()
throws an InterruptedException
, the thread will indefinitely throw AccumuloException
s at least in some circumstances
Would it be better if the WorkFinder
was just terminated and a new one was started (or some other fix)? This may also just not be an issue, I'm not sure; I just found it strange when I was testing what would happen if an InterruptedException
occurred that the Manager logs would be flooded with AccumuloException
s and seemingly leave the WorkFinder
broken.
I'm wondering if Thread.currentThread().interrupt();
was meant to be Thread.currentThread().interrupted();
to clear the interrupt state.
@dlmarion Based on the source code of tryTransfer()
(and it seems like this is usually the case with most or all methods that throw an InterruptedException
), the interrupted flag is cleared when the exception is thrown. So maybe that was the original intention, but it would have had no effect. I think the existing code is what was intended, but may have some unexpected outcomes. The typical pattern for handling a caught InterruptedException
seems to be setting the interrupted flag to true (since it was cleared when the InterruptedException
is thrown) and potentially some further handling.
Also, at best the linked Manager code has no effect or is a mistake and was meant to be Thread.currentThread().interrupt();
. The method called in the try/catch which throws the InterruptedException
(ZooReaderWriter.delete(String)
) boils down to the InterruptedException
thrown from a call to Object.wait()
which clears the interrupt flag. So the Thread.interrupted();
here has no effect.
Would it be better if the WorkFinder was just terminated and a new one was started (or some other fix)? This may also just not be an issue, I'm not sure; I just found it strange when I was testing what would happen if an InterruptedException occurred that the Manager logs would be flooded with AccumuloExceptions and seemingly leave the WorkFinder broken.
This infinite interrupt loop does not sounds good., The work finder being unhealthy for any reason is not good though. It is catching exception to be resilient to occasional exceptions, but persistent exceptions are not good. Maybe we could do something like the following as simple way to detect persistent problems.
public void run() {
int exceptionCount = 0;
while (keepRunning.get()) {
try {
store.runnable(keepRunning, fateId -> {
while (keepRunning.get()) {
try {
// The reason for calling transfer instead of queueing is avoid rescanning the
// storage layer and adding the same thing over and over. For example if all threads
// were busy, the queue size was 100, and there are three runnable things in the
// store. Do not want to keep scanning the store adding those same 3 runnable things
// until the queue is full.
if (workQueue.tryTransfer(fateId, 100, MILLISECONDS)) {
break;
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new IllegalStateException(e);
}
}
});
// made a successful pass, so clear exception count
exceptionCount = 0;
} catch (Exception e) {
exceptionCount++;
if(exceptionCount > 30){
// repeatedly failing to find work, something is really wrong so throw an error to cause the process to exit
throw new ThreadPools.ExecutionError("Too many failures in workFinder", e);
}
if (keepRunning.get()) {
log.warn("Failure while attempting to find work for fate", e);
// TODO maybe sleep here so that do not retry immediately
} else {
log.debug("Failure while attempting to find work for fate", e);
}
workQueue.clear();
}
}
}
I just found it strange when I was testing what would happen if an InterruptedException occurred that the Manager logs would be flooded with AccumuloExceptions and seemingly leave the WorkFinder broken.
Overall the current behavior is not great. The way the code is structured with a loop it is ignoring interrupts and at the same time resetting them so it can ignore them in the future. For this situation seems like it would be best to change the catch block to the following dropping resetting the interrupted status.
} catch (InterruptedException e) {
throw new IllegalStateException(e);
}
I opened #4994
fixes an issue with how new fate transaction runners are created if one dies. Prior to this commit, the check for dead transaction runners would only occur once. This now occurs periodically.
Another thing I looked at was the death of the
WorkFinder
and how that affects the Manager. Upon looking atAccumuloUncaughtExceptionHandler
and testing this, only Errors halt the JVM so the Manager continues to run normally if an Exception is thrown and uncaught in theWorkFinder
. Should this be the case? It seems that the Manager should either halt or create a newWorkFinder
and not continue to run normally if an Exception is uncaught. I'm wondering if this is a problem elsewhere as well. Or maybe this isn't a problem at all. Some insight into this would be appreciated.this PR in conjunction with #4994 close #4906