While investigating #54 I made a minor change to check the socket connection within the readByteBuffer while loop. Race conditions might still appear, though, because the socket connection can also be closed from the Vector side (and in that case we don't see it unless we read from the socket). I don't know if it's pure luck (or the issue is just very hard to reproduce), but the "Connection to Vector end point has been closed or amount of data communicated does not match the message length" exceptions stopped showing up during my tests with spark.dynamicAllocation.executorIdleTimeout=1 (so executors are killed faster).
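To illustrate the idea, here is a minimal, self-contained sketch of a read loop that checks the connection on every iteration rather than only before the first read. This is not the connector's actual code: the `ReadableByteChannel` signature and the standalone object are assumptions made so the snippet compiles on its own; only the exception message mirrors the one quoted above.

```scala
import java.io.{ByteArrayInputStream, IOException}
import java.nio.ByteBuffer
import java.nio.channels.{Channels, ReadableByteChannel}

object ReadLoopSketch {
  private val ClosedMsg =
    "Connection to Vector end point has been closed or amount of " +
      "data communicated does not match the message length"

  // Hypothetical sketch: read exactly `len` bytes, verifying inside the
  // loop that the channel is still open. The peer (e.g. Vector) may close
  // the connection mid-message; checking per iteration surfaces that
  // promptly instead of only when the next full read fails.
  def readByteBuffer(channel: ReadableByteChannel, len: Int): ByteBuffer = {
    val buf = ByteBuffer.allocate(len)
    while (buf.hasRemaining) {
      if (!channel.isOpen) throw new IOException(ClosedMsg)
      // read() returns -1 on EOF, i.e. the peer closed before `len`
      // bytes arrived, so the message length no longer matches.
      if (channel.read(buf) < 0) throw new IOException(ClosedMsg)
    }
    buf.flip()
    buf
  }

  def main(args: Array[String]): Unit = {
    // Demo against an in-memory channel instead of a real socket.
    val channel =
      Channels.newChannel(new ByteArrayInputStream("hello".getBytes("UTF-8")))
    val buf = readByteBuffer(channel, 5)
    println(new String(buf.array, 0, buf.limit, "UTF-8"))
  }
}
```

The per-iteration check narrows, but does not eliminate, the race window: the channel can still be closed between the check and the read, which is why the EOF branch is kept as a second line of defense.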
The container exceptions in the attached log file (see mantis) are expected whenever containers get killed. The SQLListener exceptions, though, are due to a bug in the Spark WebUI; see https://issues.apache.org/jira/browse/SPARK-12339 (and its fix, https://github.com/apache/spark/pull/10405/files).
I think it is worth committing this change anyway.