Closed: keien closed this issue 5 years ago
@keien Thanks for the bug report! I'm labelling this as a bug and we'll update this thread when we have a resolution.
@keien I'm going to close this issue since it has already been solved by https://github.com/aws/aws-parallelcluster-node/pull/94 and released in version 2.2.1.
The same issue was also reported in https://github.com/aws/aws-parallelcluster/issues/566 and https://github.com/aws/aws-parallelcluster/issues/743.
Please let us know if you have any questions.
Environment:
Bug description and how to reproduce: I've found that when AWS reclaims spot instances, the cleanup sometimes happens correctly and leaves no zombie nodes behind, while in other cases it leaves zombie nodes that hold onto jobs in the `r` state.

Yesterday one of our users started a massive job involving some 50+ p3.2xlarge spot instances, which are highly volatile; when I checked this morning there were some 15+ zombie nodes. I saw some unusual logs in `/var/log/sqswatcher`, so I thought I'd report it. See below:

Additional context: We had a bunch of these as spot instances were being taken away from us.
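For anyone hitting the same symptom before upgrading, here is a hedged sketch of how one might flag candidate zombie nodes from `qhost` output on an SGE-based cluster. The hostnames and sample output below are fabricated for illustration, and the parsing assumes the modern Grid Engine column layout (where an unreachable host shows `-` in the LOAD column); adjust the field index if your `qhost` prints fewer columns.

```python
# Sketch: flag SGE hosts that look like zombies -- hosts that no longer
# report a load average (shown as "-" in `qhost`) but are still registered
# with the cluster. SAMPLE_QHOST is made-up example output, not real logs.
SAMPLE_QHOST = """\
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
ip-10-0-0-11            lx-amd64        8    1    4    8  0.12   61.0G    2.1G     0.0     0.0
ip-10-0-0-42            lx-amd64        8    1    4    8     -   61.0G       -       -       -
"""

def zombie_hosts(qhost_output: str) -> list:
    """Return hostnames whose LOAD column is '-', i.e. no scheduler heartbeat."""
    zombies = []
    for line in qhost_output.splitlines()[2:]:      # skip the two header lines
        fields = line.split()
        if not fields or fields[0] == "global":     # "global" is a pseudo-host
            continue
        hostname, load = fields[0], fields[6]       # LOAD is the 7th column here
        if load == "-":                             # unreachable: likely a zombie
            zombies.append(hostname)
    return zombies

print(zombie_hosts(SAMPLE_QHOST))                   # -> ['ip-10-0-0-42']
```

In practice you would feed this the output of `qhost` captured on the master node, then cross-check the flagged hosts against `qstat` to see which jobs are stuck in the `r` state on them before removing the nodes.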