This seemed to happen exclusively on the compute-optimized (`c3`) instances, where the memory per core is lowest...
@raphaelhoffmann Now I am getting abort messages (`Aborting. Fatal error: run() received nonzero return code 123 while executing!`) from some of the nodes running, but can't find anything in the logs... (in previous instances of this I at least saw that the `run.sh` processes had been killed; now I can't find any error output passed back to me by Fabric...)
Have you seen this before / any ideas?
This might be a specific error with my data / something in the XML parsing I did to it... but either way the current distribution setup seems to make it difficult to trace errors back, or to do a partial restart, when they do occur
Even just knowing which servers were aborting would be helpful; apparently Fabric doesn't return that by default, but I'll see if they have an option
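A minimal sketch of what that could look like with Fabric 1.x (the task name is made up, and `run.sh` stands in for the real remote command): wrap the call in `warn_only` so one bad node doesn't abort the whole run, then print the failing host and return code.

```python
# Sketch only, assuming Fabric 1.x; task name and command are placeholders.
from fabric.api import env, run, settings, task

@task
def parse_segment():
    with settings(warn_only=True):        # don't abort the fab run on a nonzero exit
        result = run('bash run.sh')       # placeholder for the actual parse command
    if result.failed:
        # report which host died and with what code
        print('FAILED on %s (return code %s)' % (env.host_string, result.return_code))
```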
I haven't experienced this particular error before, but I do remember encountering the memory/core issue. CoreNLP just needs so much memory.
Regarding debugging: It would be great if stderr were logged on each node and also sent back to the submission node. Not sure how it is handled now (maybe not at all?), but this would be a great feature.
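A rough sketch of that idea, again assuming Fabric 1.x (the log path and task name are just placeholders): redirect stderr to a file on each node, then `get()` it back to the submission node tagged with the hostname.

```python
# Sketch only, assuming Fabric 1.x; paths and task name are placeholders.
from fabric.api import env, run, settings, get, local, task

@task
def parse_with_stderr_log():
    with settings(warn_only=True):
        run('bash run.sh 2> /tmp/parse_stderr.log')   # keep stderr on the node
    local('mkdir -p logs')                            # collect logs on the submission node
    get('/tmp/parse_stderr.log', 'logs/%s-stderr.log' % env.host_string)
```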
@raphaelhoffmann Yeah, regarding crashes due to the high memory usage of CoreNLP: this seemed to only have been a problem on EC2 on the compute-optimized instances (e.g. `c3.4xlarge`), which have half the memory per core of the general-purpose instances (e.g. `m3.2xlarge`).
However, a simple `grep "Killed" fab_parse.log | wc -l` shows nothing this time, so something else has to be happening... and the problem is I don't know on which nodes or for which segments it failed. So really it would be fine if the abort error also spit out the segment that failed...
So I can look through the `xargs` & `fab` documentation to see if there's a way to do this...
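In the meantime, a small diagnostic task like the following (a Fabric 1.x sketch; the task name and commands are just one way to do it) could be run across all nodes to check for OOM kills and full disks directly:

```python
# Sketch only, assuming Fabric 1.x; task name is a placeholder.
from fabric.api import run, settings, task

@task
def diagnose():
    with settings(warn_only=True):
        run('dmesg | grep -i "killed process" | tail -n 5')   # recent OOM-killer activity
        run('df -h /')                                        # disk space on the root volume
```

e.g. invoked as `fab -H host1,host2 diagnose`.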
@raphaelhoffmann Stored the logs and looked through them a bit more; should have caught this before. The aborts I ran into were all caused by either out-of-memory or out-of-disk-space errors... see #11
In the parse operation, if the parallelism / batch size is set too high, the run aborts because worker processes get killed...
I suppose a rough solution is to leave a large enough memory overhead... but "large enough" is poorly defined...
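One way to make "large enough" concrete is to size the parallelism from memory rather than core count; the per-process figure below is an assumption (measure a single CoreNLP run first), not something taken from the logs:

```python
# Back-of-the-envelope sketch; MEM_PER_PROC_GB is an assumed CoreNLP footprint.
TOTAL_MEM_GB = 30        # e.g. a c3.4xlarge (16 vCPUs, 30 GB)
MEM_PER_PROC_GB = 4      # assumed per-process CoreNLP footprint; measure first
HEADROOM_GB = 4          # leave room for the OS and everything else

parallelism = max(1, int((TOTAL_MEM_GB - HEADROOM_GB) // MEM_PER_PROC_GB))
print(parallelism)       # 6 on these numbers, vs. 16 cores on a c3.4xlarge
```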