aws / sagemaker-spark-container

The SageMaker Spark Container is a Docker image used to run data processing workloads with the Spark framework on Amazon SageMaker.
Apache License 2.0

If primary is down abnormally, worker should exit with error #51

Open larroy opened 3 years ago

larroy commented 3 years ago

The worker doesn't exit with an error when the primary goes down abnormally, because the StatusMessage is not checked. Would it be possible to exit the workers with an error when the primary goes down abnormally?

See flow here:

https://github.com/aws/sagemaker-spark-container/blob/master/src/smspark/job.py#L185
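To illustrate the proposed behavior, here is a minimal sketch of a worker-side wait loop that propagates an abnormal primary termination as a non-zero exit code. The names `Status`, `get_primary_status`, and the polling approach are illustrative assumptions, not the actual `smspark` API.

```python
# Hypothetical sketch: a worker polls the primary's status and exits
# non-zero if the primary terminated abnormally, instead of reporting
# success unconditionally. Names below are illustrative, not smspark's API.
import sys
import time
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


def wait_on_primary(get_primary_status, poll_interval_sec=5):
    """Block until the primary reaches a terminal state; return that state."""
    while True:
        status = get_primary_status()
        if status != Status.RUNNING:
            return status
        time.sleep(poll_interval_sec)


def run_worker(get_primary_status):
    """Worker main loop: mirror the primary's terminal status."""
    status = wait_on_primary(get_primary_status, poll_interval_sec=1)
    if status == Status.FAILED:
        # Propagate the failure instead of exiting 0, so the worker
        # container is marked failed alongside the primary.
        sys.exit(1)
```

The key point is the final branch: today the worker path effectively skips the status check and exits cleanly either way.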

apacker commented 3 years ago

What impact does the worker not exiting with an error have? Presumably, if the primary goes down prematurely, it will exit with an error and cause the job to fail. Is that not the case?

larroy commented 3 years ago

@xgchena

larroy commented 3 years ago

I think it results in false success messages in the worker algos, which can cause confusion. For the primary, yes, it will exit with an error as you say, but I think we should fail the worker containers as well.

apacker commented 3 years ago

Understood, and agreed that this causes confusion. We'll have to update our shutdown logic to ensure the driver node communicates to the workers whether they should exit successfully.

Thanks for the feedback, we're working this into our roadmap internally.
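For illustration, the driver-to-worker signaling described above could look something like the following sketch, where the primary records its terminal status in a shared location and each worker maps that status to its own exit code. The file path, JSON schema, and function names are assumptions for this example, not the container's actual protocol.

```python
# Hypothetical sketch: the primary writes a terminal-status message on
# shutdown, and workers read it to decide their own exit code. The path
# and schema are illustrative assumptions.
import json
from pathlib import Path

STATUS_FILE = Path("/opt/ml/shared/primary_status.json")  # assumed shared volume


def primary_report(status: str, status_file: Path = STATUS_FILE) -> None:
    """Called by the primary on shutdown with 'completed' or 'failed'."""
    status_file.parent.mkdir(parents=True, exist_ok=True)
    status_file.write_text(json.dumps({"status": status}))


def worker_exit_code(status_file: Path = STATUS_FILE) -> int:
    """Called by a worker once the primary is gone; 0 only on clean completion."""
    if not status_file.exists():
        # Primary vanished without reporting: treat as abnormal termination.
        return 1
    status = json.loads(status_file.read_text()).get("status")
    return 0 if status == "completed" else 1
```

Treating a missing status message as a failure is the conservative choice here: an abnormal crash of the primary would then fail the workers too, which is exactly the behavior this issue asks for.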