aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

"Platform Error: SageMaker pipe channel timed out." after one hour of training. #1097

Open BillyBag2 opened 4 years ago

BillyBag2 commented 4 years ago

Using the object detection model (ResNet-50) with AugmentedManifestFile input, I get a timeout error after about an hour of training (see the log below, filtered on "pipe"). If I reduce the manifest size, I do not get this error.

I'm also getting the following warning; I'm not sure whether it is related:

[03/19/2020 16:13:34 WARNING 139923161179968] Expected number of batches: 1932, did not match the number of batches processed: 616. This may happen when some images or annotations are invalid and cannot be parsed. Please check the dataset and ensure it follows the format in the documentation.

(Memory and disk are not running out.)

[03/19/2020 15:59:12 INFO 139923161179968] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.

[03/19/2020 15:59:12 INFO 139923161179968] The channel 'validation' is in pipe input mode under /opt/ml/input/data/validation.

[03/19/2020 15:59:12 INFO 139923161179968] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.

[03/19/2020 15:59:12 INFO 139923161179968] The channel 'validation' is in pipe input mode under /opt/ml/input/data/validation.

[15:59:12] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1081.0/AL2012/generic-flavor/src/data_iter/src/ease_det_image_iter.cpp:41: ImageDetRecordIOParser: pipe:///opt/ml/input/data/train, use 3 threads for decoding..

[15:59:13] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1081.0/AL2012/generic-flavor/src/data_iter/src/ease_det_image_iter.cpp:41: ImageDetRecordIOParser: pipe:///opt/ml/input/data/validation, use 3 threads for decoding..

Platform Error: SageMaker pipe channel timed out.

[16:55:10] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1081.0/AL2012/generic-flavor/src/data_iter/src/channel.cpp:124: (Platform Error) FIFO 4 of the SageMaker pipe channel '/opt/ml/input/data/train' timed out.

Platform Error: SageMaker pipe channel timed out.

[16:55:10] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1081.0/AL2012/generic-flavor/src/data_iter/src/channel.cpp:124: (Platform Error) FIFO 4 of the SageMaker pipe channel '/opt/ml/input/data/train' timed out.

koles289 commented 4 years ago

I have exactly the same problem. Did you find out what the problem was? I have quite a small manifest file and would not like to reduce it further.

Raunak-Singh-Inventor commented 4 years ago

Hi @BillyBag2 and @koles289,

I had this exact same error while following this tutorial and was totally confused. I had trained with the same process before, but this time I was getting this weird error.

Then, I was looking through my all_augmented.json and saw that some rows were null. I was messing up in the augmentation. I fixed it and the rows that were null now contained proper manifest rows and my model trained.

Basically, you guys probably have a similar small formatting error in your all_augmented.json or validation.manifest, such as empty or null rows.
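A check along these lines would have caught the null rows. This is only a minimal sketch, not anything from the tutorial: the function name find_bad_manifest_rows is hypothetical, and it assumes the manifest is in JSON Lines format (one JSON object per line) with a "source-ref" key on every row. Your label attribute name will differ, so pass your own keys in required_keys.

```python
import json


def find_bad_manifest_rows(lines, required_keys=("source-ref",)):
    """Return (line_number, reason) pairs for rows that look malformed.

    Hypothetical helper: `lines` is any iterable of strings, such as an
    open manifest file. `required_keys` is an assumption about the rows;
    adjust it to include your own label attribute name.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            problems.append((number, "blank line"))
            continue
        try:
            row = json.loads(text)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        # A literal `null` row parses fine but is not a JSON object.
        if not isinstance(row, dict):
            problems.append(
                (number, f"expected a JSON object, got {type(row).__name__}")
            )
            continue
        missing = [key for key in required_keys if key not in row]
        if missing:
            problems.append((number, "missing keys: " + ", ".join(missing)))
    return problems
```

To use it, open the manifest (for example all_augmented.json) as a text file, pass the file object to find_bad_manifest_rows, and print the returned pairs. An empty list only means these basic checks passed; it does not prove the images or annotations themselves are valid.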

Hope this helps.

Raunak-Singh-Inventor commented 4 years ago

#1316

MuhammadMotawe commented 3 years ago

I believe this is not related to empty or null rows at all (I have none in my data). I keep getting the same error on the same manifest files.

Details of my experiment: I did a test run with 10 epochs and it finished successfully. I then cloned that job and only increased the number of epochs, and it keeps failing with the timeout error.

I hope there's a solution to this somewhere. I can't find one so far, and the error messages aren't very helpful. A wild guess is that the instance itself is crashing, but I can't guess why.