Open BillyBag2 opened 4 years ago
I have exactly the same problem. Did you find out what the problem was? My manifest file is already quite small and I would not like to reduce it any further.
Hi @BillyBag2 and @koles289,
I had this exact same error while following this tutorial and was totally confused. I had trained with the same process before, but this time I was getting this weird error.
Then I looked through my all_augmented.json and saw that some rows were null; I had made a mistake in the augmentation step. Once I fixed it so that those rows contained proper manifest entries, my model trained.
Basically, you are probably hitting a similar small formatting error in your all_augmented.json or validation.manifest, such as empty or null rows.
Hope this helps.
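For anyone who wants to check their manifest for the kind of rows described above, here is a minimal sketch of a validator. It assumes the manifest is a JSON Lines file (one JSON object per line) and that each row should carry a "source-ref" key; the function name find_bad_manifest_rows and the required_keys parameter are made up for illustration, so adjust them to your own label attribute names.

```python
import json


def find_bad_manifest_rows(lines, required_keys=("source-ref",)):
    """Return (line_number, reason) pairs for rows that look unusable.

    `lines` is any iterable of strings, e.g. an open manifest file.
    `required_keys` is an assumption: keys every row is expected to have.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            problems.append((number, "empty line"))
            continue
        try:
            row = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        # A literal `null` parses fine but is not a manifest entry.
        if not isinstance(row, dict):
            problems.append(
                (number, f"expected a JSON object, got {type(row).__name__}")
            )
            continue
        missing = [key for key in required_keys if key not in row]
        if missing:
            problems.append((number, "missing keys: " + ", ".join(missing)))
    return problems
```

You could run it by passing an open file, e.g. find_bad_manifest_rows(open("all_augmented.json")), and printing the returned pairs. It only checks the row structure; it does not verify that the images exist or that the annotations are valid, so a clean result does not rule out other causes of the timeout.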
I believe this is not related to empty or null rows at all (I have none in my data). I keep getting the same error on the same manifest files.
Details of my experiment: I did a test run with 10 epochs and it finished successfully. I then cloned that job, increased only the number of epochs, and it keeps failing with the timeout error.
I hope there's a solution to this somewhere. I can't find any so far, and the error messages aren't that helpful. A wild guess is that the instance itself is crashing, but I can't guess why.
Using the object detection algorithm (ResNet-50) with AugmentedManifestFile input, I get a timeout error (see the log below, filtered on "pipe") after about an hour of training. If I reduce the manifest size, I do not get this error. (I'm also getting the warning
[03/19/2020 16:13:34 WARNING 139923161179968] Expected number of batches: 1932, did not match the number of batches processed: 616. This may happen when some images or annotations are invalid and cannot be parsed. Please check the dataset and ensure it follows the format in the documentation.
not sure if it is related.)
18000
36000
Enabled
ml.p2.xlarge
(Memory and disk are not running out.)
[03/19/2020 15:59:12 INFO 139923161179968] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.
[03/19/2020 15:59:12 INFO 139923161179968] The channel 'validation' is in pipe input mode under /opt/ml/input/data/validation.
[03/19/2020 15:59:12 INFO 139923161179968] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.
[03/19/2020 15:59:12 INFO 139923161179968] The channel 'validation' is in pipe input mode under /opt/ml/input/data/validation.
[15:59:12] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1081.0/AL2012/generic-flavor/src/data_iter/src/ease_det_image_iter.cpp:41: ImageDetRecordIOParser: pipe:///opt/ml/input/data/train, use 3 threads for decoding..
[15:59:13] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1081.0/AL2012/generic-flavor/src/data_iter/src/ease_det_image_iter.cpp:41: ImageDetRecordIOParser: pipe:///opt/ml/input/data/validation, use 3 threads for decoding..
Platform Error: SageMaker pipe channel timed out.
[16:55:10] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1081.0/AL2012/generic-flavor/src/data_iter/src/channel.cpp:124: (Platform Error) FIFO 4 of the SageMaker pipe channel '/opt/ml/input/data/train' timed out.
Platform Error: SageMaker pipe channel timed out.
[16:55:10] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1081.0/AL2012/generic-flavor/src/data_iter/src/channel.cpp:124: (Platform Error) FIFO 4 of the SageMaker pipe channel '/opt/ml/input/data/train' timed out.