GoogleCloudPlatform / cloudml-samples

Cloud ML Engine repo. Please visit the new Vertex AI samples repo at https://github.com/GoogleCloudPlatform/vertex-ai-samples
https://cloud.google.com/ai-platform/docs/
Apache License 2.0

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm) #451

Closed rishab-sharma closed 5 years ago

rishab-sharma commented 5 years ago

Bug Description: I encounter this error, which causes training to break. PyTorch uses shared memory, which is why this issue arises.

Possible Solution: Passing --ipc=host or --shm-size to docker run may fix this, but we don't invoke docker run ourselves during gcloud training, so there is nowhere to set these flags.

Please suggest a possible solution for this or fix your example accordingly.

Error: ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)
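
As a quick sanity check, here is a minimal diagnostic sketch (not part of the sample) that logs the capacity of the shared-memory mount at startup, assuming a Linux container with shared memory mounted at /dev/shm:

```python
import os

def log_shm_size(path="/dev/shm"):
    """Print the capacity of the shared-memory mount so we can confirm
    whether the container is stuck at Docker's 64 MB default."""
    stats = os.statvfs(path)
    total_mb = stats.f_frsize * stats.f_blocks / (1024 ** 2)
    free_mb = stats.f_frsize * stats.f_bavail / (1024 ** 2)
    print(f"{path}: total={total_mb:.0f} MB, free={free_mb:.0f} MB")

if __name__ == "__main__":
    log_shm_size()
```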

nnegrey commented 5 years ago

Hi, which sample are you referring to?

rishab-sharma commented 5 years ago

Hi, I am referring to the PyTorch custom container sample.

The sample fails for large models, which need more shared memory when num_workers > 0 in the DataLoader. The solutions suggested online all involve increasing --shm-size during docker run, but with gcloud training we don't have that option.

Ref: https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/pytorch/containers/custom_container
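
For context, the setting under discussion looks roughly like this in PyTorch; the dataset and sizes below are illustrative stand-ins, not the sample's actual code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in dataset; the actual sample loads its own data.
dataset = TensorDataset(torch.randn(256, 3, 224, 224),
                        torch.randint(0, 10, (256,)))

# With num_workers > 0, worker processes hand batches to the main process
# through shared memory (/dev/shm). If the container caps shm at Docker's
# 64 MB default, large batches can trigger the bus error above.
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

# num_workers=0 avoids the crash, but all loading happens in the main
# process, leaving the GPU waiting on data.
safe_loader = DataLoader(dataset, batch_size=64, num_workers=0)
```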

nnegrey commented 5 years ago

Gotcha, have you played around with the different scale tiers and machine types?

https://cloud.google.com/ml-engine/docs/machine-types

The sample only uses BASIC to keep costs down for people exploring how to use AI Platform: "Basic: A single worker instance. This tier is suitable for learning how to use AI Platform and for experimenting with new models using small datasets."

rishab-sharma commented 5 years ago

Yes, I am using a custom scale tier with a standard V100 GPU, which in my experience is enough for training large models. But with my DataLoader workers set to 0, training takes longer and therefore costs more, and if I increase num_workers, training crashes at a random point in the training routine.

If there is a possible solution for this, please suggest one.
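
One application-level tweak that is sometimes suggested in PyTorch issue threads, with mixed results and not verified against this sample, is to change the worker sharing strategy:

```python
import torch.multiprocessing as mp

# Ask PyTorch to share tensors between worker and main processes via the
# file-system strategy instead of the default file-descriptor strategy.
# Reports on whether this sidesteps a small shm limit vary, so treat it
# as an experiment rather than a fix.
mp.set_sharing_strategy("file_system")
```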

rishab-sharma commented 5 years ago

Also, is there any way to pass arguments to the docker run that is automatically invoked by the gcloud beta ai-platform jobs submit training command?

nnegrey commented 5 years ago

So, I did some digging and chatting.

Right now, no, you cannot pass arguments to docker run.

I haven't yet been able to find a workaround for this with AI Platform, but you could create your own GCE instance or Kubernetes cluster, which gives you more control over how the Docker container is run.

Still looking around.
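
If you do go the self-managed route, here is a sketch of launching the training container with a larger shm allocation via the Docker SDK for Python; the image name and entrypoint are placeholders:

```python
import docker

# Connect to the local Docker daemon on the self-managed VM.
client = docker.from_env()

# Image name and entrypoint are placeholders for whatever the custom
# container sample builds; the point is the shm/ipc settings.
container = client.containers.run(
    "gcr.io/my-project/pytorch-trainer",    # placeholder image
    command=["python", "trainer/task.py"],  # placeholder entrypoint
    shm_size="8g",     # same effect as `docker run --shm-size=8g`
    ipc_mode="host",   # same effect as `docker run --ipc=host`
    runtime="nvidia",  # assumes the NVIDIA container runtime is installed
    detach=True,
)
print(container.id)
```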

joaqo commented 5 years ago

I am having the same issue. This makes it impossible to train any PyTorch model that uses DataLoaders with num_workers > 0, which basically rules out any computer vision model unless you want your GPU to be tremendously underutilized.

PyTorch's DataLoaders use multiprocessing and shared memory to load data into the model. Docker's default shared memory per container is 64 MB (--shm-size), which is far too low. This is independent of the scale tier / machine type.
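
To put the 64 MB default in perspective, a rough back-of-the-envelope calculation (the batch shape and dtype are illustrative):

```python
# How much shared memory does a single batch of float32 images need?
batch_size = 64
channels, height, width = 3, 224, 224
bytes_per_float32 = 4

batch_bytes = batch_size * channels * height * width * bytes_per_float32
print(f"One batch: ~{batch_bytes / 2**20:.1f} MiB")  # ~36.8 MiB

# With several workers prefetching, multiple batches sit in /dev/shm at
# once, so Docker's 64 MB default is exhausted almost immediately.
```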

andrewferlitsch commented 5 years ago

The AI Platform product team is aware of the issue -- thanks for your patience.

joaqo commented 5 years ago

Awesome, thanks!

andrewferlitsch commented 5 years ago

Closing as a future feature requirement.

sshrdp commented 2 years ago

This has been fixed in the service. Please reopen if you still have the issue.