Naver-AI-Hackathon / cs492I


ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm) #10

Closed trtrung closed 4 years ago

trtrung commented 4 years ago

I ran into a critical issue. When my training process reached ~40 epochs, the following error occurred:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)

This error terminated the training process.

I also ran the same code (same batch_size) on both NSML and my local machine. My local machine has only a TITAN X GPU (12GB of memory). There was no error on my local machine, and it even seemed to run faster than NSML.

Could you tell me what the problem is here?

nsmluser commented 4 years ago

There are three types of memory related to running NSML.

  1. System Memory
  2. Shared memory
  3. GPU memory

The NSML default settings are as follows:

  1. System memory: 24GB
  2. Shared memory: 1GB
  3. GPU: P40 (24GB, fixed size)

You can adjust the shared memory size as described in the document below. https://n-clair.github.io/ai-docs/_build/html/en_US/contents/session/run_a_session.html
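To confirm how much shared memory a session actually has, before and after changing the setting, something like the following may help. It is only a sketch: it assumes a Linux environment where shared memory is mounted at /dev/shm, and the helper name shm_usage_gib is made up for illustration, not part of NSML.

```python
import os
import shutil

GIB = 1024 ** 3


def shm_usage_gib(path="/dev/shm"):
    """Return (total, used, free) in GiB for the filesystem mounted at `path`.

    Hypothetical helper for illustration. /dev/shm is the usual Linux mount
    point for shared memory; that it applies inside an NSML session is an
    assumption.
    """
    usage = shutil.disk_usage(path)
    return usage.total / GIB, usage.used / GIB, usage.free / GIB


# Only report when the assumed mount point exists on this machine.
if os.path.isdir("/dev/shm"):
    total, used, free = shm_usage_gib()
    print(f"/dev/shm: total={total:.2f} GiB, used={used:.2f} GiB, free={free:.2f} GiB")
```

If the reported total is close to the 1GB default and the free value shrinks during training, the bus error is likely the shared memory limit being hit.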

In the current environment, the sum of system memory and shared memory should be kept below 30GB.
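As a quick sanity check of that limit before requesting a session, the arithmetic can be sketched as below. The function name fits_memory_budget and the constant are hypothetical; only the 30GB limit and the 24GB / 1GB defaults come from the reply above.

```python
# Limit quoted above: system memory + shared memory must stay below 30GB.
MEMORY_BUDGET_GB = 30


def fits_memory_budget(system_gb, shared_gb, budget_gb=MEMORY_BUDGET_GB):
    """Return True if the combined request stays strictly below the budget.

    Hypothetical helper for illustration; it does not talk to NSML.
    """
    return system_gb + shared_gb < budget_gb


# Defaults (24GB system + 1GB shared) are within the limit.
print(fits_memory_budget(24, 1))
# Raising shared memory to 6GB while keeping 24GB system memory reaches 30GB,
# which is not below the limit.
print(fits_memory_budget(24, 6))
```

With the default 24GB of system memory, this leaves room for a bit under 6GB of shared memory unless system memory is lowered as well.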

trtrung commented 4 years ago

Thank you! It worked.