Closed by trtrung 4 years ago
There are three types of memory related to running NSML.
NSML default settings are as follows:
1. System memory: 24GB
2. Shared memory: 1GB
3. GPU: P40 (24GB fixed size)
You can adjust the shared memory size as described in the document below: https://n-clair.github.io/ai-docs/_build/html/en_US/contents/session/run_a_session.html
In the current environment, the sum of "1. System memory + 2. Shared memory" should be kept below 30GB.
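As a quick sanity check before changing the settings, the constraint above can be sketched in Python. This is only an illustration: the function names are hypothetical and are not part of NSML; the 30GB limit and the 24GB/1GB defaults are taken from this reply.

```python
# Hypothetical helpers, not an NSML API: they only encode the rule
# "system memory + shared memory must stay below 30GB" from this thread.
LIMIT_GB = 30  # combined limit stated above for the current environment


def fits_memory_budget(system_gb, shared_gb, limit_gb=LIMIT_GB):
    """Return True if system + shared memory stays strictly below the limit."""
    return system_gb + shared_gb < limit_gb


def shared_memory_headroom_gb(system_gb, limit_gb=LIMIT_GB):
    """Return the exclusive upper bound for shared memory, in GB."""
    return max(limit_gb - system_gb, 0)


if __name__ == "__main__":
    # Defaults from this thread: 24GB system memory, 1GB shared memory.
    print(fits_memory_budget(24, 1))
    # With 24GB of system memory, shared memory must stay below this value.
    print(shared_memory_headroom_gb(24))
```

With the default 24GB of system memory, this suggests the shared memory setting has to stay under 6GB.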
Thank you! It worked.
I ran into a very critical issue. When my training process reached ~40 epochs, there was an error:
This error caused the training process to terminate.
I also ran the same code (same batch_size) on both NSML and my local machine. My local machine has only a TITAN X GPU (12GB RAM). There was no error on my local machine, and it seemed to run faster than NSML.
Could you tell me what the problem is here?