NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] OSError: [Errno 28] No space left on device #878

Closed zhaoyz1017 closed 1 week ago

zhaoyz1017 commented 1 week ago

When I was running:

I am not sure what causes this error or how to fix it.

deepakn94 commented 1 week ago

Are you running this in a docker container? What command if so?

zhaoyz1017 commented 1 week ago

Are you running this in a docker container? What command if so?

Yes, I run it in Docker:

docker run -it --name zhaomegatron -v /jfs/yuzhe.zhao:/home/zyz --gpus all nvcr.io/nvidia/pytorch:23.09-py3

zhaoyz1017 commented 1 week ago

Are you running this in a docker container? What command if so?

Thanks, I think this error is caused by the Docker container's shared memory limit. I fixed it by adding --shm-size="64g" to the docker run command.

deepakn94 commented 1 week ago

I believe --ipc=host should also work.
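
For anyone hitting the same error: Docker gives a container a 64 MB /dev/shm by default unless --shm-size or --ipc=host is passed, which is easy to exhaust when worker processes exchange tensors through shared memory. The sketch below is one possible way to check the free space on the shared-memory mount from inside the container before launching training. The function names and the 1 GiB threshold are illustrative assumptions and are not part of Megatron-LM.

```python
import shutil


def shm_free_gib(path="/dev/shm"):
    """Return the free space, in GiB, on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


def check_shm(min_free_gib=1.0, path="/dev/shm"):
    """Return (ok, free_gib); ok is True when at least min_free_gib is free.

    The 1 GiB default is an arbitrary illustrative threshold, not a
    documented requirement of any library.
    """
    free = shm_free_gib(path)
    return free >= min_free_gib, free


if __name__ == "__main__":
    try:
        ok, free = check_shm()
    except FileNotFoundError:
        # /dev/shm is a Linux convention and may be absent elsewhere.
        print("/dev/shm not found on this system")
    else:
        status = "looks sufficient" if ok else "may be too small"
        print(f"/dev/shm free: {free:.2f} GiB ({status})")
```

If the reported free space is close to 64 MB, restarting the container with a larger --shm-size (or with --ipc=host) as described above is the likely fix.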