Hello, I have a dataset of 7T tokens which, when run through the gpt-neox codebase, creates about 5000 .npy files. I can train a 7B model on this with 32 GPUs, but when I try to use 64 GPUs I get an error saying too many files have been opened, hitting the max open files limit. I believe a file descriptor is opened for each of the ~5000 .npy files by every GPU rank and dataloader worker, so the more GPUs, the more open files. Has anyone else run into a similar limit? The current limit reported by ulimit -n is 1048576.
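For reference, here's a rough Linux-only sketch of how I'm checking this per process (not gpt-neox code; the pid argument is whichever rank or dataloader worker you inspect). It counts the entries in /proc/<pid>/fd and reads the "Max open files" limit that process actually inherited, since the interactive ulimit -n isn't necessarily what remotely launched ranks see:

```python
# Rough Linux-only sketch: count open file descriptors for a given PID and
# read the "Max open files" limit that process actually inherited.
import os
import sys


def open_fd_count(pid: int) -> int:
    # Each entry in /proc/<pid>/fd is one open descriptor.
    return len(os.listdir(f"/proc/{pid}/fd"))


def npy_fd_count(pid: int) -> int:
    # Count only the descriptors that resolve to .npy files.
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        if target.endswith(".npy"):
            count += 1
    return count


def nofile_limit(pid: int) -> str:
    # /proc/<pid>/limits shows the soft/hard limits of that specific process.
    with open(f"/proc/{pid}/limits") as f:
        return next(line for line in f if line.startswith("Max open files")).rstrip()


if __name__ == "__main__":
    pid = int(sys.argv[1])
    print("open fds :", open_fd_count(pid))
    print(".npy fds :", npy_fd_count(pid))
    print(nofile_limit(pid))
```

Running this against one rank of the 32-GPU job vs. the 64-GPU job should show whether the .npy descriptor count really scales with GPU count, or whether the ranks are inheriting a lower limit than the 1048576 I see interactively.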
Here's the error I got:
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/socket.py", line 546, in fromfd
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/threading.py", line 975, in run
GPUCA6E: nfd = dup(fd)
GPUCA6E: self._target(*self._args, **self._kwargs)
GPUCA6E: ^^^^^^^
GPUCA6E: OSError: [Errno 24] Too many open files
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
GPUCA6E: do_one_step()
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
GPUCA6E: r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
GPUCA6E: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/queues.py", line 122, in get
GPUCA6E: return _ForkingPickler.loads(res)
GPUCA6E: ^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
GPUCA6E: fd = df.detach()
GPUCA6E: ^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/resource_sharer.py", line 58, in detach
GPUCA6E: return reduction.recv_handle(conn)
GPUCA6E: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 189, in recv_handle
GPUCA6E: return recvfds(s, 1)[0]
GPUCA6E: ^^^^^^^^^^^^^
GPUCA6E: File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 159, in recvfds
GPUCA6E: raise EOFError
GPUCA6E: EOFError
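From the trace it looks like the failure is on the file-descriptor path PyTorch workers use to hand tensors to the pin_memory thread (rebuild_storage_fd -> recv_handle -> recvfds), so every in-flight batch costs extra descriptors on top of whatever the memory-mapped .npy files use. One thing I'm considering (not verified, just a sketch of PyTorch's documented knob) is switching the tensor sharing strategy so batches are passed by filename instead of by open fds:

```python
# Sketch only: switch torch.multiprocessing from the default "file_descriptor"
# sharing strategy to "file_system", so worker->main tensor handoff stops
# consuming file descriptors. This is a documented PyTorch setting, but I
# haven't confirmed it fixes the 64-GPU case; it would need to run in every
# rank before the DataLoader workers are created.
import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'}
mp.set_sharing_strategy("file_system")
```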