EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

what's the biggest dataset you've tried? #1253

Open exnx opened 4 months ago

exnx commented 4 months ago

Hello, I have a dataset of 7T tokens, which, when I run it through the gpt-neox codebase, creates about 5000 .npy files. I can train a 7B model on this with 32 GPUs, but when I try to use 64 GPUs, I get an error saying too many files are open, i.e. the limit on max open files was reached. I believe a file descriptor is opened for each of the ~5000 .npy files by every GPU process and dataloader worker, so the more GPUs, the more open files. Has anyone else run into a similar limit? The current limit reported by `ulimit -n` is 1048576.
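A quick sketch of how one can check the limits and current descriptor usage from inside a process (standard Linux/CPython interfaces, nothing gpt-neox-specific; the `/proc` paths are Linux-only):

```python
import os
import resource

# Per-process limit (what `ulimit -n` reports): soft and hard caps.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"per-process soft/hard limit: {soft} / {hard}")

# Descriptors currently open in this process.
print(f"currently open in this process: {len(os.listdir('/proc/self/fd'))}")

# System-wide ceiling across all processes; memory-mapping ~5000 shards
# in every rank and every dataloader worker counts against this too.
with open("/proc/sys/fs/file-max") as f:
    print(f"system-wide limit: {f.read().strip()}")
```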

Here's the error I got (two tracebacks were interleaved in the log; untangled here):

GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/threading.py", line 975, in run
GPUCA6E:     self._target(*self._args, **self._kwargs)
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/socket.py", line 546, in fromfd
GPUCA6E:     nfd = dup(fd)
GPUCA6E:           ^^^^^^^
GPUCA6E: OSError: [Errno 24] Too many open files

GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
GPUCA6E:     do_one_step()
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
GPUCA6E:     r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
GPUCA6E:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/queues.py", line 122, in get
GPUCA6E:     return _ForkingPickler.loads(res)
GPUCA6E:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
GPUCA6E:     fd = df.detach()
GPUCA6E:          ^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/resource_sharer.py", line 58, in detach
GPUCA6E:     return reduction.recv_handle(conn)
GPUCA6E:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 189, in recv_handle
GPUCA6E:     return recvfds(s, 1)[0]
GPUCA6E:            ^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 159, in recvfds
GPUCA6E:     raise EOFError
GPUCA6E: EOFError
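Not a confirmed fix for this repo, but the traceback ends in torch's `rebuild_storage_fd`, i.e. PyTorch's default `file_descriptor` sharing strategy, where every tensor handed from a dataloader worker to the pin-memory thread consumes a file descriptor. The two standard PyTorch-level workarounds are sketched below; whether they are sufficient for this setup is an assumption:

```python
import resource
import torch.multiprocessing as mp

# 1) Raise this process's soft NOFILE limit up to the hard limit
#    (equivalent to running `ulimit -n <hard>` before launching).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# 2) Share tensors through files in shared memory instead of passing
#    file descriptors between processes, so fds are not consumed per
#    tensor. Must run in every process before dataloaders are created.
mp.set_sharing_strategy("file_system")
```

The `file_system` strategy trades descriptor pressure for /dev/shm usage, so shared-memory capacity becomes the resource to watch instead.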