IDSIA-NLP / RAG_pretraining

experiments on RAG during pretraining
ChildFailedError during Elastic Launch in PyTorch Distributed #3

Open coree opened 1 month ago

coree commented 1 month ago

Encountered torch.distributed.elastic.multiprocessing.errors.ChildFailedError during the index_train step, while running bash tools/retro/examples/ index-train.

The run command in the

######## Command. ########

NPROCS=1 # Number of GPUs.
CMD="\
    cd ${REPO_DIR} && pwd && \
    python -m \
    --nproc_per_node ${NPROCS} \
    --nnodes 1 \
    --master_port 6000 \
    tools/retro/ ${ARGS} \
"

echo "~~~~~~~~~~~~~~~~~~~~~~~~~~"
echo "CMD = '$CMD'."
echo "~~~~~~~~~~~~~~~~~~~~~~~~~~"
eval $CMD


Stack trace/logs

munmap_chunk(): invalid pointer
0: E0808 12:25:47.074000 70369073825888 torch/distributed/elastic/multiprocessing/] failed (exitcode: -6) local_rank: 0 (pid: 292633) of binary: /usr/bin/python
0: Traceback (most recent call last):
0:   File "/usr/lib/python3.10/", line 196, in _run_module_as_main
0:     return _run_code(code, main_globals, None,
0:   File "/usr/lib/python3.10/", line 86, in _run_code
0:     exec(code, run_globals)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/", line 883, in <module>
0:     main()
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/", line 347, in wrapper
0:     return f(*args, **kwargs)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/", line 879, in main
0:     run(args)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/", line 870, in run
0:     elastic_launch(
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/", line 132, in __call__
0:     return launch_agent(self._config, self._entrypoint, list(args))
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/", line 263, in launch_agent
0:     raise ChildFailedError(
0: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 


Proposed fixes and additional context

The error occurred while using the index-train script. When the script is rerun, the failure happens at different execution stages, which may indicate a memory issue.

It's reported that comparable errors can have multiple causes, and there is no "easy fix". Some potential fixes have been discussed in the following issues:

The most common causes include:

Reported solutions include:

Improve debugging by:


or as suggested here:

from torch.distributed.elastic.multiprocessing.errors import record

@record  # records the worker's exception so the launcher can report it
def main():
    ...

coree commented 1 week ago

After adding from torch.distributed.elastic.multiprocessing.errors import record in the tools/retro/index/ file and decorating the entry point with @record, we get a more informative error output.
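The thread dump in the log further down ("Fatal Python error: Aborted" followed by per-thread stacks) has the format that Python's standard faulthandler module prints. As a minimal sketch of how such a dump is produced, the snippet below starts a child interpreter with faulthandler enabled and makes it abort. The os.abort() call is only a stand-in for the native crash (an assumption for illustration); it is not what FAISS or Megatron actually do.

```python
import subprocess
import sys

# Hypothetical stand-in for a native crash: os.abort() raises SIGABRT,
# the signal reported as "Signal 6 (SIGABRT)" in the launcher summary.
CRASHING_SNIPPET = "import os; os.abort()"


def run_with_faulthandler(snippet: str) -> subprocess.CompletedProcess:
    """Run `snippet` in a child interpreter with faulthandler enabled."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
    )


result = run_with_faulthandler(CRASHING_SNIPPET)
print("return code:", result.returncode)  # negative: killed by a signal
print(result.stderr)  # contains the Python-level stack at the time of the abort
```

The same effect is available without changing the command line by calling faulthandler.enable() early in the entry point or by setting PYTHONFAULTHANDLER=1 in the job environment.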

The error munmap_chunk(): invalid pointer is not directly related to torch.distributed but seems to be caused by the file from Megatron and its overall setup/configuration process. The error occurs specifically when calling faiss.index_factory() during Megatron's initialization.

I tested various index configurations for faiss.index_factory, and the issue persisted. However, when testing faiss.index_factory() in isolation with torch.distributed, it worked fine. This indicates the problem is specific to the Megatron environment.

The error appears to arise when using the megatron/ to set up the environment, which involves CUDA device management (torch.cuda.set_device()), distributed initialization (torch.distributed.init_process_group()), and Megatron's fused kernels and custom operations. The message itself suggests a memory deallocation issue, pointing to improper memory handling when Megatron's initialization interacts with FAISS.

In standalone tests, faiss.index_factory() works without issue, but it fails within Megatron's initialization process, likely due to a configuration or memory management problem specific to Megatron's setup.
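One way to narrow this down further is to run each candidate setup in a fresh interpreter and record which ones die from a signal. The following is a minimal sketch of such a harness, not the test that was actually run for this issue. The two candidate snippets are hypothetical placeholders; in practice each string would contain one combination of initialization steps (for example, distributed initialization followed by faiss.index_factory()).

```python
import subprocess
import sys


def crashes_in_isolation(snippet: str, timeout: float = 60.0) -> bool:
    """Return True if `snippet` kills a fresh interpreter with a signal."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # On POSIX, a negative return code means the child was terminated by a signal.
    return proc.returncode < 0


# Hypothetical candidates: replace the strings with real initialization steps.
candidates = {
    "baseline": "x = 1 + 1",
    "simulated abort": "import os; os.abort()",
}

report = {name: crashes_in_isolation(code) for name, code in candidates.items()}
for name, crashed in report.items():
    print(f"{name}: {'CRASH' if crashed else 'ok'}")
```

Because every candidate runs in its own process, an abort in one of them does not take down the harness, and steps can be added one at a time until the crash reappears.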

Stack trace/log:

0: WARNING:__main__:
0: *****************************************
0: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
0: *****************************************
0: Zarr-based strategies will not be registered because of missing packages
0: munmap_chunk(): invalid pointer
0: Fatal Python error: Aborted
0: Thread 0x00004002e8e5f120 (most recent call first):
0:   File "/usr/lib/python3.10/", line 416 in select
0:   File "/usr/lib/python3.10/multiprocessing/", line 931 in wait
0:   File "/usr/lib/python3.10/concurrent/futures/", line 385 in wait_result_broken_or_wakeup
0:   File "/usr/lib/python3.10/concurrent/futures/", line 320 in run
0:   File "/usr/lib/python3.10/", line 1016 in _bootstrap_inner
0:   File "/usr/lib/python3.10/", line 973 in _bootstrap
0: Current thread 0x000040000fcb0860 (most recent call first):
0:   File "/usr/local/lib/python3.10/dist-packages/faiss/", line 10838 in index_factory
0:   File "/users/user1/workspace/Megatron-LM/tools/retro/index/indexes/", line 55 in _train
0:   File "/users/user1/workspace/Megatron-LM/tools/retro/index/indexes/", line 81 in train
0:   File "/users/user1/workspace/Megatron-LM/tools/retro/index/", line 114 in train_on_embeddings
0:   File "/users/user1/workspace/Megatron-LM/tools/retro/index/", line 139 in train_index
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/", line 347 in wrapper
0:   File "/capstor/scratch/cscs/user1/workspace/Megatron-LM/tools/retro/", line 224 in <module>
0: W0910 11:22:28.007000 70369117669472 torch/distributed/elastic/multiprocessing/] Sending process 45320 closing signal SIGTERM
0: W0910 11:22:28.009000 70369117669472 torch/distributed/elastic/multiprocessing/] Sending process 45321 closing signal SIGTERM
0: W0910 11:22:28.014000 70369117669472 torch/distributed/elastic/multiprocessing/] Sending process 45322 closing signal SIGTERM
0: E0910 11:22:28.712000 70369117669472 torch/distributed/elastic/multiprocessing/] failed (exitcode: -6) local_rank: 0 (pid: 45319) of binary: /usr/bin/python
0: Traceback (most recent call last):
0:   File "/usr/lib/python3.10/", line 196, in _run_module_as_main
0:     return _run_code(code, main_globals, None,
0:   File "/usr/lib/python3.10/", line 86, in _run_code
0:     exec(code, run_globals)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/", line 883, in <module>
0:     main()
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/", line 347, in wrapper
0:     return f(*args, **kwargs)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/", line 879, in main
0:     run(args)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/", line 870, in run
0:     elastic_launch(
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/", line 132, in __call__
0:     return launch_agent(self._config, self._entrypoint, list(args))
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/", line 263, in launch_agent
0:     raise ChildFailedError(
0: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
0: ======================================================
0: tools/retro/ FAILED
0: ------------------------------------------------------
0: Failures:
0: ------------------------------------------------------
0: Root Cause (first observed failure):
0: [0]:
0:   rank      : 0 (local_rank: 0)
0:   exitcode  : -6 (pid: 45319)
0:   error_file: <N/A>
0:   traceback : Signal 6 (SIGABRT) received by PID 45319
0: ======================================================
srun: error: nid007416: task 0: Exited with exit code 1
srun: Terminating StepId=530039.0
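
For reading the summary above: the launcher reports exitcode -6, and on POSIX a negative exit code from a child process is the negated number of the signal that killed it. A small helper (hypothetical, written only for illustration) makes the mapping explicit:

```python
import signal


def describe_exitcode(code: int) -> str:
    """Translate a child-process exit code into a readable description."""
    if code < 0:
        # Negative codes are -<signal number>, as reported by the launcher.
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"


print(describe_exitcode(-6))  # the worker process in the log
print(describe_exitcode(1))   # the srun task in the log
```

This is consistent with the "Signal 6 (SIGABRT) received" line: the worker was aborted by the C runtime after the munmap_chunk() check failed, and the launcher then exited with status 1.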