IDSIA-NLP / RAG_pretraining

experiments on RAG during pretraining

ChildFailedError during Elastic Launch in PyTorch Distributed #3

Open coree opened 1 month ago

coree commented 1 month ago

Encountered torch.distributed.elastic.multiprocessing.errors.ChildFailedError during the index-train step, when running bash tools/retro/examples/preprocess_data.sh index-train.

The run command in preprocess_data.sh:

######## Command. ########

NPROCS=1 # Number of GPUs.
CMD="\
    cd ${REPO_DIR} && pwd && \
    export PYTHONPATH=$PYTHONPATH:${REPO_DIR} && \
    python -m torch.distributed.run \
    --nproc_per_node ${NPROCS} \
    --nnodes 1 \
    --master_port 6000 \
    tools/retro/main.py ${ARGS} \
"

echo "~~~~~~~~~~~~~~~~~~~~~~~~~~"
echo "CMD = '$CMD'."
echo "~~~~~~~~~~~~~~~~~~~~~~~~~~"
eval $CMD

More specifications will be provided when Todi is back online.

Stack trace/logs

munmap_chunk(): invalid pointer
0: E0808 12:25:47.074000 70369073825888 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: -6) local_rank: 0 (pid: 292633) of binary: /usr/bin/python
0: Traceback (most recent call last):
0:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
0:     return _run_code(code, main_globals, None,
0:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
0:     exec(code, run_globals)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 883, in <module>
0:     main()
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
0:     return f(*args, **kwargs)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
0:     run(args)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
0:     elastic_launch(
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
0:     return launch_agent(self._config, self._entrypoint, list(args))
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
0:     raise ChildFailedError(
0: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Environment

Proposed fixes and additional context

The error occurred while running the index-train step; when rerunning the script, it fails at different execution stages, which may indicate a memory issue.

It's reported that comparable errors can have multiple causes, and there is no "easy fix". Some potential fixes have been discussed in the following issues:

The most common causes include:

Reported solutions include:


Improve debugging by:

export TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO

or as suggested here:

from torch.distributed.elastic.multiprocessing.errors import record

# Decorate the entry point so exceptions raised in the child process are
# written to an error file and reported by the elastic launcher.
@record
def main():
    ...
coree commented 1 week ago

After adding the @record decorator (from torch.distributed.elastic.multiprocessing.errors import record) in the tools/retro/index/build.py file, we get a more informative error output.
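
Concretely, the decorator can be applied to the function that tools/retro/main.py ends up calling; a minimal sketch, assuming the train_index entry point in tools/retro/index/build.py (consistent with the traceback below, though the exact placement is an assumption):

from torch.distributed.elastic.multiprocessing.errors import record

# Hypothetical placement: wrap the index-training entry point so that the
# real traceback from the failing child process is recorded and printed.
@record
def train_index():
    # ... existing index-training logic from build.py ...
    ...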

The error munmap_chunk(): invalid pointer is not directly related to torch.distributed but seems to be caused by Megatron's initialize.py and its overall setup/configuration process. The error occurs specifically when calling faiss.index_factory() during Megatron's initialization.

I tested various index configurations for faiss.index_factory, and the issue persisted. However, when testing faiss.index_factory() in isolation with torch.distributed, it worked fine. This indicates the problem is specific to the Megatron environment.
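
For reference, the standalone check looked roughly like the following. This is a sketch launched with torchrun --nproc_per_node 1; the embedding dimension (1024) and the factory string ("IVF4096_HNSW32,Flat") are placeholders rather than the exact Retro configuration:

import torch
import faiss


def main():
    # Same single-GPU distributed setup the preprocessing script uses.
    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(torch.distributed.get_rank())

    # Build an (untrained) index on the CPU, as Retro does before training.
    index = faiss.index_factory(1024, "IVF4096_HNSW32,Flat")
    print("index_factory succeeded, is_trained =", index.is_trained)

    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    main()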

The error appears to arise when using megatron/initialize.py to set up the environment, which involves CUDA device management (torch.cuda.set_device()), distributed initialization (torch.distributed.init_process_group()), and Megatron's fused kernels and custom operations. The error itself suggests a memory deallocation issue, pointing to improper memory handling during Megatron's initialization when interacting with FAISS.

In standalone tests, faiss.index_factory() works without issue, but it fails within Megatron's initialization process, likely due to a configuration or memory management problem specific to Megatron's setup.

Stack trace/log:

0: WARNING:__main__:
0: *****************************************
0: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
0: *****************************************
0: Zarr-based strategies will not be registered because of missing packages
0: munmap_chunk(): invalid pointer
0: Fatal Python error: Aborted
0: 
0: Thread 0x00004002e8e5f120 (most recent call first):
0:   File "/usr/lib/python3.10/selectors.py", line 416 in select
0:   File "/usr/lib/python3.10/multiprocessing/connection.py", line 931 in wait
0:   File "/usr/lib/python3.10/concurrent/futures/process.py", line 385 in wait_result_broken_or_wakeup
0:   File "/usr/lib/python3.10/concurrent/futures/process.py", line 320 in run
0:   File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
0:   File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
0: 
0: Current thread 0x000040000fcb0860 (most recent call first):
0:   File "/usr/local/lib/python3.10/dist-packages/faiss/swigfaiss.py", line 10838 in index_factory
0:   File "/users/user1/workspace/Megatron-LM/tools/retro/index/indexes/faiss_base.py", line 55 in _train
0:   File "/users/user1/workspace/Megatron-LM/tools/retro/index/indexes/faiss_base.py", line 81 in train
0:   File "/users/user1/workspace/Megatron-LM/tools/retro/index/build.py", line 114 in train_on_embeddings
0:   File "/users/user1/workspace/Megatron-LM/tools/retro/index/build.py", line 139 in train_index
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
0:   File "/capstor/scratch/cscs/user1/workspace/Megatron-LM/tools/retro/main.py", line 224 in <module>
0: 
0: Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, regex._regex, cython.cimports.libc.math, Cython.Utils, Cython.Plex.Actions, Cython.Plex.Transitions, Cython.Plex.Machines, Cython.Plex.DFA, Cython.Plex.Scanners, Cython.Compiler.Scanning, Cython.StringIOTree, Cython.Compiler.Code, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, google._upb._message, markupsafe._speedups, PIL._imaging, PIL._imagingft, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, sentencepiece._sentencepiece, scipy._lib._ccallback_c, numpy.linalg.lapack_lite, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.stats._unuran.unuran_wrapper, sklearn.__check_build._check_build, sklearn.utils.murmurhash, psutil._psutil_linux, psutil._psutil_posix, sklearn.utils._isfinite, sklearn.utils._openmp_helpers, sklearn.utils._logistic_sigmoid, sklearn.utils.sparsefuncs_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.utils._typedefs, sklearn.utils._readonly_array_wrapper, sklearn.metrics._dist_metrics, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_fast, faiss._swigfaiss (total: 189)
0: W0910 11:22:28.007000 70369117669472 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 45320 closing signal SIGTERM
0: W0910 11:22:28.009000 70369117669472 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 45321 closing signal SIGTERM
0: W0910 11:22:28.014000 70369117669472 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 45322 closing signal SIGTERM
0: E0910 11:22:28.712000 70369117669472 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: -6) local_rank: 0 (pid: 45319) of binary: /usr/bin/python
0: Traceback (most recent call last):
0:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
0:     return _run_code(code, main_globals, None,
0:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
0:     exec(code, run_globals)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 883, in <module>
0:     main()
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
0:     return f(*args, **kwargs)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
0:     run(args)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
0:     elastic_launch(
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
0:     return launch_agent(self._config, self._entrypoint, list(args))
0:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
0:     raise ChildFailedError(
0: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
0: ======================================================
0: tools/retro/main.py FAILED
0: ------------------------------------------------------
0: Failures:
0:   <NO_OTHER_FAILURES>
0: ------------------------------------------------------
0: Root Cause (first observed failure):
0: [0]:
0:   rank      : 0 (local_rank: 0)
0:   exitcode  : -6 (pid: 45319)
0:   error_file: <N/A>
0:   traceback : Signal 6 (SIGABRT) received by PID 45319
0: ======================================================
srun: error: nid007416: task 0: Exited with exit code 1
srun: Terminating StepId=530039.0