NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

Set `torch.multiprocessing` start method as 'spawn' #1285

Open hxdtest opened 2 weeks ago

hxdtest commented 2 weeks ago

Set the `torch.multiprocessing` start method to 'spawn'. Otherwise the following error is raised:

```
Megatron-LM/megatron/core/extensions/transformer_engine.py", line 957, in get_cpu_offload_context
    context, sync_func = _get_cpu_offload_context(
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/cpu_offload.py", line 502, in get_cpu_offload_context
    cpu_offload_handler = AsyncDoubleBufferGroupOffloadHandler(
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/cpu_offload.py", line 312, in __init__
    self.d2h_stream = torch.cuda.Stream()
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/streams.py", line 35, in __new__
    return super().__new__(cls, priority=priority, **kwargs)
RuntimeError: CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
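A minimal sketch of the proposed fix (not the actual Megatron-LM patch): `torch.multiprocessing` is a drop-in wrapper around the standard `multiprocessing` module, so the same `set_start_method` API applies. The key point is that 'fork' lets child processes inherit the parent's already-initialized CUDA context, which fails when a child then creates a CUDA stream; 'spawn' starts children from a fresh interpreter so CUDA initializes cleanly. The example below uses the stdlib module to illustrate the call:

```python
import multiprocessing as mp

# Sketch, assuming the entry point of the training script. The start
# method must be set before any worker process (or CUDA context) is
# created; force=True overrides a method set earlier in the process.
# With torch installed, `import torch.multiprocessing as mp` exposes
# the same API.
if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    print(mp.get_start_method())  # 'spawn'
```

Note that 'spawn' re-imports the main module in each child, so module-level code that should run only once must be guarded by `if __name__ == "__main__":`.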