NVIDIA / NeMo-Run

A tool to configure, launch and manage your machine learning experiments.

Segmentation fault when using the dev container #16

Open jeffchy opened 2 months ago

jeffchy commented 2 months ago

Segmentation fault when using the dev container to train the LLM finetune recipe:

nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 strategies:244] Fixing mis-match between ddp-config & mcore-optimizer config
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:314] Rank 0 has data parallel group : [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:325] All data parallel group ranks with context parallel combined: [[0], [1], [2], [3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:328] Ranks 0 has data parallel rank: 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:336] Rank 0 has context parallel group: [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:339] All context parallel group ranks: [[0], [1], [2], [3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:340] Ranks 0 has context parallel rank: 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:347] Rank 0 has model parallel group: [0, 1, 2, 3]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:348] All model parallel group ranks: [[0, 1, 2, 3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:357] Rank 0 has tensor model parallel group: [0]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:361] All tensor model parallel group ranks: [[0], [1], [2], [3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:362] Rank 0 has tensor model parallel rank: 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:382] Rank 0 has pipeline model parallel group: [0, 1, 2, 3]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:394] Rank 0 has embedding group: [0, 3]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:400] All pipeline model parallel group ranks: [[0, 1, 2, 3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:401] Rank 0 has pipeline model parallel rank 0
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:402] All embedding group ranks: [[0, 1, 2, 3]]
nemo.collections.llm.api.finetune/0 [NeMo I 2024-08-28 07:01:29 megatron_init:403] Rank 0 has embedding rank: 0
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.bf16  False -> True
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.params_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.pipeline_dtype  None -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Config.autocast_dtype  None -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.bf16  False -> True
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.params_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.pipeline_dtype  None -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote Llama3Config8B.autocast_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote OptimizerConfig.params_dtype  torch.float32 -> torch.bfloat16
nemo.collections.llm.api.finetune/0 [NeMo W 2024-08-28 07:01:29 mixed_precision:195] Overwrote DistributedDataParallelConfig.grad_reduce_in_fp32  False -> True
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
nemo.collections.llm.api.finetune/0 `zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
nemo.collections.llm.api.finetune/0 `zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
nemo.collections.llm.api.finetune/0 `zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
nemo.collections.llm.api.finetune/0 [08/28/2024-07:01:35] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
nemo.collections.llm.api.finetune/0 [TensorRT-LLM] TensorRT-LLM version: 0.11.0
nemo.collections.llm.api.finetune/0 [08/28/2024-07:01:36] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
nemo.collections.llm.api.finetune/0 [TensorRT-LLM] TensorRT-LLM version: 0.11.0
nemo.collections.llm.api.finetune/0 [08/28/2024-07:01:36] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
nemo.collections.llm.api.finetune/0 [TensorRT-LLM] TensorRT-LLM version: 0.11.0
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
nemo.collections.llm.api.finetune/0 Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
nemo.collections.llm.api.finetune/0 ----------------------------------------------------------------------------------------------------
nemo.collections.llm.api.finetune/0 distributed_backend=nccl
nemo.collections.llm.api.finetune/0 All distributed processes registered. Starting with 4 processes
nemo.collections.llm.api.finetune/0 ----------------------------------------------------------------------------------------------------
nemo.collections.llm.api.finetune/0
nemo.collections.llm.api.finetune/0 [10-7-133-247:16170:0:17316] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 [10-7-133-247:16172:0:17317] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 [10-7-133-247:16171:0:17318] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 [10-7-133-247:15836:0:17315] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3800)
nemo.collections.llm.api.finetune/0 ==== backtrace (tid:  17316) ====
nemo.collections.llm.api.finetune/0  0 0x0000000000042520 __sigaction()  ???:0
nemo.collections.llm.api.finetune/0  1 0x00000000000736aa pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  2 0x00000000000766f1 pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  3 0x000000000005a30a ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  4 0x000000000005fe72 ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  5 0x000000000004cf9c pncclRedOpDestroy()  ???:0
nemo.collections.llm.api.finetune/0  6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
nemo.collections.llm.api.finetune/0  7 0x0000000000125a04 clone()  ???:0
nemo.collections.llm.api.finetune/0 =================================
nemo.collections.llm.api.finetune/0 ==== backtrace (tid:  17317) ====
nemo.collections.llm.api.finetune/0  0 0x0000000000042520 __sigaction()  ???:0
nemo.collections.llm.api.finetune/0  1 0x00000000000736aa pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  2 0x00000000000766f1 pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  3 0x000000000005a30a ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  4 0x000000000005fe72 ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  5 0x000000000004cf9c pncclRedOpDestroy()  ???:0
nemo.collections.llm.api.finetune/0  6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
nemo.collections.llm.api.finetune/0  7 0x0000000000125a04 clone()  ???:0
nemo.collections.llm.api.finetune/0 =================================
nemo.collections.llm.api.finetune/0 ==== backtrace (tid:  17318) ====
nemo.collections.llm.api.finetune/0  0 0x0000000000042520 __sigaction()  ???:0
nemo.collections.llm.api.finetune/0  1 0x00000000000736aa pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  2 0x00000000000766f1 pncclCommDeregister()  ???:0
nemo.collections.llm.api.finetune/0  3 0x000000000005a30a ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  4 0x000000000005fe72 ncclCommAbort()  ???:0
nemo.collections.llm.api.finetune/0  5 0x000000000004cf9c pncclRedOpDestroy()  ???:0
nemo.collections.llm.api.finetune/0  6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
nemo.collections.llm.api.finetune/0  7 0x0000000000125a04 clone()  ???:0
jeffchy commented 2 months ago

Solved by using the 24.07 image, installing NeMo-Run manually, and upgrading NeMo (built from source).

hemildesai commented 2 months ago

Thanks @jeffchy for creating the issue. Glad to know you were able to fix it. Please let us know if you run into this issue again. Is it OK to close the issue for now, since you were able to solve it?

jeffchy commented 2 months ago

I'm able to get past the phase I mentioned above, but it then raises a checkpoint error.

ericharper commented 2 months ago

@jeffchy is that the same error as above or a new one? Could you share it if it's new?

jeffchy commented 2 months ago

It's a new one; I'll try to reproduce the error.

jeffchy commented 2 months ago

Update: I can successfully run the latest pretraining recipe: https://github.com/NVIDIA/NeMo/blob/main/examples/llm/run/llama3_pretraining.py

but it failed when I tried to use finetune_recipe with my own model. I replaced hf_resume() with:

from nemo import lightning as nl
from nemo_run import Config

def hf_resume() -> Config[nl.AutoResume]:
    return Config(nl.AutoResume, import_path="hf://{my local model path}")

And I got

llama3-8b/0 [default3]:[rank3]:     self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
llama3-8b/0 [default3]:[rank3]:   File "/workspace/NeMo/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 636, in load_optimizer_state_dict
llama3-8b/0 [default3]:[rank3]:     optimizer_states = checkpoint["optimizer"]
llama3-8b/0 [default3]:[rank3]: KeyError: 'optimizer'

I'm not familiar with NeMo; maybe I got something wrong?

marcromeyn commented 2 months ago

import_path is a special argument that's intended only for HF -> NeMo model conversion. If your model was already trained using NeMo, you don't need it. In that case you can use path instead of import_path.
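
For example, a minimal sketch of the distinction, reusing the Config/nl imports from the snippet above (the checkpoint path is a hypothetical placeholder):

def nemo_resume() -> Config[nl.AutoResume]:
    # This checkpoint was produced by NeMo training itself,
    # so `path` is used rather than `import_path`.
    return Config(nl.AutoResume, path="/results/checkpoints/my_nemo_ckpt")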

jeffchy commented 1 month ago

Thanks for your reply, but if I have a custom fine-tuned HF model (stored locally), how do I start from it? Do I need to convert it in advance?
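
One possible route is converting the checkpoint ahead of time with llm.import_ckpt; a rough sketch, assuming this NeMo version accepts a local directory via the hf:// scheme and that the model is Llama 3 8B (paths are placeholders):

from nemo.collections import llm

if __name__ == "__main__":
    # Convert a local Hugging Face checkpoint into NeMo format ahead of time;
    # the model config must match the checkpoint being imported.
    llm.import_ckpt(
        model=llm.LlamaModel(config=llm.Llama3Config8B()),
        source="hf:///path/to/local/hf/model",  # placeholder local path
    )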