NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Cannot Save mamba model in distributed training #1234

Open siriusctrl opened 2 weeks ago

siriusctrl commented 2 weeks ago

Describe the bug While saving a Mamba-based model, the distributed optimizer reports a validation error about dt_bias.

To Reproduce Start training a Mamba model and run it for a few steps.

Expected behavior The checkpoint should be saved without a problem

Stack trace/logs

  from torch.distributed._sharded_tensor import ShardedTensor as TorchShardedTensor
  [ERROR    | megatron.core.dist_checkpointing.validation]: Invalid access pattern for ShardedTensor(key='decoder.layers.channel_mixing.mixer.dt_bias', dtype=torch.bfloat16, local_shape=(128,), global_shape=(2, 128), global_offset=(0, 0), axis_fragmentations=(2, 1), replica_id=(0, 0, 0), prepend_axis_num=1, allow_shape_mismatch=False, flattened_range=None): tensor([[1],
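
For anyone decoding the message: the shard metadata describes a local (128,) dt_bias that is supposed to cover one of two fragments of a (2, 128) global tensor (axis_fragmentations=(2, 1), global_offset=(0, 0), with the fragmented axis prepended). A plausible reading of the validator, offered as an assumption rather than a verified trace, is that it counts how many main-replica shards claim each fragment and prints that count matrix whenever an entry differs from 1; the truncated tensor([[1], ... would be the start of that matrix. A tiny self-contained sketch of the counting idea, using plain torch and hypothetical values only:

    import torch

    # Shard metadata copied from the error above: the (2, 128) global dt_bias is
    # split into 2 fragments along the prepended axis (axis_fragmentations=(2, 1));
    # this rank's shard claims the fragment at index (0, 0).
    axis_fragmentations = (2, 1)
    claimed_fragments = [(0, 0)]   # fragment indices claimed by main-replica shards,
                                   # as gathered from all ranks (hypothetical: one claim)

    # Count how many main-replica shards claim each fragment. A valid access
    # pattern has every entry equal to exactly 1: 0 means a fragment would never
    # be written, >1 means it would be written more than once.
    counts = torch.zeros(axis_fragmentations, dtype=torch.int64)
    for frag in claimed_fragments:
        counts[frag] += 1

    print(counts)   # e.g. tensor([[1], [0]]) -> fragment (1, 0) is never saved
    assert (counts == 1).all(), "Invalid access pattern"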

Environment (please complete the following information):

Proposed fix No proposed fix

Additional context No additional context

duncanriach commented 1 week ago

Could you please provide a minimal script to reproduce this?
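
Something self-contained along the lines of the sketch below would be ideal: it drives megatron.core.dist_checkpointing directly, on two ranks, and manufactures an overlapping shard declaration to trip the same validator. It is only an illustration of the kind of script wanted, not a reproduction of the Mamba code path; the from_rank_offsets/save arguments, the checkpoint path, and the deliberately wrong offset are assumptions for the sake of the example.

    # Launch with: torchrun --nproc_per_node=2 repro_invalid_access.py
    import os
    import torch
    import torch.distributed as dist
    from megatron.core import dist_checkpointing

    def main():
        dist.init_process_group(backend='gloo')  # CPU is enough for the validation path
        rank = dist.get_rank()

        ckpt_dir = '/tmp/repro_invalid_access_ckpt'  # hypothetical path
        if rank == 0:
            os.makedirs(ckpt_dir, exist_ok=True)
        dist.barrier()

        local_dt_bias = torch.zeros(128, dtype=torch.bfloat16)

        # A correct 2-way split would pass `rank` as the position in axis 0, so that
        # rank 0 owns fragment 0 and rank 1 owns fragment 1. Declaring the same
        # position on both ranks leaves one fragment claimed twice and the other
        # never, which is the class of error shown in the log above.
        bad_position = 0  # should be `rank`
        sharded_state_dict = {
            'dt_bias': dist_checkpointing.ShardedTensor.from_rank_offsets(
                'decoder.layers.channel_mixing.mixer.dt_bias',
                local_dt_bias,
                (0, bad_position, 2),  # (axis, position within axis, number of fragments)
            ),
        }

        # Saving validates sharding integrity and should fail with an
        # "Invalid access pattern for ShardedTensor(...)" error like the one reported.
        dist_checkpointing.save(sharded_state_dict, ckpt_dir)

    if __name__ == '__main__':
        main()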

mikolajblaz commented 1 week ago

Also, please paste the rest of the error (I think there will be at least one more continuation line for tensor([[1], ...).