NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Cannot Save mamba model in distributed training #1234

Open siriusctrl opened 2 weeks ago

siriusctrl commented 2 weeks ago

Describe the bug While saving a Mamba-based model, the distributed optimizer reports a validation error about dt_bias.

To Reproduce Start training a Mamba model and run it for a few steps.

Expected behavior The checkpoint should be saved without a problem

Stack trace/logs

  from torch.distributed._sharded_tensor import ShardedTensor as TorchShardedTensor
  [ERROR    | megatron.core.dist_checkpointing.validation]: Invalid access pattern for ShardedTensor(key='decoder.layers.channel_mixing.mixer.dt_bias', dtype=torch.bfloat16, local_shape=(128,), global_shape=(2, 128), global_offset=(0, 0), axis_fragmentations=(2, 1), replica_id=(0, 0, 0), prepend_axis_num=1, allow_shape_mismatch=False, flattened_range=None): tensor([[1],
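
For anyone decoding the message: the shard metadata describes a local (128,) dt_bias that is supposed to cover one of two fragments of a (2, 128) global tensor (axis_fragmentations=(2, 1), global_offset=(0, 0), with the fragmented axis prepended). A plausible reading of the validator, offered as an assumption rather than a verified trace, is that it counts how many main-replica shards claim each fragment and prints that count matrix whenever an entry differs from 1; the truncated tensor([[1], ... would be the start of that matrix. A tiny self-contained sketch of the counting idea, using plain torch and hypothetical values only:

    import torch

    # Shard metadata copied from the error above: the (2, 128) global dt_bias is
    # split into 2 fragments along the prepended axis (axis_fragmentations=(2, 1));
    # this rank's shard claims the fragment at index (0, 0).
    axis_fragmentations = (2, 1)
    claimed_fragments = [(0, 0)]   # fragment indices claimed by main-replica shards,
                                   # as gathered from all ranks (hypothetical: one claim)

    # Count how many main-replica shards claim each fragment. A valid access
    # pattern has every entry equal to exactly 1: 0 means a fragment would never
    # be written, >1 means it would be written more than once.
    counts = torch.zeros(axis_fragmentations, dtype=torch.int64)
    for frag in claimed_fragments:
        counts[frag] += 1

    print(counts)   # e.g. tensor([[1], [0]]) -> fragment (1, 0) is never saved
    assert (counts == 1).all(), "Invalid access pattern"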

Environment (please complete the following information):

Proposed fix No proposed fix

Additional context No additional context

duncanriach commented 1 week ago

Could you please provide a minimal script to reproduce this?
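
Something self-contained along the lines of the sketch below would be ideal: it drives megatron.core.dist_checkpointing directly, on two ranks, and manufactures an overlapping shard declaration to trip the same validator. It is only an illustration of the kind of script wanted, not a reproduction of the Mamba code path; the from_rank_offsets/save arguments, the checkpoint path, and the deliberately wrong offset are assumptions for the sake of the example.

    # Launch with: torchrun --nproc_per_node=2 repro_invalid_access.py
    import os
    import torch
    import torch.distributed as dist
    from megatron.core import dist_checkpointing

    def main():
        dist.init_process_group(backend='gloo')  # CPU is enough for the validation path
        rank = dist.get_rank()

        ckpt_dir = '/tmp/repro_invalid_access_ckpt'  # hypothetical path
        if rank == 0:
            os.makedirs(ckpt_dir, exist_ok=True)
        dist.barrier()

        local_dt_bias = torch.zeros(128, dtype=torch.bfloat16)

        # A correct 2-way split would pass `rank` as the position in axis 0, so that
        # rank 0 owns fragment 0 and rank 1 owns fragment 1. Declaring the same
        # position on both ranks leaves one fragment claimed twice and the other
        # never, which is the class of error shown in the log above.
        bad_position = 0  # should be `rank`
        sharded_state_dict = {
            'dt_bias': dist_checkpointing.ShardedTensor.from_rank_offsets(
                'decoder.layers.channel_mixing.mixer.dt_bias',
                local_dt_bias,
                (0, bad_position, 2),  # (axis, position within axis, number of fragments)
            ),
        }

        # Saving validates sharding integrity and should fail with an
        # "Invalid access pattern for ShardedTensor(...)" error like the one reported.
        dist_checkpointing.save(sharded_state_dict, ckpt_dir)

    if __name__ == '__main__':
        main()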

mikolajblaz commented 1 week ago

Also, please paste the rest of the error (I think there will be at least one more continuation line for tensor([[1], ...).