Open siriusctrl opened 2 weeks ago
Describe the bug While saving a Mamba-based model, the distributed optimizer reports a validation error about dt_bias.
To Reproduce Start training Mamba and run it for a few steps.
Expected behavior The checkpoint should be saved without a problem
Stack trace/logs
from torch.distributed._sharded_tensor import ShardedTensor as TorchShardedTensor
[ERROR | megatron.core.dist_checkpointing.validation]: Invalid access pattern for ShardedTensor(key='decoder.layers.channel_mixing.mixer.dt_bias', dtype=torch.bfloat16, local_shape=(128,), global_shape=(2, 128), global_offset=(0, 0), axis_fragmentations=(2, 1), replica_id=(0, 0, 0), prepend_axis_num=1, allow_shape_mismatch=False, flattened_range=None): tensor([[1],
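For context, the "Invalid access pattern" validation essentially checks that the shards registered across ranks tile the global tensor exactly once per fragment. Below is a simplified, hypothetical sketch of such a check (it is not Megatron-LM's actual dist_checkpointing code; the function name and signature are illustrative only), using the dt_bias metadata from the error above:

```python
# Simplified sketch of shard-coverage validation -- illustration only,
# not Megatron-LM's actual implementation.
import itertools


def validate_coverage(axis_fragmentations, registered_offsets):
    """Each fragment of the global tensor must be covered exactly once
    by the registered shards; any other count is an invalid pattern."""
    coverage = {frag: 0 for frag in itertools.product(
        *(range(n) for n in axis_fragmentations))}
    for off in registered_offsets:
        coverage[off] += 1
    bad = {frag: count for frag, count in coverage.items() if count != 1}
    if bad:
        raise ValueError(f"Invalid access pattern: coverage counts {bad}")
    return True


# For dt_bias: global_shape=(2, 128) with axis_fragmentations=(2, 1),
# so fragments (0, 0) and (1, 0) must each be registered exactly once.
validate_coverage((2, 1), [(0, 0), (1, 0)])   # valid tiling, no error
# validate_coverage((2, 1), [(0, 0)])         # would raise: (1, 0) uncovered
```

In the reported error, one fragment of the dt_bias tensor apparently ends up with a coverage count other than one, which is consistent with the validator printing the per-fragment count tensor.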
Environment (please complete the following information):
Proposed fix No proposed fix
Additional context No additional context
Please will you provide a minimal script to repro this?
Also, please paste the rest of the error (I think there will be one more continuation line for tensor([[1], ...)