Oneflow-Inc / swin-transformer


[Bug] swin transformer eager global hits an NCCL S2S shape check failure #3

Open Ldpe2G opened 2 years ago

Ldpe2G commented 2 years ago

oneflow version: 0.6.0.dev20211215+cu102

Running the main_swin_eager_consistent_use_fake_data.sh script on the run_vit_graph_success branch reproduces the problem. The failure is raised by EagerNcclS2SKernel in eager_nccl_kernels.cu:

F1220 16:35:30.940591 26889 eager_nccl_kernels.cu:286] Check failed: in->shape().elem_cnt() == out->shape().elem_cnt() (75264 vs. 76800) (2,49,768) vs (4,25,768)
F1220 16:35:30.940574 26883 eager_nccl_kernels.cu:286] Check failed: in->shape().elem_cnt() == out->shape().elem_cnt() (75264 vs. 73728) (2,49,768) vs (4,24,768)
*** Check failure stack trace: ***
*** Check failure stack trace: ***
    @     0x7f0152c4f29d  google::LogMessage::Fail()
    @     0x7fe68429229d  google::LogMessage::Fail()
    @     0x7f0152c50b1a  google::LogMessage::SendToLog()
    @     0x7fe684293b1a  google::LogMessage::SendToLog()
    @     0x7f0152c4ed5d  google::LogMessage::Flush()
    @     0x7fe684291d5d  google::LogMessage::Flush()
    @     0x7f0152c51f49  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fe684294f49  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f014e5f777d  oneflow::EagerNcclS2SKernel<>::Compute()
    @     0x7fe67fc3a77d  oneflow::EagerNcclS2SKernel<>::Compute()
    @     0x7f014ce206e4  oneflow::vm::LocalCallOpKernelUtil::Compute()
    @     0x7fe67e4636e4  oneflow::vm::LocalCallOpKernelUtil::Compute()
    @     0x7f014cdecd2b  oneflow::vm::LocalCallOpKernelInstructionType::Compute()
    @     0x7fe67e42fd2b  oneflow::vm::LocalCallOpKernelInstructionType::Compute()
    @     0x7f014e21a3a2  oneflow::vm::CudaStreamType::Compute()
    @     0x7fe67f85d3a2  oneflow::vm::CudaStreamType::Compute()
    @     0x7f014e23928e  oneflow::vm::StreamType::Run()
    @     0x7fe67f87c28e  oneflow::vm::StreamType::Run()
    @     0x7f014e24fd15  oneflow::vm::VirtualMachineEngine::DispatchInstruction()
    @     0x7fe67f892d15  oneflow::vm::VirtualMachineEngine::DispatchInstruction()
    @     0x7f014e25217a  oneflow::vm::VirtualMachineEngine::DispatchAndPrescheduleInstructions()
    @     0x7fe67f89517a  oneflow::vm::VirtualMachineEngine::DispatchAndPrescheduleInstructions()
    @     0x7f014e2439f8  oneflow::VirtualMachine::Loop()
    @     0x7fe67f8869f8  oneflow::VirtualMachine::Loop()
    @     0x7f01532fdfdf  (unknown)
    @     0x7fe684940fdf  (unknown)
    @     0x7f02287d0ea5  start_thread
    @     0x7fe759e13ea5  start_thread
    @     0x7f0227df08dd  __clone
    @     0x7fe7594338dd  __clone
    @              (nil)  (unknown)
    @              (nil)  (unknown)
Killing subprocess 26376

clackhan commented 2 years ago

Is there an S(1) split somewhere? This error is raised when the dimension being split in the shape is not evenly divisible.
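The arithmetic behind that check can be reproduced without OneFlow. The sketch below is plain Python, not the kernel's code. It assumes a logical tensor of shape (4, 49, 768) on 2 ranks, which is inferred from the local shapes in the log rather than stated in the issue, and it assumes a balanced split in which lower ranks receive the remainder. The helper names `split_sizes`, `local_shapes` and `s2s_elem_counts` are hypothetical.

```python
from math import prod


def split_sizes(dim_size, num_ranks):
    """Balanced split of one dimension; lower ranks get the remainder (assumed policy)."""
    base, rem = divmod(dim_size, num_ranks)
    return [base + (1 if rank < rem else 0) for rank in range(num_ranks)]


def local_shapes(logical_shape, axis, num_ranks):
    """Per-rank local shapes when logical_shape is split along `axis`."""
    shapes = []
    for size in split_sizes(logical_shape[axis], num_ranks):
        shape = list(logical_shape)
        shape[axis] = size
        shapes.append(tuple(shape))
    return shapes


def s2s_elem_counts(logical_shape, in_axis, out_axis, num_ranks):
    """For each rank: (in_shape, out_shape, in_elem_cnt, out_elem_cnt)."""
    ins = local_shapes(logical_shape, in_axis, num_ranks)
    outs = local_shapes(logical_shape, out_axis, num_ranks)
    return [(i, o, prod(i), prod(o)) for i, o in zip(ins, outs)]


# Assumed logical shape and rank count, inferred from the log above.
for in_shape, out_shape, in_cnt, out_cnt in s2s_elem_counts((4, 49, 768), 0, 1, 2):
    status = "ok" if in_cnt == out_cnt else "mismatch"
    print(in_shape, "->", out_shape, in_cnt, "vs", out_cnt, status)
```

Under these assumptions, S(0) gives every rank (2, 49, 768), while S(1) has to split 49 into 25 and 24, so the per-rank element counts differ (75264 vs. 76800 and 75264 vs. 73728), which are the numbers in the two fatal log lines.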

Ldpe2G commented 2 years ago

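Is there an S(1) split somewhere? This error is raised when the dimension being split in the shape is not evenly divisible.

The whole model is B (broadcast) and the input is S(0). I haven't yet pinned down which step might have inferred a different sbp.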

clackhan commented 2 years ago

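The whole model is B (broadcast) and the input is S(0). I haven't yet pinned down which step might have inferred a different sbp.

Are there any index operations?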

Ldpe2G commented 2 years ago

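The whole model is B (broadcast) and the input is S(0). I haven't yet pinned down which step might have inferred a different sbp.

Are there any index operations?

There are a lot of view and index operations, but as far as I can tell the index operations all take the batch dimension in full.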

clackhan commented 2 years ago

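There are a lot of view and index operations, but as far as I can tell the index operations all take the batch dimension in full.

Index operations on a consistent tensor whose sbp is split currently have a bug. You need to convert the tensor to broadcast first.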
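A minimal sketch of that workaround follows. The helper name `index_after_broadcast` is hypothetical, and the call assumes the consistent tensor exposes `to_consistent(placement=..., sbp=...)` and a `placement` attribute, as in the 0.6-era OneFlow API; check it against the installed version. The broadcast sbp is passed in as a parameter so that the helper itself does not import oneflow.

```python
def index_after_broadcast(x, index, broadcast_sbp):
    """Convert a split consistent tensor to broadcast, then index it.

    x             : consistent tensor (assumed to provide .placement and
                    .to_consistent(placement=..., sbp=...))
    index         : anything accepted by x[...], e.g. (slice(None), 0)
    broadcast_sbp : the broadcast sbp object, e.g. flow.sbp.broadcast
    """
    # Keep the same placement and change only the sbp to broadcast.
    x_b = x.to_consistent(placement=x.placement, sbp=broadcast_sbp)
    # Indexing now runs on a broadcast tensor rather than a split one.
    return x_b[index]
```

With OneFlow imported as `flow`, a call might look like `y = index_after_broadcast(x, (slice(None), 0), flow.sbp.broadcast)`. The conversion gathers the full tensor on every rank, so it costs extra communication and memory, and the result can be converted back to a split sbp afterwards if later ops expect one.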