NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

DistributedTransformerTest.MLP_Layer/3 fails with 6 processes. #2623

Closed by wujingyue 3 months ago

wujingyue commented 3 months ago
$ _bn && mpirun -np 6 bin/test_multidevice --gtest_filter=DistributedTransformerTest.MLP_Layer/3
unknown file: Failure
C++ exception with description "Expected H % D == 0 to be true, but got false.
Exception raised from DistributedTransformerTest at /opt/pytorch/nvfuser/tests/cpp/test_multidevice_transformer.cpp:38 (most recent call first):
frame #0: <unknown function> + 0x9b531 (0x5634c93b8531 in bin/test_multidevice)
frame #1: <unknown function> + 0x7a4b0b (0x5634c9ac1b0b in bin/test_multidevice)
frame #2: <unknown function> + 0x7f78e1 (0x5634c9b148e1 in bin/test_multidevice)
frame #3: <unknown function> + 0x7e36a4 (0x5634c9b006a4 in bin/test_multidevice)
frame #4: <unknown function> + 0x7e461b (0x5634c9b0161b in bin/test_multidevice)
frame #5: <unknown function> + 0x7ecc74 (0x5634c9b09c74 in bin/test_multidevice)
frame #6: <unknown function> + 0x7e3c75 (0x5634c9b00c75 in bin/test_multidevice)
frame #7: <unknown function> + 0x11b7cc (0x5634c94387cc in bin/test_multidevice)
frame #8: <unknown function> + 0x29d90 (0x7fc960df7d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: __libc_start_main + 0x80 (0x7fc960df7e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #10: <unknown function> + 0x11f365 (0x5634c943c365 in bin/test_multidevice)
" thrown in the test fixture's constructor.
cowanmeg commented 3 months ago

I'll parameterize the input sizes so they scale better with the number of GPUs. Thanks for catching this!
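
For anyone reading this later, here is a rough sketch of what "scaling with the number of GPUs" could look like. It assumes D in the failing check is the number of processes/devices (6 in the command above) and H is a fixed size constant in the test; the thread does not define either. The helper names `scaledSize` and `isShardable` and the parameter `per_device` are hypothetical and not part of nvFuser. The idea is to derive the size as a multiple of D instead of hard-coding it, so `H % D == 0` holds for any device count.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Mirrors the condition from the failing check: H % D == 0.
// Hypothetical helper for illustration only.
inline bool isShardable(int64_t size, int64_t num_devices) {
  return num_devices > 0 && size % num_devices == 0;
}

// Hypothetical helper: derive the sharded size from the device count
// so that it is always an exact multiple of num_devices.
inline int64_t scaledSize(int64_t num_devices, int64_t per_device) {
  if (num_devices <= 0 || per_device <= 0) {
    throw std::invalid_argument(
        "num_devices and per_device must be positive");
  }
  const int64_t size = num_devices * per_device;
  // Postcondition: the result divides evenly across devices.
  assert(isShardable(size, num_devices));
  return size;
}
```

With this, `scaledSize(6, 4)` gives 24, which splits evenly across 6 processes. A fixed size such as 8 (an illustrative value, not taken from the test) splits across 2, 4, or 8 processes but not 6, which is the kind of mismatch the error above reports.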