Closed by wujingyue 3 months ago
$ _bn && mpirun -np 6 bin/test_multidevice --gtest_filter=DistributedTransformerTest.MLP_Layer/3
unknown file: Failure
C++ exception with description "Expected H % D == 0 to be true, but got false.
Exception raised from DistributedTransformerTest at /opt/pytorch/nvfuser/tests/cpp/test_multidevice_transformer.cpp:38 (most recent call first):
frame #0: <unknown function> + 0x9b531 (0x5634c93b8531 in bin/test_multidevice)
frame #1: <unknown function> + 0x7a4b0b (0x5634c9ac1b0b in bin/test_multidevice)
frame #2: <unknown function> + 0x7f78e1 (0x5634c9b148e1 in bin/test_multidevice)
frame #3: <unknown function> + 0x7e36a4 (0x5634c9b006a4 in bin/test_multidevice)
frame #4: <unknown function> + 0x7e461b (0x5634c9b0161b in bin/test_multidevice)
frame #5: <unknown function> + 0x7ecc74 (0x5634c9b09c74 in bin/test_multidevice)
frame #6: <unknown function> + 0x7e3c75 (0x5634c9b00c75 in bin/test_multidevice)
frame #7: <unknown function> + 0x11b7cc (0x5634c94387cc in bin/test_multidevice)
frame #8: <unknown function> + 0x29d90 (0x7fc960df7d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: __libc_start_main + 0x80 (0x7fc960df7e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #10: <unknown function> + 0x11f365 (0x5634c943c365 in bin/test_multidevice)
" thrown in the test fixture's constructor.
I'll parameterize the input sizes so they scale with the number of GPUs and the `H % D == 0` check holds for any device count. Thanks for catching this!
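One way to read "scale with the number of GPUs" is to derive the sharded size from the device count instead of hard-coding it. The sketch below is only an illustration of that idea, not the actual fix: `scaledSize` and `isShardable` are hypothetical helpers that do not exist in nvFuser, and it assumes `H` is the dimension being split across `D` devices, as the error message suggests.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not an nvFuser API): builds a sharded dimension as
// per_device * num_devices, so it is divisible by num_devices by construction.
inline int64_t scaledSize(int64_t per_device, int64_t num_devices) {
  if (per_device <= 0 || num_devices <= 0) {
    throw std::invalid_argument("per_device and num_devices must be positive");
  }
  return per_device * num_devices;
}

// Hypothetical helper mirroring the failing check: true when H % D == 0.
inline bool isShardable(int64_t h, int64_t d) {
  return d > 0 && h % d == 0;
}
```

With a fixed size such as 16 (an assumed value, chosen only for illustration), `isShardable(16, 6)` is false, which matches the failure seen with `-np 6`. Computing the size as `scaledSize(2, D)` instead keeps the check true for any positive `D`.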