duli2012 opened this issue 1 year ago
Hi Du. For the AMD backend, upstream AIT currently breaks a lot of things. We have fixed them in our AMD fork of AIT: https://github.com/ROCmSoftwarePlatform/AITemplate, and we plan to upstream the fixes in the near future. So for now you can use the AMD fork to try those examples, and feel free to reach out if you run into any further problems.
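A quick sanity check before running the examples can avoid confusion about which install is actually active. The sketch below assumes the fork keeps upstream AIT's `aitemplate.testing.detect_target` API; it simply prints where `aitemplate` was imported from and which backend target is detected.

```python
# Sanity-check sketch: confirm the AMD fork is the install being picked up
# and that AITemplate detects a ROCm target on this machine.
import aitemplate
from aitemplate.testing import detect_target  # assumed to match upstream AIT's API

print("aitemplate imported from:", aitemplate.__file__)   # should point at the fork's install
print("detected backend target:", detect_target().name()) # expect "rocm" on an AMD GPU
```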
Thanks, Carlus! I tried the AMD fork. However, when I just run `python3 benchmark_ait.py`, it fails with the following error:

```
ERROR:concurrent.futures:exception calling callback for <Future at 0x7fd46f049610 state=finished returned tuple>
Traceback (most recent call last):
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 328, in _invoke_callbacks
    callback(self)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/backend/profiler_runner.py", line 263, in callback_when_done
    raise RuntimeError(f"Failed to extract profiler result for {cmds}")
RuntimeError: Failed to extract profiler result for ['./tmp/profiler/bmm_softmax_bmm_permute/batched_gemm_softmax_gemm_permute_1_hhh_TNT_256_64_256_64_64_32_8_8_2_16_16_1_16_4_PT', '768', '384', '384', '64', '64', '12']
```
Du
@duli2012 Sorry for the late reply. This happens because a kernel configuration that is not applicable to the given problem size interrupts the current profiling run, but it should not affect the whole build process. In the end this example should still be able to run.
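To make the error above less alarming: the RuntimeError comes from the profiling stage, where each candidate kernel configuration is compiled and timed, and a configuration that cannot handle the given problem size yields no parsable result. The sketch below is purely illustrative (it is not AITemplate's actual `profiler_runner` code; `parse_runtime_ms` and the output format are assumptions) and only shows why one failing candidate can be dropped while the overall search still succeeds.

```python
import re
import subprocess

def parse_runtime_ms(output: str) -> float:
    """Pull a 'TIME: <ms>' line out of profiler output (illustrative format)."""
    match = re.search(r"TIME:\s*([0-9.]+)", output)
    if match is None:
        raise ValueError("no timing found in profiler output")
    return float(match.group(1))

def profile_candidates(cmds_list):
    """Run each candidate profiler binary and keep only the configs that
    produce a usable timing. (Illustrative sketch only -- not AITemplate's
    real profiler_runner.)"""
    results = {}
    for cmds in cmds_list:
        try:
            proc = subprocess.run(cmds, capture_output=True, text=True, check=True)
            results[tuple(cmds)] = parse_runtime_ms(proc.stdout)
        except (subprocess.CalledProcessError, ValueError, OSError):
            # A kernel config that cannot handle this problem size simply drops
            # out of the search space; the best remaining config still wins.
            continue
    return results
```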
@carlushuang Hey, sorry about the breakage. I wonder if it makes sense to add a ROCm CI to guard against it.
@carlushuang Please send a PR and we'll merge your fix into upstream, thanks!
@yinghai @ipiszy Thanks for mentioning the ROCm backend CI. We have set up CI in the ROCm fork and it works fine. There are several bug fixes needed to make everything work, as well as some porting of graph optimizations from the CUDA backend to the ROCm backend.
As for upstreaming the ROCm fixes: sure! We are going through some internal process for the upstreaming, but it's definitely doable. Will get back to this topic after the holiday.
cc @asroy
Hello,
I ran the BERT example on MI-250x using the command: `python3 examples/03_bert/benchmark_ait.py --batch-size 32 --seq-length 512 --encoders-only false`
However, it aborted with the following errors:

```
./tmp/BERT_fast_gelu_32_512/batch_gather_1.cpp:27: int64_t (anonymous namespace)::GetInOffset(const int64_t, const K *, const int64_t, const int64_t, const int64_t) [K = long]: Device-side assertion `idx >= 0 && idx < gather_dim_size' failed.
./tmp/BERT_fast_gelu_32_512/batch_gather_1.cpp:27: int64_t (anonymous namespace)::GetInOffset(const int64_t, const K *, const int64_t, const int64_t, const int64_t) [K = long]: Device-side assertion `idx >= 0 && idx < gather_dim_size' failed.
:0:rocdevice.cpp :2614: 2346741543469 us: 3634 : [tid:0x7fcd4f378700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
Aborted (core dumped)
```
Does anyone know how to fix it? BTW, I'm using the AIT docker.
Thanks a lot.
Du
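For anyone hitting the same abort: the device-side assertion in `batch_gather_1.cpp` fires when at least one index fed to the batch_gather op falls outside the size of the gathered dimension. Below is a minimal host-side NumPy sketch of the equivalent check (the helper name and the token-id example are hypothetical, chosen only to illustrate the `idx >= 0 && idx < gather_dim_size` condition); running inputs through something like this can help narrow down which tensor carries the out-of-range indices.

```python
import numpy as np

def check_gather_indices(indices: np.ndarray, gather_dim_size: int) -> None:
    """Host-side analogue of the device assertion
    `idx >= 0 && idx < gather_dim_size` in the generated batch_gather kernel."""
    bad = (indices < 0) | (indices >= gather_dim_size)
    if bad.any():
        raise ValueError(
            f"{bad.sum()} gather indices out of range "
            f"[0, {gather_dim_size}): e.g. {indices[bad][:5]}"
        )

# Example: indices used by an embedding-style gather must stay below the
# size of the gathered dimension (vocab size used here only as an illustration).
vocab_size = 30522
token_ids = np.random.randint(0, vocab_size, size=(32, 512), dtype=np.int64)
check_gather_indices(token_ids, vocab_size)
```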