
Error running Bert example on AMD #118

Open · duli2012 opened this issue 1 year ago

duli2012 commented 1 year ago

Hello,

I ran the Bert example on MI-250x with the following command:

python3 examples/03_bert/benchmark_ait.py --batch-size 32 --seq-length 512 --encoders-only false

However, it aborted with the following errors:

./tmp/BERT_fast_gelu_32_512/batch_gather_1.cpp:27: int64_t (anonymous namespace)::GetInOffset(const int64_t, const K *, const int64_t, const int64_t, const int64_t) [K = long]: Device-side assertion `idx >= 0 && idx < gather_dim_size' failed.
./tmp/BERT_fast_gelu_32_512/batch_gather_1.cpp:27: int64_t (anonymous namespace)::GetInOffset(const int64_t, const K *, const int64_t, const int64_t, const int64_t) [K = long]: Device-side assertion `idx >= 0 && idx < gather_dim_size' failed.
:0:rocdevice.cpp:2614: 2346741543469 us: 3634 : [tid:0x7fcd4f378700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
Aborted (core dumped)
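For reference, the assertion means the batch_gather kernel received an index outside [0, gather_dim_size). Below is a minimal host-side sketch of the same check, with purely illustrative names (check_gather_indices is not AITemplate API):

```python
# Host-side sketch of the condition the device asserts in batch_gather:
# every gather index must satisfy 0 <= idx < gather_dim_size.
import numpy as np

def check_gather_indices(indices: np.ndarray, gather_dim_size: int) -> None:
    bad = (indices < 0) | (indices >= gather_dim_size)
    if bad.any():
        raise ValueError(
            f"{int(bad.sum())} gather indices fall outside "
            f"[0, {gather_dim_size}): e.g. {indices[bad][:5].tolist()}"
        )

# An index equal to gather_dim_size (e.g. 512 with gather_dim_size=512)
# trips the same condition that aborts the kernel on device.
check_gather_indices(np.array([0, 3, 512], dtype=np.int64), gather_dim_size=512)
```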

Does anyone know how to fix it? BTW, I'm using the AIT docker.

Thanks a lot.

Du

carlushuang commented 1 year ago

Hi Du. For the AMD backend, the upstream AIT currently breaks a lot of things. We have fixed them in our AMD fork of AIT: https://github.com/ROCmSoftwarePlatform/AITemplate, and we plan to upstream the fixes in the near future. For now, please use the AMD fork to try those examples, and feel free to reach out if you run into any further problems.

duli2012 commented 1 year ago

Thanks, Carlus! I tried the AMD fork. However, when I just run python3 benchmark_ait.py, it encounters the following errors:

ERROR:concurrent.futures:exception calling callback for <Future at 0x7fd46f049610 state=finished returned tuple>
Traceback (most recent call last):
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 328, in _invoke_callbacks
    callback(self)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/backend/profiler_runner.py", line 263, in callback_when_done
    raise RuntimeError(f"Failed to extract profiler result for {cmds}")
RuntimeError: Failed to extract profiler result for ['./tmp/profiler/bmm_softmax_bmm_permute/batched_gemm_softmax_gemm_permute_1_hhh_TNT_256_64_256_64_64_32_8_8_2_16_16_1_16_4_PT', '768', '384', '384', '64', '64', '12']

Du

carlushuang commented 1 year ago

@duli2012 Sorry for the late reply. This happens because a kernel config that is not applicable on this GPU interrupts the current profiling run. It should not affect the whole build process, though; in the end the example should still be able to run.
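To illustrate, here is a minimal, self-contained sketch of the failure path visible in the traceback above; parse_profiler_output and the callback body are hypothetical stand-ins for the logic in aitemplate/backend/profiler_runner.py, not its actual code:

```python
# Sketch: how an unparseable profiler run surfaces as the RuntimeError above.
from concurrent.futures import ThreadPoolExecutor

def parse_profiler_output(stdout: str):
    # The real runner parses timings from the profiler binary's stdout.
    # A kernel config that cannot run on the current GPU emits nothing
    # usable, so parsing yields no result.
    return stdout.strip() or None

def callback_when_done(future, cmds):
    if parse_profiler_output(future.result()) is None:
        raise RuntimeError(f"Failed to extract profiler result for {cmds}")

with ThreadPoolExecutor() as pool:
    fut = pool.submit(str)  # returns "", like a non-applicable kernel
    # When a done-callback raises, concurrent.futures logs
    # "exception calling callback for <Future ...>" and swallows the error,
    # so other profiling work continues -- matching the behavior described above.
    fut.add_done_callback(lambda f: callback_when_done(f, ["<profiler cmd>"]))
```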

yinghai commented 1 year ago

@carlushuang Hey sorry about the breakage. I wonder if it makes sense to add a rocm CI to guard it.

ipiszy commented 1 year ago

@carlushuang Please send a PR and we'll merge your fix into upstream, thanks!

carlushuang commented 1 year ago

@yinghai @ipiszy Thanks for mentioning the rocm backend CI. We have set up CI in the rocm fork and it works fine. There were several bug fixes needed to make everything work, as well as some porting of graph optimizations from the cuda backend to the rocm backend.

As for upstreaming the rocm fixes: sure! We are going through some internal process around upstreaming, but that's definitely doable. We will get back to this topic after the holiday.

cc @asroy