ROCm / AITemplate

AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Apache License 2.0

AIT BERT fails with --encoders-only False #36

Closed causten closed 1 year ago

causten commented 1 year ago

commit e282ff06b56609e8a0ee8925192520f8ecce9186 rocm-5.3.0

git clone --recursive https://github.com/facebookincubator/AITemplate
DOCKER_BUILDKIT=1 ./docker/build.sh rocm
alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --ipc=host --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME:/dockerx/'
drun ait:latest

I ran these commands inside the container...

  export ROCM_PATH=/opt/rocm
  export ROC_USE_FGS_KERNARG=0
  cd /dockerx/code/latestait/
  pip3 uninstall -y aitemplate
  cd python
  rm -rf dist build
  python3 setup.py bdist_wheel
  pip3 install dist/*.whl
  cd examples/03_bert/
  python3 -m pip install transformers click torch
  HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py

I then waited until BS 64 and SEQ 384 were complete

[16:52:02] model_container.cpp:477: Benchmark runtime ms/iter: 48.4714
[16:52:07] model_container.cpp:477: Benchmark runtime ms/iter: 48.5939
[16:52:12] model_container.cpp:477: Benchmark runtime ms/iter: 48.4655
batch_size: 64, seq_length: 384, latency: 48.52039623260498
output_0 shape: [128, 384, 768]
2023-01-20 16:52:12,326 INFO <aitemplate.backend.target> Loading profile cache from: /root/.aitemplate/rocm.db

I then ran HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py --batch-size 1 --seq-length 384 --encoders-only False

Failed with

make: Leaving directory '/dockerx/code/latestait/examples/03_bert/tmp/BERT_fast_gelu_1_64'
make: Entering directory '/dockerx/code/latestait/examples/03_bert/tmp/BERT_fast_gelu_1_64'
hipcc -O3 -fPIC -fvisibility=hidden -std=c++17 -w -DCK_TIME_KERNEL=0 -Xclang -mlink-builtin-bitcode -Xclang /opt/rocm/amdgcn/bitcode/oclc_abi_version_400.bc -DCK_AMD_GPU_GFX90A --amdgpu-target=gfx90a -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel/include/ -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel/external/include/half/ -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel/library/include/ -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel/profiler/include/ -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/../static/include -L/opt/rocm/rocrand/lib/ -lrocrand -DNDEBUG -x hip -c -o bmm_softmax_bmm_permute_8.obj bmm_softmax_bmm_permute_8.cpp
make: Leaving directory '/dockerx/code/latestait/examples/03_bert/tmp/BERT_fast_gelu_1_64'

make stderr: bert_embeddings_0.cpp:21:159: error: template argument for template type parameter must be a type
  auto device_instance = ck::tensor_operation::device::DeviceSparseEmbeddingsForwardLayernorm<ck::half_t, int64_t, ck::half_t, ck::half_t, float, ck::half_t, 256, 1, 256, 1, EMBEDDING_DIM, 1, 1, 3>{};
                                                                                                                                                              ^~~
/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_sparse_embeddings_forward_layernorm.hpp:27:20: note: template parameter is declared here
          typename EmbElementwiseOperation,
                   ^
1 error generated when compiling for gfx90a.
make: *** [Makefile:9: bert_embeddings_0.obj] Error 1
make: *** Waiting for unfinished jobs....

2023-01-20 16:54:23,870 INFO <aitemplate.compiler.compiler> compiled the final .so file elapsed time: 0:00:37.271055
Traceback (most recent call last):
  File "benchmark_ait.py", line 354, in <module>
    compile_and_benchmark()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "benchmark_ait.py", line 340, in compile_and_benchmark
    mod = compile_module(
  File "benchmark_ait.py", line 224, in compile_module
    mod = compile_model(y, target, "./tmp", model_name)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/compiler.py", line 260, in compile_model
    module = Model(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 227, in __init__
    self.DLL = self._DLLWrapper(lib_path, num_runtimes, allocator_kind)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 166, in __init__
    self.DLL = ctypes.cdll.LoadLibrary(lib_path)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 451, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: ./tmp/BERT_fast_gelu_1_64/test.so: cannot open shared object file: No such file or directory
Exception ignored in: <function Model.__del__ at 0x7f2ea7e3e790>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 257, in __del__
    self.close()
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 261, in close
    for ptr in list(self._allocated_ait_data):
AttributeError: 'Model' object has no attribute '_allocated_ait_data'

I then tried

HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py --encoders-only False

and it fails in the same way
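
The final OSError in the log above appears to be a downstream symptom: bert_embeddings_0.cpp did not compile, so test.so was never linked. When triaging logs like this, it can help to pull out the first compiler diagnostic instead of reading the Python traceback. The following is a minimal sketch; the helper name first_compile_error and the sample_log variable are made up for illustration, and the regular expression only assumes the "file:line:col: error: message" layout seen in this log.

```python
import re

# Matches compiler diagnostics shaped like "file.cpp:LINE:COL: error: message",
# optionally prefixed by the "make stderr: " label that appears in this log.
ERROR_RE = re.compile(
    r"^(?:make stderr: )?(?P<file>[^\s:]+):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.*)$"
)


def first_compile_error(log_text):
    """Return (file, line, col, message) for the first compiler error, or None."""
    for raw in log_text.splitlines():
        match = ERROR_RE.match(raw.strip())
        if match:
            return (
                match["file"],
                int(match["line"]),
                int(match["col"]),
                match["msg"],
            )
    return None


# A few lines copied from the log in this issue.
sample_log = """\
make stderr: bert_embeddings_0.cpp:21:159: error: template argument for template type parameter must be a type
1 error generated when compiling for gfx90a.
make: *** [Makefile:9: bert_embeddings_0.obj] Error 1
OSError: ./tmp/BERT_fast_gelu_1_64/test.so: cannot open shared object file: No such file or directory
"""

print(first_compile_error(sample_log))
```

Run against the sample, this reports the template-argument error in bert_embeddings_0.cpp, which is the line worth quoting when reporting the failure.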

illsilin commented 1 year ago

I apologize for keeping somewhat outdated installation and running instructions. Here is the best way to set up and run your tests:

1) Pull and launch the dedicated rocm/AIT docker:

  docker pull rocm/composable_kernel:ait_rocm5.3
  alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v ~/dockerx:/dockerx'
  drun rocm/composable_kernel:ait_rocm5.3

2) Inside the docker, update the rocm/AIT code to the latest version:

  rm -rf AITemplate
  git clone --recursive https://github.com/ROCmSoftwarePlatform/AITemplate.git

3) Refresh the installation:

  cd AITemplate/python/
  pip3 uninstall -y aitemplate
  python3 setup.py bdist_wheel
  pip3 install dist/*.whl

4) Run BERT:

  cd ../examples/03_bert/
  HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py

After following all of these steps I can confirm that the HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py --batch-size 1 --seq-length 384 model runs fine, but the HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py --batch-size 1 --seq-length 384 --encoders-only False model throws the following error:

Traceback (most recent call last):
  File "benchmark_ait.py", line 354, in <module>
    compile_and_benchmark()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "benchmark_ait.py", line 323, in compile_and_benchmark
    benchmark(batch_size, seq_length, hidden_size, mod, graph_mode, encoders_only)
  File "benchmark_ait.py", line 171, in benchmark
    t, _, __ = mod.benchmark_with_tensors(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 605, in benchmark_with_tensors
    mean, std, ait_outputs = self.benchmark(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 550, in benchmark
    inputs = self._dict_to_ordered_list(inputs, is_inputs=True)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 322, in _dict_to_ordered_list
    raise ValueError(
ValueError: Did not get correct number of inputs expected 1, got 3

We will try to have this fixed before the end of the month.
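
The ValueError above is consistent with a module that declares a single input being handed the three tensors of the full model. The sketch below shows that kind of input-count check in isolation. It is not AITemplate's actual implementation: the function is a simplified stand-in, and the input names (input, input_ids, token_type_ids, position_ids) are assumptions chosen for illustration.

```python
def dict_to_ordered_list(named_inputs, expected_names):
    """Order inputs by the names the compiled module declares.

    Simplified stand-in for a runtime input check; raises ValueError when
    the caller supplies a different number of tensors than the module expects.
    """
    if len(named_inputs) != len(expected_names):
        raise ValueError(
            "Did not get correct number of inputs expected "
            f"{len(expected_names)}, got {len(named_inputs)}"
        )
    return [named_inputs[name] for name in expected_names]


# Hypothetical: a module built in encoders-only mode declares one input.
compiled_for = ["input"]

# Hypothetical: a full-model run supplies three tensors (placeholders here).
supplied = {"input_ids": 0, "token_type_ids": 1, "position_ids": 2}

try:
    dict_to_ordered_list(supplied, compiled_for)
except ValueError as err:
    print(err)
```

The point of the sketch is that the check compares what the caller passes against what the already-compiled module declares, so a mismatch between build-time and run-time options surfaces only when the benchmark runs.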

fsx950223 commented 1 year ago

Fixed in https://github.com/ROCmSoftwarePlatform/AITemplate/tree/merge_upstream

causten commented 1 year ago

Not fixed. I used the "merge_upstream" branch and it fails...

HIP_VISIBLE_DEVICES=0 python3 benchmark_ait.py --encoders-only False

make: Entering directory '/dockerx/AITemplate/examples/03_bert/tmp/BERT_fast_gelu_1_64'
hipcc -O3 -fPIC -fvisibility=hidden -std=c++17 -w -DCK_TIME_KERNEL=0 -Xclang -mlink-builtin-bitcode -Xclang /opt/rocm/amdgcn/bitcode/oclc_abi_version_400.bc -DCK_AMD_GPU_GFX90A --amdgpu-target=gfx90a -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel/include/ -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel/external/include/half/ -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel/library/include/ -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel/profiler/include/ -I/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/../static/include -L/opt/rocm/rocrand/lib/ -lrocrand -DNDEBUG -x hip -c -o bmm_softmax_bmm_permute_8.obj bmm_softmax_bmm_permute_8.cpp
make: Leaving directory '/dockerx/AITemplate/examples/03_bert/tmp/BERT_fast_gelu_1_64'

make stderr: bert_embeddings_0.cpp:21:159: error: template argument for template type parameter must be a type
  auto device_instance = ck::tensor_operation::device::DeviceSparseEmbeddingsForwardLayernorm<ck::half_t, int64_t, ck::half_t, ck::half_t, float, ck::half_t, 256, 1, 256, 1, EMBEDDING_DIM, 1, 1, 3>{};
                                                                                                                                                              ^~~
/usr/local/lib/python3.8/dist-packages/aitemplate/3rdparty/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_sparse_embeddings_forward_layernorm.hpp:27:20: note: template parameter is declared here
          typename EmbElementwiseOperation,
                   ^
1 error generated when compiling for gfx90a.
make: *** [Makefile:9: bert_embeddings_0.obj] Error 1
make: *** Waiting for unfinished jobs....

2023-02-02 17:08:49,401 INFO <aitemplate.compiler.compiler> compiled the final .so file elapsed time: 0:00:36.827257
Traceback (most recent call last):
  File "benchmark_ait.py", line 354, in <module>
    compile_and_benchmark()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "benchmark_ait.py", line 340, in compile_and_benchmark
    mod = compile_module(
  File "benchmark_ait.py", line 224, in compile_module
    mod = compile_model(y, target, "./tmp", model_name)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/compiler.py", line 260, in compile_model
    module = Model(
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 227, in __init__
    self.DLL = self._DLLWrapper(lib_path, num_runtimes, allocator_kind)
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 166, in __init__
    self.DLL = ctypes.cdll.LoadLibrary(lib_path)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 451, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: ./tmp/BERT_fast_gelu_1_64/test.so: cannot open shared object file: No such file or directory
Exception ignored in: <function Model.__del__ at 0x7efce696e0d0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 257, in __del__
    self.close()
  File "/usr/local/lib/python3.8/dist-packages/aitemplate/compiler/model.py", line 261, in close
    for ptr in list(self._allocated_ait_data):
AttributeError: 'Model' object has no attribute '_allocated_ait_data'
root@zt-dh170-13:/dockerx/AITemplate/examples/03_bert#

I was using commit...

commit 2eaed6cd171eaf4c8aeec931e74bb8bfb21cbe24 (HEAD -> merge_upstream, origin/merge_upstream)
Author: fsx950223 <fsx950223@gmail.com>
Date:   Thu Feb 2 00:16:38 2023 +0800

    fix a bug

fsx950223 commented 1 year ago

Could you run it in a new environment? https://github.com/ROCmSoftwarePlatform/AITemplate/blob/merge_upstream/python/aitemplate/backend/rocm/embedding/bert_embeddings.py#L44

carlushuang commented 1 year ago

@causten I checked in my local environment and the --encoders-only True flag works. From the log, it looks like the CK version (from 3rdparty) has not been updated. The simplest approach is to start from a clean docker, reinstall AIT from scratch, and then run the test. If you already have an AIT repo cloned, make sure to run git submodule update after updating it, so that all submodules are at the corresponding version. Also, if any AIT code has changed, run rm -rf ~/.aitemplate to clear the cache, in case the previous and current AIT versions differ.
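
The two cleanup steps described above can be scripted. The following is a minimal sketch, assuming the cache lives at ~/.aitemplate (as stated in the comment) and that git is on the PATH; the function names are made up for illustration, and --init --recursive are standard git submodule flags added here so that nested submodules are covered too.

```python
import shutil
import subprocess
from pathlib import Path


def clear_profile_cache(cache_dir=Path.home() / ".aitemplate"):
    """Delete the cache directory if it exists; return True if it was removed."""
    cache_dir = Path(cache_dir)
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False


def update_submodules(repo_dir):
    """Check out every submodule at the commit pinned by the parent repository."""
    subprocess.run(
        ["git", "submodule", "update", "--init", "--recursive"],
        cwd=repo_dir,
        check=True,  # raise CalledProcessError if git reports a failure
    )


# Example usage (paths are assumptions; adjust to the actual checkout):
#   update_submodules("/dockerx/AITemplate")
#   clear_profile_cache()
```

After both steps, the wheel still has to be rebuilt and reinstalled so that the updated 3rdparty sources are the ones copied into dist-packages.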

causten commented 1 year ago

It's "False" that I need; it's True by default. But I'll start over: delete everything and re-pull rocm/composable_kernel:ait_rocm5.3.

carlushuang commented 1 year ago

@causten I see. One more thing: if you need --encoders-only False, make sure to pass the flag both when building the model and when running it afterward. For example, compile the model with

  HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 benchmark_ait.py --encoders-only False

and then run

  python3 benchmark_ait.py --batch-size 1 --seq-length 384 --encoders-only False
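
One way to avoid a build/run mismatch is to derive both command lines from a single setting. This is a minimal sketch; bert_commands is a hypothetical helper, and it only assembles the argument lists using the flags quoted in this thread (--encoders-only, --batch-size, --seq-length). It does not launch anything.

```python
def bert_commands(encoders_only, batch_size=1, seq_length=384):
    """Build the compile and run argument lists with the same --encoders-only value."""
    flag = ["--encoders-only", str(encoders_only)]
    compile_cmd = ["python3", "benchmark_ait.py", *flag]
    run_cmd = [
        "python3",
        "benchmark_ait.py",
        "--batch-size",
        str(batch_size),
        "--seq-length",
        str(seq_length),
        *flag,
    ]
    return compile_cmd, run_cmd


compile_cmd, run_cmd = bert_commands(encoders_only=False)
print(" ".join(compile_cmd))
print(" ".join(run_cmd))
```

The resulting lists can be passed to subprocess.run from inside examples/03_bert, with HIP_VISIBLE_DEVICES set in the environment as in the commands above.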

causten commented 1 year ago

It works. Thanks