Closed: theobjectivedad closed this 3 weeks ago
As a workaround I can pass --user=root to the docker run args, although this obviously isn't good practice. If anyone else uses this workaround, I recommend the following aphrodite-engine options: always use --load-format safetensors and never use --trust-remote-code.
Looking at the contents of /app/aphrodite-engine/.triton:
root@6c985414e457:~/.triton# find .
.
./cache
./cache/0d823c7fdd45a2b9f6ef82f9d235a68d
./cache/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.ttir
./cache/0d823c7fdd45a2b9f6ef82f9d235a68d/__grp___fwd_kernel.json
./cache/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.ttgir
./cache/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.llir
./cache/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.ptx
./cache/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.json
./cache/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.cubin
./cache/e848d61dc399d578c7a9dd3c6ff8a1a1
./cache/e848d61dc399d578c7a9dd3c6ff8a1a1/_fwd_kernel.so
./cache/6e97c2a1f7a095255f6dd5de1807841d
./cache/6e97c2a1f7a095255f6dd5de1807841d/cuda_utils.so
./dump
./dump/0d823c7fdd45a2b9f6ef82f9d235a68d
./dump/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.ttir
./dump/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.ttgir
./dump/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.llir
./dump/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.ptx
./dump/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.cubin
root@6c985414e457:~/.triton# more ./cache/0d823c7fdd45a2b9f6ef82f9d235a68d/_fwd_kernel.json
{"num_warps": 8, "num_ctas": 1, "num_stages": 1, "enable_warp_specialization": false, "enable_persistent": false, "constants": {"14": 1, "31": 1, "35": 1, "36": 128, "37": 128, "38": 128}, "debug": null, "target": {"capability": 86, "num_warps": 8, "enable_fp_fusion": true}, "AMDGCN_ENABLE_DUMP": false, "DISABLE_FAST_REDUCTION": false, "DISABLE_MMA_V3": false, "ENABLE_TMA": false, "LLVM_IR_ENABLE_DUMP": false, "MLIR_ENABLE_DUMP": false, "TRITON_DISABLE_LINE_INFO": false, "device_type": "cuda", "shared": 65538, "name": "_fwd_kernel_0d1d2d3d4d5d67d8d9d10de11e12d1314c15de16de17de18de19de20de21de22de23de24de25de26de27de28de29de30e31c32de33de34de35c", "clusterDims": [1, 1, 1]}
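The listing above shows file names but not who owns them or whether the container user can write to them. A minimal sketch for checking that from inside the container, using only the Python standard library, might look like this. The function names describe_access and find_unwritable are hypothetical helpers, not part of aphrodite-engine or Triton, and the ~/.triton location is taken from the shell prompt above.

```python
import os
import stat


def describe_access(path):
    """Return owner, mode, and write access for `path` as seen by this process."""
    info = os.stat(path)
    return {
        "path": path,
        "owner_uid": info.st_uid,
        "owner_gid": info.st_gid,
        "mode": stat.filemode(info.st_mode),
        "process_uid": os.getuid(),
        # Creating files in a directory needs both write and traverse permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
    }


def find_unwritable(root):
    """Walk `root` and collect directories the current process cannot write to."""
    blocked = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        if not os.access(dirpath, os.W_OK | os.X_OK):
            blocked.append(describe_access(dirpath))
    return blocked


if __name__ == "__main__":
    # Assumption: the home directory is where .triton gets created.
    home = os.path.expanduser("~")
    print(describe_access(home))
    for entry in find_unwritable(os.path.join(home, ".triton")):
        print(entry)
```

Running this as the default container user (rather than with --user=root) should show whether the home directory itself or only some cache subdirectories are the problem.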
That's a weird one - the process should have rw access to the /app/aphrodite-engine directory. I guess we'll need to explicitly modify the perms for the directory in the Dockerfile. Thanks for the report.
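Until the image permissions are fixed, one possible alternative to running as root is to point Triton's kernel cache at a directory the container user can write to. The sketch below assumes the installed Triton version honors the TRITON_CACHE_DIR environment variable and that it is set before the first kernel is compiled. The helper name use_writable_triton_cache is hypothetical, and this does not cover the separate dump directory shown in the listing.

```python
import os
import tempfile


def use_writable_triton_cache(base_dir=None):
    """Set TRITON_CACHE_DIR to a directory the current user can write to.

    Assumption: Triton reads TRITON_CACHE_DIR when it creates its cache
    manager, so this must run before any kernel is compiled.
    """
    if base_dir is None:
        # Per-UID directory under the system temp dir, to avoid clashes between users.
        base_dir = os.path.join(tempfile.gettempdir(), f"triton-cache-{os.getuid()}")
    os.makedirs(base_dir, mode=0o700, exist_ok=True)
    if not os.access(base_dir, os.W_OK | os.X_OK):
        raise PermissionError(f"{base_dir} is not writable by UID {os.getuid()}")
    os.environ["TRITON_CACHE_DIR"] = base_dir
    return base_dir
```

The same effect could be had by setting the variable in the docker run environment instead. Either way, a cache under a temp directory is lost when the container is removed, so kernels would be recompiled on the next start.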
Cool - I'll send over a PR
Your current environment
Note: I am running alpindale/aphrodite-engine:latest as of 2024-05-08. Additionally, here is the docker run command I'm using (triggered via a Makefile):
🐛 Describe the bug
Description
When executing a completion request I'm getting the exception below. I believe the root cause is that the official container runs as UID 1000 while /app/aphrodite-engine is owned by root:root. It seems to happen only under heavy load. I can replicate it with the unquantized version of miqu-1-70b-sf.
Full error message