Sorry for the late reply! You mentioned that the operator is embedded into the llama model. How did you embed it, and which llama model did you try? There are two sets of llama model definitions in mlc-llm: the original version and the new `nn.Module` version, and the latter is still in progress.
@cyx-6 We embedded the attention operator in the original version; the calling method is as follows:
```python
attn_module = get_extern_module(
    shape_q, shape_k, shape_v,
    (bsz, q_len, self.num_query_heads, self.head_dim),
)
attn_output = attn_module(query_states, key_states, value_states)
```
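(For readers following along: `get_extern_module` is the poster's own helper and its body is not shown in the thread. Below is a minimal sketch of what such a wrapper could look like in the original `tvm.relax.testing`-based model, assuming the external kernel is exposed as a destination-passing-style PackedFunc named `"attention"`; the helper name, argument layout, and dtype are assumptions, not the poster's actual code.)

```python
from tvm import relax
from tvm.relax.testing import nn


def get_extern_module(shape_q, shape_k, shape_v, shape_out, dtype="float16"):
    """Hypothetical wrapper that emits a call to an external PackedFunc.

    The PackedFunc "attention" is assumed to follow destination-passing
    style, writing its result into a preallocated buffer of shape_out.
    """

    def apply(q, k, v):
        # call_dps_packed calls the kernel by name; the runtime must be able
        # to resolve that name when the compiled module is executed.
        return nn.emit(
            relax.op.call_dps_packed(
                "attention",
                [q, k, v],
                out_sinfo=relax.TensorStructInfo(shape_out, dtype),
            )
        )

    return apply
```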
I see. Actually, we cannot apply `ExternModule` to the original model. `ExternModule` is part of the new `nn.Module` interface, from `tvm.relax.frontend.nn`, but the original llama model is built with `tvm.relax.testing`, which is a different interface and not compatible with the new `nn.Module` one.
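(To make the distinction concrete, here is a minimal sketch of the new interface that `ExternModule` plugs into. The toy model and spec below are illustrative, not mlc-llm's actual llama definition.)

```python
from tvm.relax.frontend import nn
from tvm.relax.frontend.nn import spec


class TinyModel(nn.Module):
    """Toy module written against the new tvm.relax.frontend.nn interface."""

    def forward(self, x: nn.Tensor) -> nn.Tensor:
        # nn.op provides the operator surface for this interface.
        return nn.op.add(x, x)


# export_tvm lowers the nn.Module into a Relax IRModule plus parameters;
# models written this way are the ones ExternModule can be attached to.
ir_mod, params = TinyModel().export_tvm(
    spec={"forward": {"x": spec.Tensor((1, 16), "float32")}}
)
```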
@cyx-6 I get it, thanks a lot.
This question is related to https://github.com/apache/tvm/pull/15487.
I tried to embed AMD's attention operator directly in the llama model. The model compiled normally, but I encountered a runtime error: `Cannot find PackedFunc attention in either Relax VM kernel library`.
The operator file is as follows:
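(The operator file itself did not survive in this thread. As background on the error: the Relax VM looks up `attention` by name, in the compiled kernel library and in the global PackedFunc registry, so the message means neither contains that symbol. One hedged way to verify the lookup path during debugging is to register a placeholder PackedFunc from Python; the stub below is purely illustrative and computes nothing.)

```python
import tvm


# Illustrative stand-in only: registers a global PackedFunc named "attention"
# so the Relax VM's by-name lookup can succeed. A real deployment would link
# the actual kernel object instead of this placeholder.
@tvm.register_func("attention", override=True)
def attention_placeholder(q, k, v, out):
    # Destination-passing style: `out` is the preallocated output buffer.
    # Deliberately left unfilled; this stub only checks that the VM can
    # resolve the symbol, it does not compute attention.
    pass
```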
@cyx-6 thanks a lot for your work, looking forward to your reply
cc @quic-sanirudh