BuildBackBuehler opened 6 months ago
Edit: Forgot to mention that everything below was written after a sleepless night. Looking back at an untouched version now, I'm not sure I'm describing all of it correctly, because the issue originated from Python being unable to find/reconcile the definition of "org_module", and that may have been a problem with using their torch.nn (tvm.nn). I'll update when I have something more concrete.
Alright, so I have been struggling with this implementation... I've definitely spent over a day trying to make this work within MLC.
Currently I cannot reconcile int_llama_layer due to "ori_layer". I think I'm missing something, or there's a small error in it. There's (A) OmniLayerNorm and (B) OmniLlamaRMSNorm: A takes "ori_layer_norm" while B takes "ori_norm". At the end of the day, "ori_layer" equates to "input_layernorm" for both A and B(?).
My presumption is that "ori_layer" ~= "ori_layer_norm": one is supposed to do an unquantized run with LlamaAttention/DecoderLayer/OmniMLP and A (OmniLayerNorm), with the result of OmniLayerNorm then labeled as "ori_norm" so you can plug it straight into B (OmniLlamaRMSNorm) no problem. The other arguments are self-explanatory.
Overall, it is a bit chaotic, but I get the appeal of a highly optimized, condensed version given the nature of the project. I just wish the DecoderLayer were spaced out a tad more so it was easier to make sense of the super().__init__().
For example, if what I'll refer to as QLA (QuantLlamaAttention) later on:

QuantLlamaAttention(
    org_module=ori_layer.self_attn,
    config=config,
    args=args,
)
was reconciled before the QLDL (QuantLlamaDecoderLayer) __init__ (along with the ones for QLMLP & OLRMSnorm), then it would be a lot easier to follow:
So prior to QLDL's __init__ you'd declare:

ori_llama_attention = QLA(all the args, but no wrapping)
ori_MLP = QLMLP(all the args, but no wrapping)
ori_rms_norm = OLN(all the args, but no wrapping)
Then you'd just be looking at
class QuantLlamaDecoderLayer(nn.Module):
    def __init__(self, ...):
        ...
        self.self_attn = ori_llama_attention
        self.mlp = ori_MLP
        self.input_layernorm = ori_rms_norm
        self.post_attention_layernorm = ori_OLRMSnorm

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        ...
which looks a lot cleaner/is 100x easier to follow IMO!
I probably screwed up somewhere, but hopefully you get the idea. I doubt anyone can help me at this point, but it'd be greatly appreciated if you do end up seeing this in the next 12-36 hours and have the expertise.
Mm, yeah. Well, I don't know the ins and outs of MLC, but I figured it would be beneficial to integrate OmniQuant into its ecosystem. I still think that is true, because I'd prefer to use TVM's implementation of 1D/2D/3D-type calculations, i.e. their tvm.nn (+ others), as it is more platform-agnostic. I don't know how it works with OmniQuant pre-quantized models that are simply dropped in via compile, but the GitHub issues seem to indicate it's not seamless. I'd originally figured integration would be like hooking RGBY cables into an old CRT, but with the second-pass quantization it is complicated, simply because they separate unquantized from quantized.
Trying to make it work with their Extern/QuantizeMapping and the args/config sharing is the crux. It would probably be an easy integration if I simply created copies of the main files to act as iterative buffers. I've also created a script that takes the quantization's config args (i.e., "w4a16g128"'s wbits, abits, etc.) and wraps them in its own class. I hope to use that as a hub/buffer before passing the args along to the OmniQuant/quantizer scripts, and thus give those plus the int_llama_layer/"prequant" files access to the args on each iteration. I don't know if this is the best resolution, because this is 10 pay grades above my rank.
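A minimal sketch of that "config hub" idea, assuming an MLC-style quantization name such as "w4a16g128"; the class and function names below are purely illustrative, not MLC's or OmniQuant's API:

import re
from dataclasses import dataclass

@dataclass
class QuantHubArgs:      # hypothetical container, not from either repo
    wbits: int           # weight bit-width
    abits: int           # activation bit-width
    group_size: int      # quantization group size (0 = per-channel)

def parse_quant_name(name: str) -> QuantHubArgs:
    """Parse an MLC-style name like 'w4a16g128' into wbits/abits/group_size."""
    m = re.fullmatch(r"w(\d+)a(\d+)(?:g(\d+))?", name)
    if m is None:
        raise ValueError(f"unrecognized quantization name: {name}")
    wbits, abits, group = m.groups()
    return QuantHubArgs(int(wbits), int(abits), int(group) if group else 0)

# e.g. parse_quant_name("w4a16g128") -> QuantHubArgs(wbits=4, abits=16, group_size=128)
# The resulting object could then be shared across the OmniQuant/quantizer scripts
# and the int_llama_layer wrappers instead of threading loose args through each call.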
Hi,
Sorry for the late response. I have been very busy recently.
To integrate OmniQuant into MLC, you just save the fake-quantized model.
Then you can quantize the fake-quantized model with MLC-LLM just like a normal LLaMA model.
More detailed instructions can be found at https://llm-tracker.info/howto/OmniQuant.
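Concretely, a minimal sketch of that flow (the paths, the model name, and the assumption that OmniQuant writes its fake-quantized weights back into a standard transformers LLaMA model are an illustration, not the repo's exact script):

from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"            # placeholder base model
out_dir = "./llama2-7b-omniquant-fakequant"  # placeholder output directory

model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# ... run OmniQuant here so that the weights inside `model` become fake-quantized ...

model.save_pretrained(out_dir)       # ordinary HF checkpoint whose weights carry the quantized values
tokenizer.save_pretrained(out_dir)

# MLC-LLM then treats `out_dir` like any other LLaMA checkpoint for its own
# weight-conversion / compile ("real quantization") step.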
Hi, so as I've come to find out, it appears to be a bit more complicated than the .ipynb makes it seem. As far as I can tell, one must fold the OmniQuant files into MLC before compiling the MLC package (which is easy-peasy).
Really I'm just wondering where "real quantization" with AGPTQ falls in this framework. As I primarily rely on my MacBook (Apple Silicon/Metal/MPS, however you refer to it), I'm not even sure it's feasible to use AGPTQ at all. While it may not be impossible, it would definitely require a lot of work, a full-on conversion project with tinygrad or something of the like.
So I suppose I'm wondering if there is any equivalent in MLC, or some third-party equivalent, for this form of pre-quantization (at least I guess that's what it's referred to as?) that works on the Metal platform. I figure the alternative would be to go on HF and download a pre-quantized AGPTQ version of my desired model; it appears MLC can then convert those weights into its own format, and then OmniQuant can be used for the quantization/compilation. Do I have that all correct?
In any case, I do have an RTX 3070/Tesla M40... I just imagine they wouldn't be up to the AGPTQ conversion task for the likes of Mixtral 8x22B or even Llama-70B.
Thanks!
And totally off-topic, and perhaps not even necessary/additive, but is there an int_mixtral_layer.py to be had/added? Or is it just as well to use the generic int_opt_layer.py? I wish I knew enough to make that judgment myself, but there's only so much time in the day and I just appreciate using the fastest, most efficient solutions (and that is OmniQuant, so thank you for your work, too!)
edit: Looks like a newer format, AWQ, actually outperforms AGPTQ at 4-/3-bit, so it appears I have even more research/work to do.