Cornell-RelaxML / quip-sharp

GNU General Public License v3.0

Why don't you do the register_buffer inside the QuantizedLinear() init? #17

Closed vince62s closed 9 months ago

vince62s commented 9 months ago

Instead of doing it here: https://github.com/Cornell-RelaxML/quip-sharp/blob/main/model/llama_nofuse.py#L258-L260 (and the same for self-attention), registering the buffers inside the class would make it much easier to integrate QuantizedLinear into other projects.
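For concreteness, here is a minimal sketch of what the suggestion amounts to: the module registers its own quantized-weight buffers in `__init__` instead of having the surrounding model code call `register_buffer` on it. The buffer names and shapes below are illustrative, not the actual QuIP# layout.

```python
import torch
import torch.nn as nn

class QuantizedLinear(nn.Module):
    """Sketch only: buffers owned by the module itself, so the class is
    self-contained when dropped into another project."""

    def __init__(self, in_features, out_features, codesz=8):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Registered here instead of in the surrounding model code
        # (e.g. llama_nofuse.py); names/shapes are placeholders.
        self.register_buffer(
            'Qidxs',
            torch.zeros(out_features, in_features // codesz, dtype=torch.int16))
        self.register_buffer('SU', torch.ones(in_features))
        self.register_buffer('SV', torch.ones(out_features))
        self.register_buffer('Wscale', torch.ones(()))

    def forward(self, x):
        # Dequantize-and-matmul would go here; omitted in this sketch.
        raise NotImplementedError
```

With the buffers owned by the module, `model.load_state_dict(...)` picks them up automatically and no extra wiring is needed in the host project.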

tsengalb99 commented 9 months ago

We actually did change this internally recently and are planning to merge it in after NeurIPS. I agree it would make it easier to integrate QuIP#.

chu-tianxiang commented 9 months ago

Out of curiosity, why do we need to scale before quantization? I thought it might be related to the Q/K/V fusion, but the unfused layers are also scaled first.

tsengalb99 commented 9 months ago

It's just something we did; you can combine the scales in the unfused layers. The fused layers have a prescale to normalize the individual matrices before they are scaled together, but that's not strictly optimal. The unfused version of Q/K/V gets about 0.02 better perplexity on WikiText but is slower, so we only released the fused versions.
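As a toy illustration of the scale bookkeeping being discussed (a sketch only, using an RMS-style scale; the actual QuIP# prescale computation may differ): in the unfused case any extra per-layer scale can simply be folded into the single weight scale, while in the fused case each sub-matrix is pre-normalized before the shared scale is computed over the concatenated block.

```python
import torch

# Unfused case: one weight matrix, one scale.
# W is quantized as W ≈ Wscale * W_hat, so an extra per-layer scale
# can just be multiplied into Wscale.
W = torch.randn(64, 64)
Wscale = W.square().mean().sqrt()
W_normalized = W / Wscale          # this is what gets quantized

# Fused case: Q, K, V are concatenated and quantized as one block,
# so each sub-matrix is first prescaled to a common magnitude
# before the shared scale of the fused block is computed.
Q, K, V = torch.randn(64, 64), torch.randn(64, 64), torch.randn(64, 64)
prescales = [m.square().mean().sqrt() for m in (Q, K, V)]
fused = torch.cat([m / s for m, s in zip((Q, K, V), prescales)], dim=0)
fused_scale = fused.square().mean().sqrt()   # ~1 after prescaling
```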