poedator closed this pull request 12 months ago
Great work @poedator, this seems good.
According to the experiments you provided, the results with and without offloading match the paper's perplexity (see picture below). Turning on activation offloading costs a small slowdown, but the memory savings are significant. In the future we could make activation offloading the default behavior, though I don't insist on it (it does increase RAM and CPU usage). With this, people can comfortably quantize a 65B model on a 24GB 3090, and a 30B model on a 1080 Ti (needs to be checked).
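For context, a minimal sketch of what activation offloading looks like in PyTorch — this is just an illustration of the general technique using the built-in `torch.autograd.graph.save_on_cpu` hook, not necessarily how this PR implements it:

```python
import torch

model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8, requires_grad=True)

# Inside this context, tensors saved for backward are moved to CPU,
# freeing accelerator memory; they are copied back during backward().
with torch.autograd.graph.save_on_cpu():
    y = model(x).pow(2).sum()

y.backward()  # activations are restored transparently here
```

The trade-off is exactly what the numbers show: extra host<->device copies slow each step down a bit, but peak GPU memory drops because activations no longer sit on the device between forward and backward.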
As for the PR:
- improving offloading in `quantize()`
- adding offloading in `eval()`
- tested in Nirvana