poedator closed this pull request 12 months ago
Great work @poedator, this seems good.
According to the experiments you provided, the results with and without offloading match the paper's perplexity (see picture below). Turning on activation offloading costs a small slowdown, but the memory savings are significant. In the future we could make activation offloading the default behavior, though I don't insist on it (it does increase RAM and CPU usage). With this, people can comfortably quantize a 65B model on a 24GB 3090, and a 30B model on a 1080 Ti (needs to be checked).
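For context, a minimal sketch of what activation offloading looks like in PyTorch — this is just an illustration of the general technique using the built-in `torch.autograd.graph.save_on_cpu` hook, not necessarily how this PR implements it:

```python
import torch

model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8, requires_grad=True)

# Inside this context, tensors saved for backward are moved to CPU,
# freeing accelerator memory; they are copied back during backward().
with torch.autograd.graph.save_on_cpu():
    y = model(x).pow(2).sum()

y.backward()  # activations are restored transparently here
```

The trade-off is exactly what the numbers show: extra host<->device copies slow each step down a bit, but peak GPU memory drops because activations no longer sit on the device between forward and backward.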
As for the PR:
- improving offloading in `quantize()`
- adding offloading in `eval()`
- tested in Nirvana