FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

Offloading to disk does not work (opt-66b) #61

Closed Paethon closed 1 year ago

Paethon commented 1 year ago

I just tried running flexgen.flex_opt with the following command: `python3 -m flexgen.flex_opt --model facebook/opt-66b --percent 0 0 100 0 100 0 --offload-dir ~/tmp/offload/`, but this only fills the CPU memory until the process is killed by the OS, and the folder ~/tmp/offload/ stays completely empty.

System:

Ying1123 commented 1 year ago

Do you know which line the OOM happens at? One possible reason is that it occurs during the weight download and weight conversion, so you hit OOM before FlexGen even starts to work. https://github.com/FMInference/FlexGen/blob/f7b6869f11edc6dc96139fdce4d9521b352c2a6b/flexgen/opt_config.py#L121

In this case, you can rent a machine with more CPU RAM, do the conversion there, and copy the converted weights over.

Or you can implement a memory-efficient version of the weight conversion.
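A memory-efficient conversion could look something like the sketch below: instead of materializing the whole state dict at once, process one checkpoint shard at a time and write each tensor straight to its own file on disk, so peak memory stays near the size of a single shard. This is only an illustration, not FlexGen's actual converter; `load_shard` is a hypothetical stand-in for whatever loads one checkpoint file (e.g. `torch.load` on a shard).

```python
import os
import numpy as np

def convert_state_dict_lazily(shard_paths, out_dir, load_shard):
    """Hypothetical sketch of a memory-efficient weight conversion.

    Processes one shard at a time so only that shard is resident in
    memory; each tensor is written to its own .npy file in out_dir.
    `load_shard` is an assumed callable: path -> dict of name -> array.
    """
    os.makedirs(out_dir, exist_ok=True)
    for path in shard_paths:
        shard = load_shard(path)  # only this shard is in memory
        for name, tensor in shard.items():
            arr = np.asarray(tensor, dtype=np.float16)
            # Dump the tensor to disk immediately, then drop it.
            np.save(os.path.join(out_dir, name + ".npy"), arr)
            del arr
        del shard  # free the shard before loading the next one
```

The per-tensor `.npy` files can later be memory-mapped (`np.load(..., mmap_mode="r")`) instead of read fully into RAM.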

Ying1123 commented 1 year ago

This should be fixed by #69, which has been merged into the main branch. Could you try it now? Also, this is a duplicate of #11.

Paethon commented 1 year ago

Sorry, I was not available for a few days. Going to have a look, thanks! :+1: