Out-of-memory during weight download and conversion

FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.

Apache License 2.0

Out-of-memory during weight download and conversion #11

Closed xloem closed 1 year ago

xloem commented 1 year ago

I’m on a system hard-limited to 40GB of CPU RAM + swap.

When I try to load opt-30b, the process is killed due to memory exhaustion. If I load the model manually using device_map="auto", offload_folder="offload", the load succeeds.

Is there a way to pass these flags manually or otherwise accommodate ram limits during initial loading?
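For readers who want to reproduce the manual load mentioned above, here is a minimal sketch. It assumes the Hugging Face transformers library with accelerate installed, and it assumes the Hub model ID is "facebook/opt-30b" (the thread only says opt-30b). The helper names are hypothetical and are not part of FlexGen.

```python
import os


def low_memory_load_kwargs(offload_dir="offload"):
    """Keyword arguments from the thread that let the load spill to disk."""
    return {"device_map": "auto", "offload_folder": offload_dir}


def load_model_low_memory(model_name="facebook/opt-30b", offload_dir="offload"):
    """Load a causal LM while offloading weights that do not fit in GPU/CPU RAM.

    model_name is an assumed Hub ID; adjust it to the checkpoint you use.
    """
    # Imported lazily so the helper above is usable without transformers installed.
    from transformers import AutoModelForCausalLM

    os.makedirs(offload_dir, exist_ok=True)
    return AutoModelForCausalLM.from_pretrained(
        model_name, **low_memory_load_kwargs(offload_dir)
    )
```

Calling load_model_low_memory() downloads the checkpoint on first use, so it needs enough disk space for both the cached weights and the offload folder.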

Ying1123 commented 1 year ago

Do you know which line the out-of-memory happens at?

xloem commented 1 year ago

It happens while loading shards from disk inside Model.from_pretrained() on https://github.com/FMInference/FlexGen/blob/0342e2a0e93593b2c11f84be0e9f5d5bcb73e598/flexgen/opt_config.py#L146 .

freedmand commented 1 year ago

I'm also encountering these errors. @xloem were you able to modify the code to get it to work?

xloem commented 1 year ago

I modified that function to pass the kwargs I mentioned, and also called it manually before anything else was constructed so that more RAM was free. That got farther, but I hit a later crash that I haven't looked into yet.

EDIT: It looks like the second crash is because the policy needs changing for the model and system I'm using, and the addition of the kwargs does get past this issue. I personally also added code to wipe the transformers.utils.TRANSFORMERS_CACHE after the initial download if not enough disk space remained available.
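The cache cleanup described above could look roughly like the sketch below. This is not the code from the thread: the function name and the min_free_bytes threshold are hypothetical, and the cache directory is passed in explicitly rather than read from transformers.utils.TRANSFORMERS_CACHE, since that constant may not be available in every transformers version.

```python
import os
import shutil


def clear_cache_if_low_disk(cache_dir, min_free_bytes):
    """Delete cache_dir when free disk space falls below min_free_bytes.

    Returns True if the directory was removed, False otherwise.
    """
    if not os.path.isdir(cache_dir):
        return False
    # Free space on the filesystem that holds the cache directory.
    free = shutil.disk_usage(cache_dir).free
    if free >= min_free_bytes:
        return False
    shutil.rmtree(cache_dir)
    return True
```

Only run something like this after the weights have been converted and saved elsewhere, because removing the cache forces a fresh download the next time the original checkpoint is needed.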

james9001 commented 1 year ago

In case anyone else finds this thread and is in a similar situation to me (with the opt-13b model, using a 1080Ti with 11GB VRAM + 32GB CPU RAM, with a 2GB swap file, but unlike OP, able to enlarge it) - try enlarging your swapfile. I created a 16GB swapfile and now it works.
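To judge whether more swap is likely to help, one rough check is to compare RAM + swap against the size of the weights. The sketch below assumes a Linux /proc/meminfo layout and about 2 bytes per parameter (fp16); both function names are hypothetical. Peak memory during loading and conversion can exceed the raw weight size, so treat the estimate as a lower bound.

```python
def total_ram_plus_swap_gib(meminfo_text):
    """Sum MemTotal and SwapTotal (reported in kB) from /proc/meminfo-style text."""
    totals_kib = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemTotal", "SwapTotal"):
            totals_kib[key] = int(rest.split()[0])
    return sum(totals_kib.values()) / (1024 ** 2)


def estimated_weight_gib(num_params, bytes_per_param=2):
    """Approximate size of the raw weights, assuming fp16 by default."""
    return num_params * bytes_per_param / (1024 ** 3)


# Example usage on Linux (13e9 is an approximate parameter count for opt-13b):
#   with open("/proc/meminfo") as f:
#       available = total_ram_plus_swap_gib(f.read())
#   print(available, estimated_weight_gib(13e9))
```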

xloem commented 1 year ago

I’m observing this issue was closed without change or explanation and am guessing maybe it is out of scope for now or would need the changes introduced as a PR.

Ying1123 commented 1 year ago

> I’m observing this issue was closed without change or explanation and am guessing maybe it is out of scope for now or would need the changes introduced as a PR.

Sorry, I misread the thread and thought the problem had been resolved. I have reopened it and will fix it soon.

Ying1123 commented 1 year ago

@xloem This should be fixed by #69. It is merged into the main branch. Could you try it now?

xloem commented 1 year ago

By inspection, it looks like you’ve resolved the issue. I might delete the .bin file after conversion to save disk space; maybe you already do and I missed it.

I tried to pull the changes but it looks like there’s been a force push and the new tip doesn’t merge with my old checkout. My test code doesn’t quite run yet against the new codebase but I’ll keep in mind you fixed this.

Thank you.