Same for me. Initially I thought it was a hardware issue, but I've verified that's not the case.
Are you also using --device_type mps? I'm starting to think this may be a Mac problem (a quick check to rule out the MPS backend itself is below).
My Intel work laptop doesn't run it either, but that may be because it doesn't have enough memory.
Are we the only ones seeing this?
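For what it's worth, here's a minimal sketch for ruling out the MPS backend itself (assuming a PyTorch build recent enough to ship Metal support):

```python
import torch

# Confirm the MPS (Metal) backend is compiled in and usable on this machine;
# if either is False, anything requesting the mps device will error out.
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

# Tiny smoke test: one matmul on the Metal device should finish instantly.
if torch.backends.mps.is_available():
    x = torch.ones(1024, 1024, device="mps")
    print((x @ x).sum().item())  # expected: 1073741824.0
```

If that runs quickly, the backend is fine and the slowness is coming from somewhere else.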
@ChettakattuA: If you had waited forever, you wouldn't be here now.
Apart from this, please consider that virtual memory might be a bottleneck. Vicuna-7B-1.1-HF with mps claimed roughly 30 GB of memory during my test on an M2 Mac with 32 GB of RAM. That means that even with no other programs running it needs around 4-5 GB of swap and is therefore barely usable (5-10 minutes per request). The CPU and GPU load both stay below 20% while a request is handled, because memory is the bottleneck. If your computation requires around 10 GB of swap or more (because you have even less than 30 GB of RAM available), it becomes so slow that impatient users will feel like they're waiting "forever".
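If you want to verify that on your own machine, here's a minimal sketch (assuming psutil is installed, e.g. via pip install psutil); run it in a second terminal while a request is being handled:

```python
import time
import psutil

# Poll overall RAM and swap usage once per second while the model handles a
# request; sustained swap growth means memory, not compute, is the bottleneck.
while True:
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(
        f"RAM used: {vm.used / 2**30:5.1f} GiB ({vm.percent}%)  "
        f"swap used: {sw.used / 2**30:5.1f} GiB ({sw.percent}%)"
    )
    time.sleep(1)
```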
@PromtEngineer: I like the answers of the current default model, vicuna-7B-1.1-HF - they are short but concise. The Metal acceleration is welcome. Good work. It's just really unfortunate that many modern Apple Silicon Macs cannot play out their enormous GPU performance due to a lack of RAM. If you have suggestions for an alternative, less memory-intensive model, or if vicuna-7B-1.1-HF can be tuned in this regard, I'd be glad to test them.
@trading-global The model files themselves are only 13 GB. Why would they take up 30 GB of RAM?
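One plausible explanation (an assumption, not something confirmed in this thread): unless you pass torch_dtype, transformers materializes checkpoints in float32, so a 13 GB fp16 checkpoint becomes roughly 26 GB in RAM before activations and caches are counted. A minimal sketch of forcing half precision at load time (the repo id is an assumption about the model being discussed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/vicuna-7B-1.1-HF"  # assumed HF repo id for the model above

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Without torch_dtype, from_pretrained loads weights in float32 (~26 GB for a
# 7B model); float16 keeps them near the 13 GB on-disk size.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,  # avoid holding a second full copy while loading
)
model.to("mps")  # Metal backend on Apple Silicon
```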
Wizard’s quantized Vicuna 7B does a good enough job for me when I'm using chatdocs, with responses in less than 15 seconds.
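For anyone who wants to try a quantized 7B along these lines, here's a minimal sketch with llama-cpp-python (the model path and file name are assumptions; use whichever GGML quantization you actually downloaded):

```python
from llama_cpp import Llama

# Load a 4-bit quantized Vicuna checkpoint; needs ~4 GB of RAM instead of 13-30 GB.
llm = Llama(
    model_path="./models/wizard-vicuna-7B.ggmlv3.q4_0.bin",  # assumed local file
    n_ctx=2048,  # context window size
)

out = llm("Q: What does this project do? A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```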
Hi, I managed to make it run both from the terminal and via the API, but the result takes forever to load. I use the GPU and have just one document in the source dir, but it still doesn't seem to run?