Closed by Animaxx 2 months ago
There is a maximum buffer size with Metal, and the compute buffer cannot be split into smaller buffers. You can try using a smaller ubatch size to reduce the size of the compute buffer (e.g. `-ub 64`). The flash attention implementation merged today should also help reduce the size of the compute buffer (add `-fa` to the command line to enable it). But ultimately, it seems that you are exceeding by far the amount of memory that can be allocated by Metal on this system even without the compute buffer (that is what the `8049.59 / 5461.34` in the log means), and the only solution may be to reduce the context size.
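Put together, a hypothetical invocation of the llama.cpp `main` example with both mitigations might look like this (the model path and the `-c` value are placeholders, not taken from the issue):

```shell
# Sketch only: model path and context size are placeholders.
./main -m models/7b-q4_0.gguf \
  -c 8192 \   # reduce the context size
  -ub 64 \    # smaller ubatch size -> smaller compute buffer
  -fa         # enable the flash attention implementation
```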
I'm trying to load a 7B model on an iPhone 15 Pro. Since the model supports 32k context, I set `llama_context_params.n_ctx` to 32k, but it crashes with this error:
```
-[MTLDebugDevice newBufferWithBytesNoCopy:length:options:deallocator:]:700: failed assertion `Buffer Validation newBufferWith*:length 0x86004000 must not exceed 2048 MB.
```
I observed that even setting it to a smaller value like 16k hits the same assertion.
Here is the full log:
In ggml-alloc, the new size

```c
size_t new_size = ggml_dyn_tallocr_max_size(galloc->buf_tallocs[i]);
```

is 2248163328 bytes, but the iPhone 15 Pro should have 8 GB of RAM, and I have enabled the Increased Memory Limit entitlement in the project settings.