ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Support CoreML like whisper.cpp? #1714

Open · realcarlos opened 1 year ago

realcarlos commented 1 year ago

I have tried whisper.cpp on my iPhone and it runs very fast, so I wonder if it is possible for llama.cpp to support CoreML as well. Thank you.

ggerganov commented 1 year ago

I think we might be able to offload the prompt processing / perplexity calculation to CoreML in a similar way to what we did with the Whisper Encoder. It should be ~3x faster compared to the current implementation.

We probably won't be able to do the same for text generation, though.
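
For illustration, here is a minimal sketch of what the whisper.cpp-style conversion step could look like with PyTorch + coremltools. The `PromptEncoder` module is a hypothetical stand-in for the transformer layers that would handle prompt processing; llama.cpp has no such class, so treat this as the shape of the approach, not an implementation.

```python
# Minimal sketch (not llama.cpp code): convert a fixed-shape "prompt
# encoder" to a Core ML package, the way whisper.cpp converts the
# Whisper encoder. PromptEncoder is a hypothetical placeholder.
import torch
import coremltools as ct

class PromptEncoder(torch.nn.Module):
    def __init__(self, d_model=4096):
        super().__init__()
        self.proj = torch.nn.Linear(d_model, d_model)  # placeholder layer

    def forward(self, x):
        return self.proj(x)

model = PromptEncoder().eval()
example = torch.randn(1, 512, 4096)  # fixed (batch, n_ctx, d_model)
traced = torch.jit.trace(model, example)

# F16 weights + ComputeUnit.ALL lets Core ML schedule supported ops on the ANE.
mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="embeddings", shape=example.shape)],
    compute_precision=ct.precision.FLOAT16,
    compute_units=ct.ComputeUnit.ALL,
)
mlmodel.save("prompt_encoder.mlpackage")
```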

pehch commented 1 year ago

Hey @ggerganov! With all the recent advancements in increasing context length, do you think it is worthwhile to add CoreML offloading for prompt processing to the new roadmap?

vimpunk commented 1 year ago

@ggerganov, is there any update on this?

ggerganov commented 1 year ago

I'm afraid I won't have time to look into this anytime soon. Having Metal-based prompt processing implemented in master makes this feature less important, although it might still have benefits when using the ANE with F16 models.

vimpunk commented 1 year ago

Thanks for the answer @ggerganov. To confirm I understand correctly: since the ANE requires at least F16 precision, it doesn't benefit smaller quantized models, correct? If so, I wonder whether dequantizing to F16 only the matrices that are sent to the ANE would be worth it, trading higher RAM usage for the potential performance and energy benefits. This depends on which parts of the LLM architecture could benefit from the ANE.

If you could share more details, I may attempt an implementation just to see. The key question is: which parts of the LLM would benefit from being executed on the ANE?
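
To make the RAM trade-off concrete, here is a rough numpy sketch of the idea, assuming ggml's Q8_0 layout (blocks of 32 int8 weights sharing one F16 scale): dequantizing to F16 roughly doubles weight memory, from 1 byte to 2 bytes per weight, in exchange for ANE-compatible tensors.

```python
import numpy as np

BLOCK = 32  # Q8_0 block size in ggml

def dequantize_q8_0_to_f16(qs: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """qs: int8, shape (n_blocks, BLOCK); scales: f16, shape (n_blocks,)."""
    # w[i] = scale_of_block(i) * q[i], computed in F16 for the ANE
    return (qs.astype(np.float16) * scales[:, None]).reshape(-1)

# Example: two blocks of quantized weights
qs = np.random.randint(-127, 128, size=(2, BLOCK), dtype=np.int8)
scales = np.array([0.01, 0.02], dtype=np.float16)
w_f16 = dequantize_q8_0_to_f16(qs, scales)
print(w_f16.dtype, w_f16.shape)  # float16 (64,)
```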

qdrddr commented 7 months ago

Description

Please consider adding Core ML model package format support to utilize the Apple Silicon Neural Engine + GPU.

Additional Context

List of Core ML package format models

https://github.com/likedan/Awesome-CoreML-Models

Success Criteria

Utilize both the ANE & GPU, not just the GPU, on Apple Silicon.

qdrddr commented 7 months ago

Additional information that I hope can be handy about running LLMs locally on Apple Silicon. Core ML is a framework that can distribute workloads across the CPU, GPU & Neural Engine (ANE). The ANE is available on all modern Apple devices: iPhones & Macs (A14 or newer and M1 or newer). Ideally, we want to run LLMs on the ANE only, as it has optimizations for running ML tasks compared to the GPU. Apple claims "deploying your Transformer models on Apple devices with an A14 or newer and M1 or newer chip to achieve up to 10 times faster and 14 times lower peak memory consumption compared to baseline implementations".

  1. To utilize Core ML, you first need to convert a model from TensorFlow or PyTorch to the Core ML model package format using coremltools (or simply use an existing model already in Core ML package format).
  2. Second, you must use that converted package with an implementation designed for Apple devices. Here is Apple's reference PyTorch implementation for the Neural Engine (a short Python sketch of loading and running such a package follows the link below):

https://machinelearning.apple.com/research/neural-engine-transformers
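
As a concrete example of step 2 from the Python side, here is a minimal sketch of loading a converted package with coremltools and restricting execution to CPU + Neural Engine. The file name `prompt_encoder.mlpackage` and the input name `embeddings` are placeholders, not real llama.cpp artifacts.

```python
import numpy as np
import coremltools as ct

# CPU_AND_NE keeps the GPU out of the picture; use ComputeUnit.ALL to let
# Core ML schedule each op on CPU, GPU, or ANE as it sees fit.
model = ct.models.MLModel(
    "prompt_encoder.mlpackage",          # placeholder package name
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)

x = np.random.rand(1, 512, 4096).astype(np.float32)  # (batch, n_ctx, d_model)
out = model.predict({"embeddings": x})               # placeholder input name
print({k: v.shape for k, v in out.items()})
```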

qdrddr commented 7 months ago

You might also be interested in another implementation, Swift Transformers. Example of a CoreML application: https://github.com/huggingface/swift-chat

EnderRobber101 commented 5 months ago

Any updates?

qdrddr commented 5 months ago

2024 CoreML updates: https://developer.apple.com/documentation/updates/coreml

WWDC 2024 sessions about CoreML:

https://developer.apple.com/wwdc24/10160
https://developer.apple.com/wwdc24/10218
https://developer.apple.com/wwdc24/10161