<img src="fllama_header.png" alt="fllama image header, bird like Flutter mascot riding a llama. Text reads: FLLAMA. Run cutting-edge, free, large language models. Near ChatGPT quality. On any platform. iOS, macOS, Android, Windows, Linux, and Web. Metal GPU support." style="float: left; margin-right: 0px;" />
Platform | Status |
---|---|
Android | |
iOS | |
Linux | |
macOS | |
Web | |
Windows | |
- `fllamaChatCompletionAsync` was renamed to `fllamaChat`.
- Use `fllamaChat` unless you're doing something unusual with LLMs that isn't user-facing; the non-chat call acts like a true text completion engine instead of a chatbot.

Add to your `pubspec.yaml`:

```yaml
dependencies:
  fllama:
    git:
      url: https://github.com/Telosnex/fllama.git
      ref: main
```
```dart
import 'package:fllama/fllama.dart';

String latestResult = "";

final request = OpenAiRequest(
  maxTokens: 256,
  messages: [
    Message(Role.system, 'You are a chatbot.'),
    Message(Role.user, messageText),
  ],
  numGpuLayers: 99,
  /* this seems to have no adverse effects in environments w/o GPU support, ex. Android and web */
  modelPath: _modelPath!,
  mmprojPath: _mmprojPath,
  frequencyPenalty: 0.0,
  // Don't use below 1.1, LLMs without a repeat penalty
  // will repeat the same token.
  presencePenalty: 1.1,
  topP: 1.0,
  // Proportional to RAM use.
  // 4096 is a good default.
  // 2048 should be considered on devices with low RAM (<8 GB)
  // 8192 and higher can be considered on device with high RAM (>16 GB)
  // Models are trained on <= a certain context size. Exceeding that number
  // can/will lead to completely incoherent output.
  contextSize: 2048,
  // Don't use 0.0, some models will repeat the same token.
  temperature: 0.1,
  logger: (log) {
    // ignore: avoid_print
    print('[llama.cpp] $log');
  },
);

fllamaChat(request, (response, done) {
  setState(() {
    latestResult = response;
  });
});
```
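The `contextSize` guidance in the request above can be captured in a small helper. This is only a sketch: `suggestContextSize` and its parameters are hypothetical names, not part of fllama, and how you obtain the device's RAM and the model's trained context size is left to your app.

```dart
import 'dart:math' as math;

/// Suggests a context size from device RAM, following the rule of thumb in
/// the request comments: 2048 below 8 GB, 8192 above 16 GB, 4096 otherwise.
/// The result is capped at [modelTrainedContext], since exceeding the size a
/// model was trained on can lead to incoherent output.
/// NOTE: hypothetical helper, not a fllama API.
int suggestContextSize({
  required double deviceRamGb,
  required int modelTrainedContext,
}) {
  final int byRam;
  if (deviceRamGb < 8) {
    byRam = 2048;
  } else if (deviceRamGb > 16) {
    byRam = 8192;
  } else {
    byRam = 4096;
  }
  return math.min(byRam, modelTrainedContext);
}
```

The returned value could then be passed as `contextSize` when building the request.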
## Tips & Tricks
### Web
Web is __extremely__ slow, e.g. on a MBP M2 Max with 64 GB of VRAM, it does ~2 tokens/second with a 3B parameter model. It's best seen as validation for your users that you will strive to support free LLMs everywhere.
To install: copy the `fllama_wasm*` files from `example/web` to your app's `web` directory, then add the `<script>` tag found in `example/web/index.html` to your app's `web/index.html`.
### Recommended models
3 top-tier open models are in the [fllama HuggingFace repo](https://huggingface.co/telosnex/fllama/tree/main).
- __Stable LM 3B__ is the first LLM that can handle RAG, using documents such as web pages to answer a query, on *all* devices.
> Mistral models via [Nous Research](https://nousresearch.com/).
They trained and finetuned the Mistral base models for chat to create the OpenHermes series of models.
- __Mistral 7B__ is best on 2023 iPhones or 2024 Androids or better.
It's about 2/3 the speed of Stable LM 3B and requires 5 GB of RAM.
- __Mixtral 8x7B__ should only be considered on a premium laptop or desktop,
such as an M-series MacBook or a premium desktop purchased in 2023
or later. It's about 1/3 the speed of Stable LM 3B and requires
26 GB of RAM.
### RAM Requirements
Roughly: you'll need as much RAM as the model file size.
If inference runs on CPU, that much regular RAM is required.
If inference runs on GPU, that much GPU RAM is required.
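As a rough illustration of this rule of thumb, the check below compares a model's file size against a memory budget. `modelLikelyFits` and its parameters are hypothetical names; the memory budget is something your app has to supply, and the check ignores the extra memory used by the context.

```dart
/// Rough check: the model needs about as much RAM (or GPU RAM, when
/// inference runs on GPU) as its file size.
/// NOTE: hypothetical helper, not a fllama API. It does not account for
/// context size, which also increases memory use.
bool modelLikelyFits({
  required int modelFileBytes,
  required int availableMemoryBytes,
}) {
  return modelFileBytes <= availableMemoryBytes;
}
```

On non-web platforms, `modelFileBytes` can be read with `File(modelPath).lengthSync()` from `dart:io`.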
### Download files from HuggingFace at runtime
HuggingFace, among many things, can be thought of as GitHub for AI models.
You can download a model from HuggingFace and use it with fllama.
To get a download URL at runtime, see below.
```dart
String getHuggingFaceUrl(
    {required String repoId,
    required String filename,
    String? revision,
    String? subfolder}) {
  // Default values
  const String defaultEndpoint = 'https://huggingface.co';
  const String defaultRevision = 'main';

  // Ensure the revision and subfolder are not null and are URI encoded
  final String encodedRevision =
      Uri.encodeComponent(revision ?? defaultRevision);
  final String encodedFilename = Uri.encodeComponent(filename);
  final String? encodedSubfolder =
      subfolder != null ? Uri.encodeComponent(subfolder) : null;

  // Handle subfolder if provided
  final String fullPath = encodedSubfolder != null
      ? '$encodedSubfolder/$encodedFilename'
      : encodedFilename;

  // Construct the URL
  final String url =
      '$defaultEndpoint/$repoId/resolve/$encodedRevision/$fullPath';

  return url;
}
```
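Once you have a URL, the file still has to be fetched. The sketch below is one possible way to do that with `dart:io`, streaming the response to disk so the whole model never sits in memory. `downloadModel` is a hypothetical name, not part of fllama. It assumes a non-web platform (`dart:io` is unavailable on web) and a public file, and it leaves out progress reporting, resuming, and authentication.

```dart
import 'dart:io';

/// Downloads [url] to [savePath] and returns the written file.
/// NOTE: hypothetical helper, not a fllama API.
Future<File> downloadModel(String url, String savePath) async {
  final uri = Uri.parse(url);
  final client = HttpClient();
  try {
    // HttpClient follows redirects for GET requests by default.
    final request = await client.getUrl(uri);
    final response = await request.close();
    if (response.statusCode != HttpStatus.ok) {
      // Discard the body so the connection can be released.
      await response.drain<void>();
      throw HttpException(
        'Download failed with status ${response.statusCode}',
        uri: uri,
      );
    }
    final file = File(savePath);
    // pipe() writes every chunk to the sink and closes it when done.
    await response.pipe(file.openWrite());
    return file;
  } finally {
    client.close();
  }
}
```

The returned file's `path` can then be used as the `modelPath` of a request.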
## Licensing
FLLAMA is licensed under a dual-license model.
The code as-is on GitHub is licensed under GPL v2. That requires distribution of the integrating app's source code, and this is unlikely to be desirable for commercial entities. See LICENSE.md.
Commercial licenses are also available. Contact info@telosnex.com. Expect very fair terms: our intent is to charge only entities that have a launched app, are making a lot of money, and use FLLAMA as a core dependency. The base agreement is here: https://github.com/lawndoc/dual-license-templates/blob/main/pdf/Basic-Yearly.pdf
- Run `flutter pub run ffigen --config ffigen.yaml` to generate the bindings that make native code available in Dart.
- Run `rm -rvf Podfile.lock && rm -rvf Podfile && rm -rvf Pods && flutter clean` in `example/macos` or `example/ios` when upgrading cpp files, or when getting cryptic errors about the build cache.