futo-org / android-keyboard

Official FUTO Keyboard Issue Tracker and Source Mirror of https://gitlab.futo.org/keyboard/latinime

Source of Fine-Tuning Data? #238

Open nonetrix opened 3 months ago

nonetrix commented 3 months ago

When I click "fine tune model", where does this data come from? Is it collecting everything I type? Locally, of course, and not uploading it, I would hope. I am personally fine with this and think it is a good feature, but I think this could be clarified. Additionally, maybe it should be kept disabled by default. In theory, an attacker could gain root access and read these logs.

On three unrelated notes: perhaps it should also allow importing your own data from a file. Also, in the future, bigger models would be great, and maybe even models running on another computer with ollama or something similar, if a user wishes. Finally, the loss when fine-tuning seems quite high. I think a smaller loss is generally better for LLMs, and I don't think the training is being done as optimally as it could be.

Also, the autocorrect has been the best I've used! Likely due to using an LM or LLM or whatever, depending on your definition of large. I think recent versions of iOS also employ a small LLM, but I'm not sure, so it might be comparable to that. This is definitely the first one on Android to do this, to my knowledge.
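On the loss point above, a rough way to judge whether a value is "high" is to convert it to perplexity. The sketch below assumes the number shown during fine-tuning is a mean per-token cross-entropy in nats, which nothing in this thread confirms; the function names are made up for illustration.

```python
import math

def mean_cross_entropy(token_probs):
    """Mean negative log-likelihood (in nats) of the correct tokens.

    token_probs: probabilities the model assigned to each correct token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)

# A model that gives every correct token probability 0.5 behaves like
# a fair choice between two options at each step.
loss = mean_cross_entropy([0.5, 0.5, 0.5])
print(loss, perplexity(loss))
```

Under that assumption, a lower loss means the model is, on average, less "surprised" by each next token, but small on-device models will naturally sit at higher values than large ones.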

FetchFast commented 3 months ago

https://gitlab.futo.org/alex/keyboard-wiki/-/wikis/Keyboard-LM-docs says: "(Note: Finetuning is currently disabled by default as its effectiveness has not been properly evaluated, but the following applies if you enable it. Finetuning may not be stable.) As you type things with the keyboard, your typed data is saved locally to temporary storage for later finetuning of the transformer LM. Finetuning is scheduled to run at least once every ~20 hours when your device is idle, plugged in and there's enough data. Under the hood, finetuning trains a LoRA adapter locally on your device, merges it with the original model and saves it. While the original data is deleted after finetuning, the finetuned model's weights may contain the data in some form or another, so we recommend avoiding sharing the finetuned model. You can import and export model files for backup, transferring finetuned models between devices, or importing custom/third-party models. If you want to make your own models, check out the Model Creation section. The files are in .gguf format but with extra metadata, defined in the GGUF Metadata section."

But frustratingly, I can't see a way to see the data that is being used to fine-tune, so we can't diagnose.
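For anyone wondering what "trains a LoRA adapter ... merges it with the original model" means in practice, here is a minimal sketch of the generic LoRA merge step for a single weight matrix. It is not taken from the FUTO Keyboard source: the function names, the `alpha / rank` scaling convention, and the tiny matrices are assumptions for illustration only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def merge_lora(weight, lora_a, lora_b, alpha):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight: out x in base matrix (left unchanged)
    lora_a: rank x in adapter matrix
    lora_b: out x rank adapter matrix
    alpha:  scaling hyperparameter (assumed convention)
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # low-rank update, out x in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

base = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]        # rank 1, in = 2
b = [[1.0], [3.0]]      # out = 2, rank 1
print(merge_lora(base, a, b, alpha=2.0))
```

After a merge like this, the adapter no longer exists as a separate file, which fits the docs' warning that the finetuned weights themselves may carry traces of what was typed.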