Closed: louis030195 closed this issue 1 year ago
Performance Comparison: In our paper TSDAE we compare approaches for sentence embedding tasks, and in GPL we compare them for semantic search tasks (given a query, find relevant passages). While the unsupervised approaches achieve acceptable performance for sentence embedding tasks, they perform poorly for semantic search tasks.
https://www.sbert.net/examples/unsupervised_learning/README.html#performance-comparison
There is also the possibility of computing weak labels from existing links, tags, or the closeness of notes, to use for supervised fine-tuning.
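As a rough sketch of the weak-labeling idea, one could mine positive training pairs from `[[wikilinks]` between notes, on the assumption that linked notes are semantically related. This is an illustrative stdlib-only example; the function name and the simplistic link regex are hypothetical, not part of any existing codebase:

```python
import re

# Matches the target of an Obsidian [[wikilink]], ignoring aliases (|) and headings (#).
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def weak_pairs(notes: dict[str, str]) -> list[tuple[str, str]]:
    """Build weak positive pairs (note text, linked note text) from wikilinks.

    Linked notes are assumed related, so each link yields one training pair
    usable with a pair-based loss such as MultipleNegativesRankingLoss.
    """
    pairs = []
    for title, text in notes.items():
        for target in WIKILINK.findall(text):
            target = target.strip()
            if target in notes and target != title:
                pairs.append((text, notes[target]))
    return pairs

# Toy vault: only the linked pair should be extracted.
notes = {
    "Transformers": "See [[Attention]] for the core mechanism.",
    "Attention": "Scaled dot-product attention weighs token interactions.",
    "Groceries": "Buy milk and eggs.",
}
print(weak_pairs(notes))
```

The resulting pairs could then feed a standard sentence-transformers training loop; unlinked notes in the same batch act as in-batch negatives.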
We could also publish on the Hugging Face Hub a fine-tuned sentence embedding model suited to general-purpose Obsidian vault semantic search (or multimodal search).
Idea: make a script that each individual can run on their vault to aggregate their public vault data into a Hugging Face dataset (with some args to filter in/out only what is publicly shareable, e.g. `publish: true` or some tag/folder). Then we could fine-tune an Obsidian note embedding model.
How hard would it be to add something like this: https://twitter.com/rileytomasek/status/1603854647575384067?s=46&t=if935fDFIydWWmNtFn-R4g
@arminta7 thanks for the feedback :). It is indeed on the roadmap
TODOs:
This is important information that would influence whether we try to implement inference directly in JS (ONNX, TFJS, etc.), which would make fine-tuning more difficult.
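Whichever JS runtime is chosen, it would have to reproduce the post-processing most sentence embedding models apply after the transformer forward pass. A minimal stdlib sketch of that math (mean pooling over non-padding tokens, then cosine similarity), using toy vectors rather than real model outputs:

```python
import math

def mean_pool(token_embeddings: list[list[float]], mask: list[int]) -> list[float]:
    """Average token embeddings over positions where the attention mask is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(token_embeddings, mask):
        if m:
            count += 1
            for i in range(dim):
                total[i] += vec[i]
    return [t / count for t in total]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy 2-token, 3-dim "embeddings"; the second token is padding (mask = 0).
emb = mean_pool([[1.0, 0.0, 1.0], [9.0, 9.0, 9.0]], mask=[1, 0])
print(cosine(emb, [1.0, 0.0, 1.0]))  # close to 1.0: identical direction
```

Porting this part is straightforward; the hard part in JS is running the transformer itself (e.g. via onnxruntime-web), and an exported ONNX graph is frozen, which is why in-browser inference makes later fine-tuning harder.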