Open do-me opened 1 day ago
Hello @do-me!
Thanks for your issue. Sounds good all around, I've replied to each of your points below:
And then returns the most similar items from the set of documents as a JSON document. e.g.:
model2vec query -q "hello world" -m "minishlab/m2v_base_output" --corpus my_corpus.vec
{"response":
["hello moon", 0.89,
"hello lion", 0.85,
"hello goodbye", 0.80
...
]
}
And, to create a corpus, something like this?
model2vec create -i my_corpus_files/*.txt -m "minishlab/m2v_base_output" -o my_corpus.vec
The output JSON is probably malformed, I just typed it here. Let me know if this is what you are looking for; it should be pretty easy to build using an in-memory vector DB I previously made, reach.
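To make the proposed create/query round trip concrete, here is a minimal sketch in plain Python. It is not the reach or Model2Vec API: `toy_embed`, `create_corpus`, `query_corpus`, and the JSON-on-disk format are all hypothetical stand-ins, and the toy embedding only mimics "average the token vectors" so the example runs without a model. The response uses well-formed `[text, score]` pairs.

```python
import json
import math
import tempfile
from pathlib import Path

DIM = 16  # toy dimensionality, chosen only for this sketch


def toy_embed(text):
    """Hypothetical stand-in for a real encode call: mean of per-token vectors."""
    tokens = text.lower().split()
    vec = [0.0] * DIM
    for tok in tokens:
        seed = sum(map(ord, tok))  # deterministic pseudo-embedding per token
        for i in range(DIM):
            vec[i] += math.sin(seed * (i + 1))
    return [v / max(len(tokens), 1) for v in vec]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def create_corpus(texts, path):
    """Embed every text once and persist texts plus vectors to disk."""
    records = [{"text": t, "vector": toy_embed(t)} for t in texts]
    Path(path).write_text(json.dumps(records))


def query_corpus(query, path, top_k=3):
    """Load the persisted corpus and return the top_k most similar texts."""
    records = json.loads(Path(path).read_text())
    q = toy_embed(query)
    scored = [(r["text"], cosine(q, r["vector"])) for r in records]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return {"response": [[text, round(score, 4)] for text, score in scored[:top_k]]}


with tempfile.TemporaryDirectory() as tmp:
    corpus_path = Path(tmp) / "my_corpus.json"
    create_corpus(["hello world", "hello moon", "hello lion", "hello goodbye"], corpus_path)
    result = query_corpus("hello world", corpus_path)
    print(json.dumps(result))
```

The point of the design is that the corpus is embedded once and only the query needs embedding at search time, so nothing has to stay loaded between calls.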
You are the second person to ask for this today, so let's do it. I haven't worked with transformers.js at all, but we'll probably figure it out. Our modeling footprint is super tiny, and we're very compatible with Hugging Face (although not 100% transformers compatible).
Done, thanks for the suggestion.
Let me know! Thanks! Stéphan
The corpus creation function would be amazing! I wanted to build something similar with conventional embeddings and sqlite-vec but I'd prefer your solution for simplicity.
Anyway, such a query would allow for really cool static apps (without the overhead of having to run a server). E.g., a minimal use case would be to query a personal directory of notes/documents by persisting an index once. I'd have a few more questions here:
Super excited about what's to come!
Cool! Thanks for opening that issue on Transformers.js! I'll take a look and support where possible ❤️
Integrating multiple similarity scores and/or embeddings or raw text is fine, all supported in reach.
Do I understand correctly that thanks to the word-level embeddings and averaging there is no more context limitation (like 8192 tokens for many models) and input docs can be of arbitrary length?
Yep! We limit to 512 tokens by default, but you can just put this to None in the encode call to encode arbitrary texts.
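As a rough illustration of why averaging removes the context limit, here is a toy mean-pooling sketch. The function `mean_pool` and the keyword name `max_length` are assumptions made for this example (the reply above only says the limit can be set to None in the encode call); the vectors are made up.

```python
def mean_pool(token_vectors, max_length=512):
    """Average token vectors; max_length=None keeps every token."""
    if max_length is not None:
        token_vectors = token_vectors[:max_length]  # truncate before pooling
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# 1000 toy tokens: the first 512 are [1.0, 0.0], the remaining 488 are [0.0, 1.0]
tokens = [[1.0, 0.0]] * 512 + [[0.0, 1.0]] * 488

truncated = mean_pool(tokens)                # only the first 512 tokens contribute
full = mean_pool(tokens, max_length=None)    # every token contributes

print(truncated)  # [1.0, 0.0]
print(full)       # [0.512, 0.488]
```

Either way the output has the same fixed size, so the limit is a default rather than an architectural constraint.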
Is there some kind of warm-up overhead? Like how long does a first time load & call of the model for embeddings take?
It takes about 27ms to load a model from disk, so that should be fine. Loading the embeddings should also be something on that order of magnitude.
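If you want to check the warm-up cost on your own machine, a small timing harness along these lines should do. `load_model` and `embed` below are hypothetical stand-ins so the sketch runs on its own; swap in the real load and encode calls to get meaningful numbers.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0


def load_model():
    """Hypothetical stand-in for loading a static model from disk."""
    return {"hello": [0.1, 0.2], "world": [0.3, 0.4]}


def embed(model, text):
    """Hypothetical stand-in for an encode call: mean of known token vectors."""
    vectors = [model[tok] for tok in text.split() if tok in model]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


model, load_ms = timed(load_model)
_, first_call_ms = timed(embed, model, "hello world")   # includes any lazy setup
_, second_call_ms = timed(embed, model, "hello world")  # steady-state cost

print(f"load: {load_ms:.3f} ms, first call: {first_call_ms:.3f} ms, "
      f"second call: {second_call_ms:.3f} ms")
```

Comparing the first and second call separates one-off setup from the per-query cost.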
Just to give you an idea of what is possible with the gguf format (llama.cpp & hence wllama supports this format) instead of onnx (Transformers.js supports onnx, not gguf) please do take a look at the demo app of the wllama project on GitHub.
If you're able to make the output of Model2Vec work with either Transformers.js or wllama, it will be a great help for WebAI in general.
Thanks in advance.
@scorpfromhell Thanks for the heads up, definitely taking a look at that as well.
Thanks for the input everyone!
Regarding ONNX export, here's a simple conversion script I wrote quickly:
import torch
import numpy as np
from model2vec import StaticModel
# Load a pretrained Model2Vec model
model = StaticModel.from_pretrained("minishlab/M2V_base_output")
# Patch the forward method to separate arguments
original_forward = model.forward
def patched_forward(input_ids, offsets):
    return original_forward((input_ids, offsets))
model.forward = patched_forward
# Dummy data
texts = ['hello', 'hello world']
encodings = model.tokenizer.encode_batch(texts, add_special_tokens=False)
encodings_ids = [encoding.ids for encoding in encodings]
offsets = torch.from_numpy(np.cumsum([0] + [len(token_ids) for token_ids in encodings_ids[:-1]]))
input_ids = torch.tensor([token_id for token_ids in encodings_ids for token_id in token_ids], dtype=torch.long)
# Export the model
torch.onnx.export(
    model,                                    # model being run
    (input_ids, offsets),                     # model input (or a tuple for multiple inputs)
    "model.onnx",                             # where to save the model (can be a file or file-like object)
    export_params=True,                       # store the trained parameter weights inside the model file
    opset_version=14,                         # the ONNX version to export the model to
    do_constant_folding=True,                 # whether to execute constant folding for optimization
    input_names=['input_ids', 'offsets'],     # the model's input names
    output_names=['embeddings'],              # the model's output names
    dynamic_axes={
        'input_ids': {0: 'sequence_length'},  # variable length axes
        'offsets': {0: 'sequence_length'},
        'embeddings': {0: 'batch_size'},
    },
)
And then optionally simplify the model with a tool like onnxsim
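For anyone unfamiliar with the two inputs, the script above feeds the model one flat list of token ids plus an `offsets` list marking where each text starts (the same layout PyTorch's `EmbeddingBag` accepts). The sketch below shows that layout in plain Python; the helper names, token ids, and embedding table are made up for illustration, and the per-text averaging is an assumption based on the averaging described earlier in this thread.

```python
from itertools import accumulate


def flatten_with_offsets(batch_ids):
    """Turn per-text id lists into (flat ids, start offset of each text)."""
    if not batch_ids:
        return [], []
    lengths = [len(ids) for ids in batch_ids]
    offsets = [0] + list(accumulate(lengths[:-1]))
    flat = [tok for ids in batch_ids for tok in ids]
    return flat, offsets


def split_by_offsets(flat, offsets):
    """Recover the per-text id lists from the flat layout."""
    bounds = offsets + [len(flat)]
    return [flat[start:end] for start, end in zip(bounds, bounds[1:])]


def bag_mean(flat, offsets, table, dim):
    """Average the table rows of each text; an empty text yields a zero vector."""
    pooled = []
    for ids in split_by_offsets(flat, offsets):
        if not ids:
            pooled.append([0.0] * dim)
            continue
        pooled.append([sum(table[t][i] for t in ids) / len(ids) for i in range(dim)])
    return pooled


# Made-up token ids for two texts and a made-up 2-dimensional embedding table
batch = [[5], [5, 9]]
table = {5: [1.0, 3.0], 9: [3.0, 5.0]}

flat_ids, offsets = flatten_with_offsets(batch)
embeddings = bag_mean(flat_ids, offsets, table, dim=2)

print(flat_ids)    # [5, 5, 9]
print(offsets)     # [0, 1]
print(embeddings)  # [[1.0, 3.0], [2.0, 4.0]]
```

Note that `offsets` has one entry per text while `flat_ids` has one entry per token, so the two inputs generally have different lengths.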
Will share example Transformers.js code shortly.
Example transformers.js code:
import { AutoModel, AutoTokenizer, Tensor } from '@huggingface/transformers';
const model = await AutoModel.from_pretrained('minishlab/M2V_base_output', {
  config: { model_type: 'model2vec' },
  dtype: 'fp32',
  revision: 'refs/pr/1',
});
const tokenizer = await AutoTokenizer.from_pretrained('minishlab/M2V_base_output', {
  revision: 'refs/pr/2',
});
const texts = ['hello', 'hello world'];
const { input_ids } = await tokenizer(texts, { add_special_tokens: false, return_tensor: false });
const cumsum = arr => arr.reduce((acc, num, i) => [...acc, num + (acc[i - 1] || 0)], []);
const offsets = [0, ...cumsum(input_ids.slice(0, -1).map(x => x.length))];
const flattened_input_ids = input_ids.flat();
const model_inputs = {
  input_ids: new Tensor('int64', flattened_input_ids, [flattened_input_ids.length]),
  offsets: new Tensor('int64', offsets, [offsets.length]),
};
const { embeddings } = await model(model_inputs);
console.log(embeddings.tolist()); // output matches python version
A bit more manual since we don't yet support the model2vec model type, but if there's enough demand, we could add it.
Hey folks,
this package is absolutely awesome! I'm always watching out for performant small models, so this is a goldmine for me. I have some questions/possible feature ideas for getting static models to support real life use cases.
CLI for embeddings: I'd love a simple CLI for embeddings, similar to what llama.cpp offers. The background is that small models can be quickly loaded and used to generate a query vector for an existing set of embeddings. My personal use case would be a minimal note-taking app with advanced search but a low memory footprint. The major advantage here would be that one does not need to keep the model loaded (using VRAM) all the time.
An integration in transformers.js would be amazing! This way, downstream projects using embeddings, like the one I'm working on (SemanticFinder), could be accelerated so much! Maybe you could ping @Xenova for this if interested. Alternatively, is there already a way to use a distilled static model in JS somehow? If so, could you document it somewhere?
Could you also open Discussions in this repo?
Really excited to give these models a try, thanks for building this!