MinishLab / model2vec

Model2Vec: Distill a Small Fast Model from any Sentence Transformer

Integration in other tools: cli & js (transformers.js) #75

Open · do-me opened this issue 1 day ago

do-me commented 1 day ago

Hey folks,

this package is absolutely awesome! I'm always on the lookout for performant small models, so this is a goldmine for me. I have some questions/possible feature ideas for getting static models to support real-life use cases.

  1. CLI for embeddings: I'd love a simple CLI for embeddings, similar to what llama.cpp offers. The background is that small models can be loaded quickly and used to generate a query vector for an existing set of embeddings. My personal use case would be a minimal note-taking app with advanced search but a low memory footprint. The major advantage here would be that one does not need to keep the model loaded (using VRAM) all the time.

  2. An integration in transformers.js would be amazing! This way, downstream projects using embeddings, like SemanticFinder which I'm working on, could be accelerated so much! Maybe you could ping @Xenova for this if interested. Alternatively, is there already a way to use a distilled static model in JS somehow? If so, could you document it somewhere?

  3. Could you also open Discussions in this repo?

Really excited to give these models a try, thanks for building this!

stephantul commented 1 day ago

Hello @do-me!

Thanks for your issue. Sounds good all around; I've replied to each of your points below:

  1. CLI sounds good. Note that we already offer a distillation CLI, which is documented in the last code block here. But from what I gather, you would like a CLI that takes a query, a model name, and a pre-computed corpus as input, and then returns the most similar items from the corpus as a JSON document, e.g.:

model2vec query -q "hello world" -m "minishlab/m2v_base_output" --corpus my_corpus.vec
{"response": 
["hello moon", 0.89,
 "hello lion", 0.85,
 "hello goodbye", 0.80
 ...
 ]
 }

And, to create a corpus, something like this?

model2vec create -i my_corpus_files/*.txt -m "minishlab/m2v_base_output" -o my_corpus.vec

The JSON above is just a quick sketch I typed here 😄. Let me know if this is what you are looking for; it should be pretty easy to build using reach, an in-memory vector DB I previously made (a rough sketch of this flow is included after this list).

  2. You are the second person to ask for this today, so let's do it. I haven't worked with transformers.js at all, but we'll probably figure it out. Our modeling footprint is super tiny, and we're very compatible with Hugging Face (although not 100% transformers-compatible).

  3. Done, thanks for the suggestion.
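
To make the idea concrete, here is a minimal sketch of what such a query could do under the hood. It is not a final design: it assumes the StaticModel.encode API, uses plain numpy cosine similarity in place of reach, and the corpus handling and JSON layout are purely illustrative.

import json
import numpy as np
from model2vec import StaticModel

# Load the static model once (fast, no GPU required)
model = StaticModel.from_pretrained("minishlab/M2V_base_output")

# "Corpus": documents embedded once and persisted alongside their texts
docs = ["hello moon", "hello lion", "hello goodbye"]
doc_vectors = model.encode(docs)  # shape: (n_docs, dim)

# Query: embed, score by cosine similarity, return the ranked hits as JSON
query_vector = model.encode(["hello world"])[0]
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
ranking = sorted(zip(docs, scores.tolist()), key=lambda pair: pair[1], reverse=True)
print(json.dumps({"response": [[doc, round(score, 2)] for doc, score in ranking]}))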

Let me know! Thanks! Stéphan

do-me commented 1 day ago

  1. This is pretty much exactly what I am looking for! Maybe I'd distinguish between two options: retrieving the embeddings vs. the similarity scores. When offering the latter, you'd probably want to integrate several different distance functions (cosine, Manhattan, etc.) too.

The corpus creation function would be amazing! I wanted to build something similar with conventional embeddings and sqlite-vec, but I'd prefer your solution for simplicity.

Anyway, such a query would allow for really cool static apps (without the overhead of having to run a server). E.g., a minimal use case would be to query a personal directory of notes/documents by persisting an index once. I'd have a few more questions here:

  - Do I understand correctly that thanks to the word-level embeddings and averaging there is no more context limitation (like 8192 tokens for many models) and input docs can be of arbitrary length?
  - Is there some kind of warm-up overhead? Like how long does a first-time load & call of the model for embeddings take?

  2. I opened this issue in transformers.js to keep track: https://github.com/xenova/transformers.js/issues/970

Super excited about what's to come!

stephantul commented 1 day ago

Cool! Thanks for opening that issue on Transformers.js! I'll take a look and support where possible ❤️

Integrating multiple similarity functions, and returning either embeddings or raw text, is fine; all of that is supported in reach.

> Do I understand correctly that thanks to the word-level embeddings and averaging there is no more context limitation (like 8192 tokens for many models) and input docs can be of arbitrary length?

Yep! We limit to 512 tokens by default, but you can just set this to None in the encode call to embed texts of arbitrary length.
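
A minimal sketch of this, assuming the 512-token limit is exposed as a max_length argument on encode that accepts None:

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_base_output")

# Default behaviour: inputs are truncated to 512 tokens
short_embeddings = model.encode(["a short note"])

# Disable truncation to embed arbitrarily long documents
very_long_document = "some text " * 10_000
long_embeddings = model.encode([very_long_document], max_length=None)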

> Is there some kind of warm-up overhead? Like how long does a first-time load & call of the model for embeddings take?

It takes about 27ms to load a model from disk, so that should be fine. Loading the embeddings should also be something on that order of magnitude.
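
If you want to check these numbers on your own machine, here is a quick timing sketch (note that the first from_pretrained call may also download and cache the model, so run it twice to measure the pure load from disk):

import time
from model2vec import StaticModel

# Time the model load
start = time.perf_counter()
model = StaticModel.from_pretrained("minishlab/M2V_base_output")
print(f"model load: {(time.perf_counter() - start) * 1000:.1f} ms")

# Time the first embedding call
start = time.perf_counter()
model.encode(["hello world"])
print(f"first encode call: {(time.perf_counter() - start) * 1000:.1f} ms")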

scorpfromhell commented 1 day ago

Just to give you an idea of what is possible with the GGUF format (which llama.cpp, and hence wllama, supports) instead of ONNX (Transformers.js supports ONNX, not GGUF), please do take a look at the demo app of the wllama project on GitHub.

If you're able to make the output of Model2Vec work with either Transformers.js or wllama, it will be a great help for WebAI in general.

Thanks in advance. 🙏🏼😃🤞🏼

stephantul commented 16 hours ago

@scorpfromhell Thanks for the heads-up, I'm definitely taking a look at that as well.

Thanks for the input everyone!

xenova commented 6 hours ago

Regarding ONNX export, here's a simple conversion script I wrote quickly:

import torch
import numpy as np
from model2vec import StaticModel

# Load a pretrained Model2Vec model
model = StaticModel.from_pretrained("minishlab/M2V_base_output")

# Patch the forward method to separate arguments
original_forward = model.forward
def patched_forward(input_ids, offsets):
  return original_forward((input_ids, offsets))
model.forward = patched_forward

# Dummy data
texts = ['hello', 'hello world']
encodings = model.tokenizer.encode_batch(texts, add_special_tokens=False)
encodings_ids = [encoding.ids for encoding in encodings]
offsets = torch.from_numpy(np.cumsum([0] + [len(token_ids) for token_ids in encodings_ids[:-1]]))
input_ids = torch.tensor([token_id for token_ids in encodings_ids for token_id in token_ids], dtype=torch.long)

# Export the model
torch.onnx.export(model,                     # model being run
                  (input_ids, offsets),      # model input (or a tuple for multiple inputs)
                  "model.onnx",              # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=14,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names=['input_ids', 'offsets'], # the model's input names
                  output_names=['embeddings'],  # the model's output names
                  dynamic_axes={'input_ids' : {0 : 'sequence_length'},    # variable length axes
                                'offsets' : {0 : 'sequence_length'},
                                'embeddings' : {0 : 'batch_size'},
                                }
                  )
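
As an optional sanity check (not part of the script above), the exported graph can be run with onnxruntime and compared against the patched PyTorch forward, reusing the input_ids, offsets and patched_forward defined above; this assumes onnxruntime is installed:

import onnxruntime as ort

# Run the exported ONNX model on the same inputs
session = ort.InferenceSession("model.onnx")
onnx_embeddings = session.run(
    None,
    {"input_ids": input_ids.numpy(), "offsets": offsets.numpy()},
)[0]

# Compare against the PyTorch output used for the export
torch_embeddings = patched_forward(input_ids, offsets).detach().numpy()
print(np.allclose(onnx_embeddings, torch_embeddings, atol=1e-5))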

And then optionally simplify the model with a tool like onnxsim.
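
For example, with the onnxsim Python API (assuming onnxsim is installed):

import onnx
from onnxsim import simplify

# Load the exported model, simplify the graph, and validate the result
onnx_model = onnx.load("model.onnx")
simplified_model, check = simplify(onnx_model)
assert check, "simplified ONNX model could not be validated"
onnx.save(simplified_model, "model_simplified.onnx")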

Will share example Transformers.js code shortly.

xenova commented 6 hours ago

Example transformers.js code:

import { AutoModel, AutoTokenizer, Tensor } from '@huggingface/transformers';

const model = await AutoModel.from_pretrained('minishlab/M2V_base_output', {
    config: { model_type: 'model2vec' },
    dtype: 'fp32',
    revision: 'refs/pr/1',
});

const tokenizer = await AutoTokenizer.from_pretrained('minishlab/M2V_base_output', {
    revision: 'refs/pr/2',
});

const texts = ['hello', 'hello world'];
const { input_ids } = await tokenizer(texts, { add_special_tokens: false, return_tensor: false });

const cumsum = arr => arr.reduce((acc, num, i) => [...acc, num + (acc[i - 1] || 0)], []);
const offsets = [0, ...cumsum(input_ids.slice(0, -1).map(x => x.length))];

const flattened_input_ids = input_ids.flat();
const model_inputs = {
    input_ids: new Tensor('int64', flattened_input_ids, [flattened_input_ids.length]),
    offsets: new Tensor('int64', offsets, [offsets.length]),
}
const { embeddings } = await model(model_inputs);
console.log(embeddings.tolist()); // output matches python version

A bit more manual since we don't yet support the model2vec model type, but if there's enough demand, we could add it 👍
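
For reference, the Python-side embeddings that the comment in the snippet compares against can be produced with the model2vec API used earlier in this thread (a sketch; whether the values match to the last decimal depends on the encode defaults):

from model2vec import StaticModel

# Embed the same two texts as in the Transformers.js example
model = StaticModel.from_pretrained("minishlab/M2V_base_output")
embeddings = model.encode(["hello", "hello world"])
print(embeddings)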