About this: I want to query all my pictures on a given subject, using an audio recording as the query.
0) You upload a picture to a bucket and save the URL and the caption (Image-To-Text).
1) You record an audio clip and transcribe it into text. This is "more or less" easy (the JavaScript is not straightforward), but it works well.
2) I want to find images whose caption (saved in the database) is "closely related" to the audio transcript (the query text).
You can do:
Full-Text-Search with keywords. This is a direct search in the database for captions containing these keywords.
Keyword extraction with NER. However, these models are trained to find specific entities such as locations, names, etc., so this is not relevant here since a general search rarely contains real named entities.
Similarity-vector search. You find the nearest neighbours between embeddings with an approximate nearest neighbour (ANN) search. This means you first need to embed (map) some text into a vector space with a given model, save it as an index, and then run an ANN algorithm. More precisely, the flow is the following. When you save an uploaded image URL into the database, you compute the embedding of its caption with the model "sentence-transformers/paraphrase-MiniLM-L6-v2" and save it, more precisely an index entry. When you query with an audio recording, you first run Speech-To-Text (Whisper, see below), then compute the embedding of the transcript with the same model, and finally run an ANN search to find the images that best approximate the audio. This post gives some directions, but it uses the archived library ExFaiss; it says we should favour HNSWLib instead. Not yet working. Help needed!
A bit of reading: https://www.elastic.co/what-is/vector-search
The "easy" part: "speech-to-text". It uses de native browser MediaStream Record API and returns a WAV file (in mono audoi/pcm format) to feed a Bumblebee.Audio.speech_to_text_whisper
module that load the model "openai/whisper-small".
You declare allow_upload(:speech, accept: :any, auto_upload: true, progress: &handle_progress/3).
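A minimal sketch of where this could live in the LiveView mount/3 (the assign names are the ones used further down; the rest is an assumption):

def mount(_params, _session, socket) do
  {:ok,
   socket
   |> assign(micro_off: false, speech_spin: false, transcription: nil)
   |> allow_upload(:speech,
     accept: :any,
     auto_upload: true,
     progress: &handle_progress/3
   )}
end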
The JavaScript hook below is attached to a button that toggles audio recording on and off. It uses the MediaRecorder API.
Some simple HTML like this:
<audio id="audio" controls></audio>
<button
id="record"
class="bg-blue-500 hover:bg-blue-700 text-white font-bold px-4 rounded"
type="button"
phx-hook="Audio"
disabled={@micro_off}
>
<Heroicons.microphone
outline
class="w-6 h-6 text-white font-bold group-active:animate-pulse"
/>
<span>Record</span>
</button>
The Audio "hook":
export default {
mounted() {
let mediaRecorder;
let audioChunks = [];
const recordButton = document.getElementById("record");
const audioElement = document.getElementById("audio");
const _this = this;
recordButton.addEventListener("click", () => {
if (mediaRecorder && mediaRecorder.state === "recording") {
mediaRecorder.stop();
recordButton.textContent = "Record";
} else {
navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
mediaRecorder = new MediaRecorder(stream);
mediaRecorder.start();
recordButton.classList.remove("bg-blue-500", "hover:bg-blue-700");
recordButton.classList.add(
"bg-green-500",
"hover:bg-green-700",
"animate-pulse"
);
recordButton.textContent = "Stop";
mediaRecorder.addEventListener("dataavailable", (event) => {
audioChunks.push(event.data);
});
mediaRecorder.addEventListener("stop", () => {
const audioBlob = new Blob(audioChunks);
console.log(audioBlob);
audioElement.src = URL.createObjectURL(audioBlob);
_this.upload("speech", [audioBlob]);
audioChunks = [];
recordButton.classList.remove(
"bg-green-500",
"hover:bg-green-700",
"animate-pulse"
);
recordButton.classList.add("bg-blue-500", "hover:bg-blue-700");
});
});
}
});
},
};
The Elixir code is copied from the documentation:
def handle_progress(:speech, entry, socket) when entry.done? do
socket
|> consume_uploaded_entry(entry, fn %{path: path} ->
:ok = File.cp!(path, @tmp_wav)
{:ok, @tmp_wav}
end)
audio_task =
Task.Supervisor.async(
App.TaskSupervisor,
fn ->
Nx.Serving.batched_run(Whisper, {:file, @tmp_wav})
end
)
{:noreply, assign(socket, audio_ref: audio_task.ref, micro_off: true, speech_spin: true)}
end
# intermediate progress: a catch-all clause is mandatory for any `allow_upload` (eg :speech, :image_list)
# it handles the upload chunks that arrive before `entry.done?` is true
def handle_progress(_name, _entry, socket), do: {:noreply, socket}
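For reference, @tmp_wav above is a plain module attribute; a sketch consistent with the path deleted later in handle_info (its exact definition is an assumption):

# in the LiveView module
@tmp_wav Path.expand("priv/static/image_uploads/tmp.wav")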
The Bumblebee model runner:
defmodule App.Whisper do
def serving do
{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-small"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-small"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-small"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-small"})
Bumblebee.Audio.speech_to_text_whisper(
whisper,
featurizer,
tokenizer,
generation_config,
chunk_num_seconds: 30,
task: :transcribe,
# stream: true,
defn_options: [compiler: EXLA]
)
end
end
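Since handle_progress/3 calls Nx.Serving.batched_run(Whisper, ...), this serving must also be registered under the name Whisper in the supervision tree; a sketch mirroring the TextEmbedding child shown further down (the batch values are illustrative):

#Application.ex
children = [
  ...,
  {Nx.Serving,
   serving: App.Whisper.serving(),
   name: Whisper,
   batch_size: 4,
   batch_timeout: 100
  },
]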
It remains to code a handle_info({ref, %{chunks: [%{text: text}]}}, socket) callback to use the result of the Task.
def handle_info({ref, result}, %{assigns: assigns} = socket) do
Process.demonitor(ref, [:flush])
cond do
#other tasks (Image-To-Text, example-list-task...)
[...]
#Task speech-to-text
is_map(result) && Map.get(result, :chunks) != nil ->
Path.expand("priv/static/image_uploads/tmp.wav")
|> File.rm!()
%{chunks: [%{text: text}]} = result
# TODO: Task "create embedding from this text "result", and "run ANN search"
{:noreply, assign(socket, transcription: String.trim(text), micro_off: false)}
end
end
Super slow progress, I should spend more time on it. I upload every image to Cloudflare R2, and save the returned URL, the width, the height and the "Image-To-Text" caption into an SQLite db. The "FTS" uses the embedded FTS5 extension (virtual table). For vector search, add the dependency sqlite_vss.
1) Full-Text-Search with a simple input. Several keywords separated by whitespace are allowed.
defmodule App.Repo do
use Ecto.Repo,
otp_app: :app,
adapter: Ecto.Adapters.SQLite3
end
defmodule App.Image do
use Ecto.Schema
import Ecto.Query
alias App.{Image, Repo, User}
@primary_key {:id, :id, autogenerate: true, source: :rowid}
schema "images" do
field(:url, :string)
field(:caption, :string)
field(:width, :integer)
field(:height, :integer)
field(:rank, :float, virtual: true)
belongs_to(:user, User)
timestamps()
end
def changeset(%{} = params) when is_map(params) do
Ecto.Changeset.cast(%Image{}, params, [:url, :caption, :width, :height])
end
def insert(%{} = params) when is_map(params) do
changeset(params)
|> Repo.insert!()
end
def search(q) do
query = String.replace(q, " ", " OR ")
from(img in Image,
select: [:url, :rank, :id, :caption],
where: fragment("images MATCH ?", ^query),
order_by: [asc: :rank]
)
|> Repo.all()
end
end
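For the MATCH fragment and the hidden rank column to work, the images table must be an FTS5 virtual table; a possible migration sketch (the column list simply mirrors part of the schema above and is an assumption):

defmodule App.Repo.Migrations.CreateImagesFts do
  use Ecto.Migration

  def up do
    # FTS5 virtual table: only `caption` is indexed for full-text search
    execute("""
    CREATE VIRTUAL TABLE images
    USING fts5(url UNINDEXED, caption, width UNINDEXED, height UNINDEXED)
    """)
  end

  def down do
    execute("DROP TABLE images")
  end
end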
#page_live.ex
def handle_event("search", %{"query" => ""}, socket) do
{:noreply, assign(socket, message: "")}
end
def handle_event("search", %{"query" => q}, socket) do
l = App.Image.search(q)
message = "#{length(l)} images found for '#{q}'"
# just return this info without displaying all the images via their URL
{:noreply,
assign(socket,
message: message,
fts_images: Enum.map(l, & &1.url),
search_form: to_form(%{"query" => ""}, id: "query")
)}
end
2) Semantic search, aka vector search.
We will use embeddings and ANN. This is based on SentenceTransformers.
2.1) Produce an embedding from the I2T caption.
2.2) Produce an embedding from the S2T (audio transcription).
Both use Bumblebee.Text.text_embedding.
defmodule App.TextEmbedding do
def serving do
embedding_model = "sentence-transformers/paraphrase-MiniLM-L6-v2"
# model = "intfloat/e5-large"
{:ok, model_info} = Bumblebee.load_model({:hf, embedding_model})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, embedding_model})
Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer)
end
end
#Application.ex
[...]
children = [
...
{Nx.Serving,
serving: App.TextEmbedding.serving(),
name: TextEmbedding,
batch_size: 8,
batch_timeout: 100
},
]
...
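With the serving registered under the name TextEmbedding, any process can then request an embedding; a minimal usage sketch (the caption string is just an example):

%{embedding: embedding} =
  Nx.Serving.batched_run(TextEmbedding, "a boat is docked in the water near a tree")

Nx.shape(embedding)
#=> {384}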
SQLITE_VSS setup:
config :app, App.Repo,
database: "./priv/sqlite.db",
pool_size: 5,
show_sensitive_data_on_connection_error: true,
load_extensions: [SqliteVss.loadable_path_vector0(), SqliteVss.loadable_path_vss0()]
Not sure yet how to run ANN along with (probably) pre-filtered data (the images belonging to a given user).
EDITED: The sqlite_vss extension is a bit obscure. I imagine you save the vector into the db and the extension can then run some distance computation.
[EDITED]
Here I tried HNSWLib, which seems close to Facebook's FAISS. It seems that you get a kind of database in the form of an index. However, the documentation is very thin (sic).
My interpretation (!)
It seems you need to run a GenServer to get the "serving" and instantiate the index.
I use the model "sentence-transformers/paraphrase-MiniLM-L6-v2" based on the BertModel, already implemented in Bumblebee. Its dimension is 384.
Why a transformer? Because it carries context, so closely related sentences should be close for some distance! A short and very clear intro.
As for the distance, choosing between the ("traditional") Euclidean distance L2, the Manhattan distance L1 or the cosine similarity (not a norm, but the inner product normalised by the L2 norms) must be a matter of experience.
When I upload an image, I compute an embedding on the caption text and add the tensor into the index.
When I get an audio transcription, I compute its embedding, get a tensor and run a knn_query against the index with this tensor.
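In short, only two HNSWLib calls are involved; a condensed sketch, assuming serving and index are obtained as in the GenServer further down and that caption / transcription are the texts produced by the two models:

# on image upload: embed the caption and add it to the index
%{embedding: caption_emb} = Nx.Serving.run(serving, caption)
:ok = HNSWLib.Index.add_items(index, caption_emb)

# on audio query: embed the transcription and search the index
%{embedding: query_emb} = Nx.Serving.run(serving, transcription)
{:ok, labels, distances} = HNSWLib.Index.knn_query(index, query_emb, k: 1)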
Tests
I uploaded 3 pictures, one with a boat, one of a city, one with a car. I say something in the mic and it finds the images! Amazing!
input: audio transcription----------
[(app 0.1.0) lib/app_web/live/page_live.ex:216: AppWeb.PageLive.handle_info/2]
text #=> " A vehicle in a parking"
^^^
result #=> %App.Image{
__meta__: #Ecto.Schema.Metadata<:loaded, "images">,
id: 3,
url: "https://s3.eu-west-3.amazonaws.com/dwyl-imgup/test.webp",
caption: "a silver car parked on the side of a road",
^^^
width: 960,
height: 656,
embedding: [-0.1475222408771515, -0.04520482197403908, 0.05124518647789955, ...],
user_id: nil,
user: #Ecto.Association.NotLoaded<association :user is not loaded>,
}
input: audio transcription----------
[(app 0.1.0) lib/app_web/live/page_live.ex:212: AppWeb.PageLive.handle_info/2]
text #=> " A boat next to the beach.
^^^
result #=> %App.Image{
__meta__: #Ecto.Schema.Metadata<:loaded, "images">,
id: 1,
url: "https://s3.eu-west-3.amazonaws.com/dwyl-imgup/test.webp",
caption: "a boat is docked in the water near a tree",
^^^
width: 578,
height: 551,
embedding: [-0.11860480159521103, 0.07705779373645782, ...],
user: #Ecto.Association.NotLoaded<association :user is not loaded>,
}
Amazing, with just a few lines of code!
The result from knn_query has the following form: a label, which is the index position of the result (there can be several results if you set k > 1), and the corresponding distance.
{:ok, label, distance} =
{:ok,
#Nx.Tensor<
u64[1][1]
EXLA.Backend<host:0, 0.2015035495.2514354192.41223>
[
[2]
^^^index
]
>,
#Nx.Tensor<
f32[1][1]
EXLA.Backend<host:0, 0.2015035495.2514354192.41224>
[
[0.18454807996749878]
]
>}
Code
Launch the embedding model and instantiate the index: read the existing index (saved into a file) or create a new one.
#Application
children = [..., App.TextEmbedding, ...]
defmodule App.TextEmbedding do
use GenServer
# use ENV VAR for the indexes file rather than hardcoded
def start_link(_) do
GenServer.start_link(__MODULE__, {}, name: __MODULE__)
end
def init(_) do
# use existing index file or create a new one
{:ok, index} =
case File.ls!(Path.expand("priv")) |> Enum.member?("indexes.bin") do
false ->
HNSWLib.Index.new(_space = :l2, _dim = 384, _max_elements = 200)
true ->
HNSWLib.Index.load_index(:l2, 384, Path.expand("priv/indexes.bin"))
end
{:ok, {nil, nil, index}, {:continue, :load}}
end
# async load the model to allow other processes to continue
def handle_continue(:load, {_,_,index}) do
{:ok, %{model: _model, params: _params} = model_info} =
Bumblebee.load_model({:hf, "sentence-transformers/paraphrase-MiniLM-L6-v2"})
{:ok, tokenizer} =
Bumblebee.load_tokenizer({:hf, "sentence-transformers/paraphrase-MiniLM-L6-v2"})
{:noreply, {model_info, tokenizer, index}}
end
# called in the Liveview "mount" callback
def serve() do
GenServer.call(__MODULE__, :serve)
end
def handle_call(:serve, _from, {model_info, tokenizer, index} = state) do
serving = Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer)
{:reply, {serving, index}, state}
end
end
#page_live.ex
def mount(_params, _session, socket) do
{serving, index} = App.TextEmbedding.serve()
{:ok, assign(socket, serving: serving, index: index, ...)}
# def handle_info({ref, result}, %{assigns: assigns} = socket) do
cond do
# return from Image-To-Text, compute embedding and then Upload into a bucket Task
Map.get(assigns, :I2T_task_ref) == ref ->
%{results: [%{text: label}]} = result
# compute the embedding of the caption
%{embedding: data} = Nx.Serving.run(serving, label)
# add it to the index
HNSWLib.Index.add_items(index, data)
# since the image URL and caption are saved in the db, we also save the updated index into a file
:ok = HNSWLib.Index.save_index(index, Path.expand("priv/indexes.bin"))
# test check
#HNSWLib.Index.get_current_count(index)
# update database
db_img =
case assigns.db_img do
nil ->
App.Image.insert!(%App.Image{}, %{embedding: Nx.to_flat_list(data)})
img ->
App.Image.update!(img, %{embedding: data})
end
...
...
# returned audio transcription: compute its embedding and run an ANN query against the index
is_map(result) && Map.get(result, :chunks) != nil ->
Path.expand("priv/static/image_uploads/tmp.wav")
|> File.rm!()
%{chunks: [%{text: text}]} = result
%{embedding: data} = Nx.Serving.run(serving, text)
with %App.Image{} = result <- handle_knn(data, index) do
{:noreply,
assign(socket,
transcription: String.trim(text),
micro_off: false,
search_result: result
)}
else
...
I literally copied the example from the GitHub repo:
def handle_knn(data, index) do
case HNSWLib.Index.get_current_count(index) do
{:ok, 0} ->
{:error, "no entries in index"}
{:ok, _c} ->
case HNSWLib.Index.knn_query(index, data, k: 1) do
{:ok, label, _distance} ->
{:ok, binary_embedding} =
label[0]
|> Nx.to_flat_list()
|> then(&HNSWLib.Index.get_items(index, &1))
binary_embedding
|> Enum.map(&Nx.from_binary(&1, :f32))
|> Nx.stack()
|> Nx.to_flat_list()
|> then(&App.Repo.get_by!(App.Image, %{embedding: &1}))
{:error, msg} ->
{:error, msg}
end
end
end
From zero to hero! So here is the branch where speech-and-find works: https://github.com/ndrean/image-classifier/tree/upload-save-db
It is based on 2 models: "openai/whisper-tiny" for the Speech-To-Text transcription, and the transformer model "sentence-transformers/paraphrase-MiniLM-L6-v2" (based on the BertModel) to compute the embeddings from text. It finally uses HNSWLib to run the ANN algorithm.
The uploads are saved in a bucket and the server gets back the URL. Note that to run ML to get a caption on an image, you need the binary so the upload is done server-side. On each upload, we compute the embedding and save it in the database.
I changed from SQLite to Postgres, firstly because you can use arrays of floats with Postgres whilst you need to serialise them with SQLite. The Full-Text-Search has to be rewritten, but this is not the important point here. FTS is more an illustration of the old "primitive, non-semantic" search (but it can still be relevant, so it's good to know how to use it).
The ANN sorts by distance. However, not all responses are semantically meaningful. Say you want the 3 closest embeddings: the algorithm will find them, even if the answers are "far". The results should be improved with some cut-off distance, given some kind of normalisation, probably using the :cosine similarity. To be continued with more tests.
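One possible way to apply such a cut-off, assuming a :cosine index (so the returned distance is 1 - cosine similarity, ranging from 0 to 2) and an arbitrary threshold to be tuned:

# keep only the neighbours whose distance is below a (tunable) threshold
def filter_by_distance(labels, distances, threshold \\ 0.4) do
  Nx.to_flat_list(labels[0])
  |> Enum.zip(Nx.to_flat_list(distances[0]))
  |> Enum.filter(fn {_label, dist} -> dist <= threshold end)
end

# usage
{:ok, labels, distances} = HNSWLib.Index.knn_query(index, query_emb, k: 3)
filter_by_distance(labels, distances)
#=> e.g. [{2, 0.18}, {0, 0.35}]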
Such awesome work and a great read, as always @ndrean! Found myself reverting to the link you sent a few weeks back (https://www.youtube.com/watch?v=QdDoFfkVkcw) to freshen up my embedding knowledge.
I was implementing a simple DB with SQLite just to learn (I hadn't had the chance to use it), but because of this, I might as well change it to Postgres just so I can save embeddings and make a vector database out of it.
I've always had some trouble knowing what to do after I have these embeddings - how to do a similarity search. Thanks for showcasing three alternatives for achieving this!
If you want to push a PR, feel free to do so. Otherwise, I'll check the code and do it whilst crediting you, because this is awesome to have in this small project!
Posting the following links, which seem to detail the three methods @ndrean mentioned.
From what I've seen, because you're using sentence-transformers/all-MiniLM-L6-v2, which was trained using cosine similarity, using cosine similarity for the index will produce the most accurate results.
According to the Pinecone link:
Remember the general principle we mentioned in the introduction of this post when selecting the similarity metric for your index: use the same similarity metric that was used to train your embedding model. Otherwise, you should experiment with various other similarity metrics to see if you can produce even better results (for instance, if you don't know what similarity metric was used in the embedding model or if the method the vectors were created with does not have such a metric in the generation process).
Ah thanks @LuchoTurtle! sentence-transformers/all-MiniLM-L6-v2 is trained using cosine. OK! I did not find this.
Just change the definition in the GenServer:
HNSWLib.Index.new(_space = :l2, _dim = 384, _max_elements = 200)
to
HNSWLib.Index.new(_space = :cosine, _dim = 384, _max_elements = 200)
This makes sense. We are looking for "similar to the caption" and these are short and descriptive. When you ask for an image, your question is short and descriptive (normally!). So captions and questions live in a similar world.
Notice I didn't produce any original code but rather completed the puzzle. So:
One answer is a specialised vector database. But I didn't want to add a side container running Elasticsearch, for example. Maybe a vector extension of a database with search capabilities? pgvector for Postgres, and sqlite_vss for SQLite. I don't have enough knowledge.
So an external algorithm? Elixir/Nx has an "alpha" response with HNSWLib. It is easy to understand how it works. You add every new embedding to an "index database" and then knn_query any input (as an embedding) against this index. Easy enough for me to understand.
So quite a few moving parts, definitely needs lots of technical knowledge to dive into this world.
Find 3 closest neighbours:
# "index" is the "index database"
# "data" is the input embedding
HNSWLib.Index.knn_query(index, data, k: 3)
^^^
Why do I save the embeddings in the database? Because the knn_query response is a set of (quasi) embeddings, so I have to query the DB to find the images via their embeddings.
One detail: when you upload an image, you need to give it a unique name. I thought that this name could be the SHA1, but I believe you may prefer another solution.
Lastly, if you guys are interested in this feature, I should rewrite a new branch where I inject the 3 main ingredients: upload, speech-to-text, and the semantic search. Going step by step should be easier than you digging into my code.
Just an addition I forgot, to save you time: don't save the embedding, but save the current index count once you decide to add the embedding to the Index, since HNSWLib builds it incrementally. You then only need to query for the matching index with the index returned by the search. Furthermore, cosine does not work on raw embeddings but on normalised embeddings, as I discovered. So my first idea of saving the embedding and using it for the look-up was a bad one :), but when I gave cosine a try, I realised that working directly with indexes instead would save space, just work, and be easier. My only concern is concurrent access to this index file if several users upload images at the same time (a race condition?).
What I am saying might seem like total rubbish, but once you read the HNSWLib README (and there is nothing more to read) and think about it, you will understand immediately.
I thought it could be a nice add-on to (try to) visualise the process I came up with (I will write something on dev.to so I don't forget, and cite you as the proper implementation if you agree).
You have images saved in the cloud. You want to find some special ones. How?
You can record an audio that describes an image to find some corresponding images among the ones you uploaded.
How to proceed?
1) First, you upload images into a bucket. You get back a URL. You analyse the image with a model to produce a description of it. This is a "captioning" process, or Image-To-Text. You save the URL and the caption (a short descriptive text) into a DB.
2) You record an audio clip of a speech. You transcribe this audio into text with a model; this is Speech-To-Text.
3) For the semantic search, you want to find the images whose caption approximates your audio. To do this, you use embeddings: they map a text into a well-thought-out vector space, and then you use an approximation algorithm.
For this, you can use HNSWLib. You incrementally build an Index struct from your captions, and then run a knn_query on this index with the audio transcription as the input.
I used the model "Salesforce/blip-image-captioning-base" with Bumblebee.Vision.image_to_text. This produces a text caption that describes the image.
I used the model "openai/whisper-small" and Bumblebee.Audio.speech_to_text_whisper. To capture the audio, I used the MediaRecorder API. This produces a text that is a transcription of the audio.
I used the transformer "sentence-transformers/paraphrase-MiniLM-L6-v2" and Bumblebee.Text.TextEmbedding.text_embedding. You compute the embeddings for each image caption.
I used the Elixir binding for HNSWLib. All the caption embeddings are used to build the HNSWLib.Index.
You also compute the embedding of the audio transcription and use it as the input of HNSWLib.Index.knn_query to find the closest neighbour(s) to the audio transcription embedding among the set of caption embeddings.
This process is dependent on the metric used. It returns the position(s) (indices) among the Index struct entries. This is where you need to have saved either the index or the embedding to look up the corresponding image(s).
As a reference, I added some comments on the HNSWLib library. My intention is not to be pedantic, but to clarify, firstly, when you can use a look-up by index and/or by embedding, and secondly, why you need to normalise for cosine in the second case.
The Index is instantiated as in the GenServer above.
Elixir binding for the hnswlib library.
Currently in development, alpha software.
# working in L2-space
# other possible values are
# `:ip` (inner product)
# `:cosine` (cosine similarity)
iex> space = :l2
:l2
# each vector is a 2D-vec
iex> dim = 2
2
# limit the maximum elements to 200
iex> max_elements = 200
200
# create Index
iex> {:ok, index} = HNSWLib.Index.new(space, dim, max_elements)
{:ok,
%HNSWLib.Index{
space: :l2,
dim: 2,
reference: #Reference<0.2548668725.3381002243.154990>
}}
The Index struct is built incrementally. Let's add 5 vectors:
iex> data =
Nx.tensor(
[
[42, 42],
[43, 43],
[0, 0],
[200, 200],
[200, 220]
],
type: :f32
)
#Nx.Tensor<
f32[5][2]
[
[42.0, 42.0],
[43.0, 43.0],
[0.0, 0.0],
[200.0, 200.0],
[200.0, 220.0]
]
>
# we get the number of elements in the Index struct
iex> HNSWLib.Index.get_current_count(index)
{:ok, 0}
# we add the 5 previous elements to the index struct
iex> HNSWLib.Index.add_items(index, data)
:ok
# we check the current count of the index struct
iex> HNSWLib.Index.get_current_count(index)
{:ok, 5}
knn_query returns a tuple {:ok, labels, distances}. The "labels" tensor contains the indices in the Index struct of the k closest elements. The "distances" tensor contains the corresponding distances between these closest elements and the query input.
# query
iex> query = Nx.tensor([1, 2], type: :f32)
#Nx.Tensor<
f32[2]
[1.0, 2.0]
>
iex> {:ok, labels, dists} = HNSWLib.Index.knn_query(index, query)
{:ok,
#Nx.Tensor<
u64[1][1]
[
[2]
]
>,
#Nx.Tensor<
f32[1][1]
[
[5.0]
]
>}
# we look for the 3 closest neighbours
iex> {:ok, labels, dists} = HNSWLib.Index.knn_query(index, query, k: 3)
{:ok,
#Nx.Tensor<
u64[1][3]
[
[2, 0, 1]
]
>,
#Nx.Tensor<
f32[1][3]
[
[5.0, 3281.0, 3445.0]
]
>}
iex> HNSWLib.Index.save_index(index, "my_index.bin")
:ok
iex> {:ok, saved_index} = HNSWLib.Index.load_index(space, dim, "my_index.bin")
{:ok,
%HNSWLib.Index{
space: :l2,
dim: 2,
reference: #Reference<0.2105700569.2629697564.236704>
}}
iex> HNSWLib.Index.get_current_count(saved_index)
{:ok, 5}
iex> {:ok, data} = HNSWLib.Index.get_items(saved_index, [2, 0, 1])
{:ok,
[
<<0, 0, 0, 0, 0, 0, 0, 0>>,
<<0, 0, 40, 66, 0, 0, 40, 66>>,
<<0, 0, 44, 66, 0, 0, 44, 66>>
]}
iex> tensors = Nx.stack(Enum.map(data, fn d -> Nx.from_binary(d, :f32) end))
#Nx.Tensor<
f32[3][2]
[
[0.0, 0.0],
[42.0, 42.0],
[43.0, 43.0]
]
>
knn_query
If you add a single vector to the Index struct at a time, you can save the current index count. With the response of knn_query, you can then look up the closest elements.
:ok = HNSWLib.Index.add_items(index, normed_data)
{:ok, idx} = HNSWLib.Index.get_current_count(index)
App.Repo.insert(%Data{}, idx: idx)
:ok = HNSWLib.Index.save_index(index, saved_index)
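The corresponding look-up after a query could then be (still a sketch with the hypothetical Data schema above; normed_query is the normalised query embedding, and the +1 accounts for get_current_count being read after the insertion while labels start at 0):

{:ok, labels, _dists} = HNSWLib.Index.knn_query(index, normed_query, k: 1)

labels[0]
|> Nx.to_flat_list()
|> hd()
# labels are zero-based, `idx` above was the count right after insertion
|> Kernel.+(1)
|> then(&App.Repo.get_by(Data, idx: &1))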
If you add multiple vectors to the Index struct at once, you won't get each individual index, so you can only look up with the embeddings. Then use get_items as above and transform the binaries into embeddings for your look-up.
When you endow the vector space with the cosine distance, you need to normalise the embeddings (see the note below).
Suppose you have an Nx.Serving that runs a model. You get embeddings, save them and build your Index struct with them.
# you get an embedding
%{embedding: data} =
Nx.Serving.run(my_serving(), input)
# you normalise the embedding
data =
Nx.divide(data, Nx.LinAlg.norm(data))
# you save the embedding for the look up
{:ok, _input} =
App.Input.save(%Input{}, %{embedding: data})
# you build your Index struct
:ok = HNSWLib.Index.add_items(index, data)
HNSWLib.Index.save_index(index, "my_index.bin")
You run a knn_query with a given embedding (from the same serving) for the closest element.
# you normalise your query data
%{embedding: query_data} =
Nx.Serving.run(my_serving(), input_embedding)
query_data =
Nx.divide(query_data, Nx.LinAlg.norm(query_data))
{:ok, labels, _d} =
HNSWLib.Index.knn_query(index, query_data, k: 1)
{:ok, data} = HNSWLib.Index.get_items(index, Nx.to_flat_list(labels[0]))
t =
data
|> Enum.map(fn d -> Nx.from_binary(d, :f32) end)
|> Nx.stack()
App.Input.get_by(%{embedding: t})
If available in Hex, the package can be installed by adding hnswlib to your list of dependencies in mix.exs:
def deps do
[
{:hnswlib, "~> 0.1.0"}
]
end
Documentation can be generated with ExDoc and published on HexDocs. Once published, the docs can be found at https://hexdocs.pm/hnswlib.
A vector space of embeddings can be equipped with a (Euclidean) inner product. If $u=(u_1,\dots,u_n)$ and $v=(v_1,\dots,v_n)$ are two embeddings with their coordinates, the (Euclidean) inner product is defined as:
$< u,v >=u_1v_1+\cdots+u_nv_n$
This inner product induces a Euclidean norm:
$||u|| = \sqrt{< u,u >} = \sqrt{u_1^2+\cdots+u_n^2}$
Let $u_v$ be the perpendicular projection of $u$ on $v$. Then:
$< u, v > = < u_v,v > = ||u||\cdot ||v|| \cos\widehat{u,v}$
The value below is known as the cosine similarity.
$< \frac{u}{||u||},\frac{v}{||v||} > = \cos\widehat{u,v}$.
Remark that the norm of the new embedding $\frac1{||u||}u$ is 1. We say that the embedding is $L_2$-normalised.
The previous formula shows that the inner product of normalised (aka unit) embeddings is the cosine
of the angle between these "normalised" embeddings.
Source: https://en.wikipedia.org/wiki/Cosine_similarity
Note that this is not a distance.
The norm in turn induces a distance: $d(u,v) = ||u-v||$
By definition,
$||u-v||^2 = < u-v,u-v >$.
so by expanding, we obtain:
$||u-v||^2 = ||u||^2+||v||^2-2< u,v >$
Consider now two normalised vectors. We have: $\frac12||u-v||^2=1-\cos\widehat{u,v} = d_c(u,v)$
This is commonly known as the "cosine distance" when the embeddings are normalised. It ranges from 0 to 2. Note that it is not a true distance metric.
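To make this concrete, a small Nx check of these quantities (the two vectors are arbitrary):

u = Nx.tensor([1.0, 2.0])
v = Nx.tensor([2.0, 1.0])

# cosine similarity: < u,v > / (||u|| * ||v||)
cos_sim = Nx.divide(Nx.dot(u, v), Nx.multiply(Nx.LinAlg.norm(u), Nx.LinAlg.norm(v)))
#=> ~0.8

# the corresponding cosine distance, as used by the :cosine space
Nx.subtract(1, cos_sim)
#=> ~0.2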
Finally, note that since we are dealing with finite-dimensional vector spaces, all the norms are equivalent (in a precise mathematical sense). This means that the limit points are always the same. However, the values of the distances can be quite different, and a clustering process can give significantly different results. The first hint for choosing a norm is to take the one used to train the model.
Closing with the merging of #47
Once you implement a db that saves the URL and caption, you can easily add a full-text search to render all the images that match a text input. For simplicity, use SQLite.
You can go a step further with ML models: speech-to-text, embeddings and semantic search. You transcribe an audio input into text, and use ML (semantic search) to render the matching images.