dwyl / image-classifier

πŸ–ΌοΈ Classify images and extract data from or describe their contents using machine learning
GNU General Public License v2.0

feat: Audio-to-text image caption search using embedding models [showcase] #18

Closed ndrean closed 8 months ago

ndrean commented 1 year ago

Once you implement a db that saves the URL and caption, you can easily add a full-text search to render all the images whose caption matches a text input. For simplicity, use SQLite. πŸ”

You can go a step further with ML models: speech-to-text, embeddings and semantic search. You transcribe an audio input into text, and use ML (Semantic search) to render matching images. πŸ—£οΈ

ndrean commented 1 year ago

About this: I want to query all my pictures on a given subject, using an audio recording as the query.

0) You upload a picture to a bucket and save the URL and the caption (Image-To-Text).

1) You record an audio clip and transcribe it into text. This is "more or less" easy (the JavaScript is not straightforward), but it works well.

2) I want to find images whose caption (saved in the database) is "closely related" to the audio (query text transcript).

You can do:

ndrean commented 1 year ago

A bit of reading: https://www.elastic.co/what-is/vector-search

ndrean commented 1 year ago

The "easy" part: "speech-to-text". It uses de native browser MediaStream Record API and returns a WAV file (in mono audoi/pcm format) to feed a Bumblebee.Audio.speech_to_text_whisper module that load the model "openai/whisper-small".

You allow_upload(:speech, accept: :any, auto_upload: true, progress: &handle_progress/3).
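For context, a minimal sketch of the LiveView mount that declares this upload (the assign names are the ones used further below; max_entries and the initial assign values are assumptions):

# page_live.ex (sketch)
def mount(_params, _session, socket) do
  {:ok,
   socket
   |> assign(micro_off: false, speech_spin: false, transcription: nil)
   |> allow_upload(:speech,
     accept: :any,
     auto_upload: true,
     progress: &handle_progress/3,
     max_entries: 1
   )}
end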

The Javascript hook below is attached to a button to toggle on/off audio recording. It uses the MediaRecorder API.

A simple HTML like this:

 <audio id="audio" controls></audio>
 <button
        id="record"
        class="bg-blue-500 hover:bg-blue-700 text-white font-bold px-4 rounded"
        type="button"
        phx-hook="Audio"
        disabled={@micro_off}
      >
      <Heroicons.microphone
        outline
        class="w-6 h-6 text-white font-bold group-active:animate-pulse"
      />
      <span>Record</span>
 </button>

The Audio "hook":

export default {
  mounted() {
    let mediaRecorder;
    let audioChunks = [];
    const recordButton = document.getElementById("record");
    const audioElement = document.getElementById("audio");

    const _this = this;

    recordButton.addEventListener("click", () => {
      if (mediaRecorder && mediaRecorder.state === "recording") {
        mediaRecorder.stop();
        recordButton.textContent = "Record";
      } else {
        navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
          mediaRecorder = new MediaRecorder(stream);
          mediaRecorder.start();
          recordButton.classList.remove("bg-blue-500", "hover:bg-blue-700");
          recordButton.classList.add(
            "bg-green-500",
            "hover:bg-green-700",
            "animate-pulse"
          );
          recordButton.textContent = "Stop";

          mediaRecorder.addEventListener("dataavailable", (event) => {
            audioChunks.push(event.data);
          });

          mediaRecorder.addEventListener("stop", () => {
            const audioBlob = new Blob(audioChunks);
            console.log(audioBlob);
            audioElement.src = URL.createObjectURL(audioBlob);

            _this.upload("speech", [audioBlob]);
            audioChunks = [];
            recordButton.classList.remove(
              "bg-green-500",
              "hover:bg-green-700",
              "animate-pulse"
            );
            recordButton.classList.add("bg-blue-500", "hover:bg-blue-700");
          });
        });
      }
    });
  },
};

The Elixir code is copied from the documentation:

def handle_progress(:speech, entry, socket) when entry.done? do
    socket
    |> consume_uploaded_entry(entry, fn %{path: path} ->
      :ok = File.cp!(path, @tmp_wav)
      {:ok, @tmp_wav}
    end)

    audio_task =
      Task.Supervisor.async(
        App.TaskSupervisor,
        fn ->
          Nx.Serving.batched_run(Whisper, {:file, @tmp_wav})
        end
      )

    {:noreply, assign(socket, audio_ref: audio_task.ref, micro_off: true, speech_spin: true)}
 end

# intermediate progress: a catch-all clause is mandatory for any `allow_upload` (e.g. :speech, :image_list);
# it also lets you avoid an `if entry.done?` switch inside the clause above
def handle_progress(_name, _entry, socket), do: {:noreply, socket}
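Note that @tmp_wav above is a module attribute; I assume it is defined at the top of the LiveView, along the lines of:

# sketch: the temporary WAV path used by handle_progress/3 and handle_info/2
@tmp_wav Path.expand("priv/static/image_uploads/tmp.wav")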

The Bumblebee model runner:

defmodule App.Whisper do
  def serving do
    {:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-small"})
    {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-small"})
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-small"})
    {:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-small"})

    Bumblebee.Audio.speech_to_text_whisper(
      whisper,
      featurizer,
      tokenizer,
      generation_config,
      chunk_num_seconds: 30,
      task: :transcribe,
      # stream: true,
      defn_options: [compiler: EXLA]
    )
  end
end
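For Nx.Serving.batched_run(Whisper, ...) above to work, this serving has to be started under the Whisper name in the supervision tree. A sketch of the wiring (the batch options are assumptions, mirroring the text-embedding serving further below):

#Application.ex (sketch)
children = [
  ...,
  {Nx.Serving,
   serving: App.Whisper.serving(),
   name: Whisper,
   batch_size: 4,
   batch_timeout: 100}
]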

It remains to write a handle_info({ref, %{chunks: [%{text: text}]}}, socket) callback to use the result of the Task.

def handle_info({ref, result}, %{assigns: assigns} = socket) do
    Process.demonitor(ref, [:flush])

    cond do
      #other tasks (Image-To-Text, example-list-task...)
      [...]
      #Task speech-to-text
      is_map(result) && Map.get(result, :chunks) != nil ->
        Path.expand("priv/static/image_uploads/tmp.wav")
        |> File.rm!()

        %{chunks: [%{text: text}]} = result
        # TODO: Task "create embedding from this text "result", and "run ANN search"
        {:noreply, assign(socket, transcription: String.trim(text), micro_off: false)}
   end
end
ndrean commented 12 months ago

Super slow progress, I should spend more time on it. I upload every image to Cloudflare R2 and save the returned URL, width, height and "Image-To-Text" caption into an SQLite db. The FTS uses the embedded FTS5 virtual table. For vector search, add the sqlite_vss dependency.

1) Full-Text-Search with a simple text input. It allows several keywords separated by whitespace.

defmodule App.Repo do
  use Ecto.Repo,
    otp_app: :app,
    adapter: Ecto.Adapters.SQLite3
end

defmodule App.Image do
  use Ecto.Schema
  import Ecto.Query
  alias App.{Image, Repo, User}

  @primary_key {:id, :id, autogenerate: true, source: :rowid}
  schema "images" do
    field(:url, :string)
    field(:caption, :string)
    field(:width, :integer)
    field(:height, :integer)
    field(:rank, :float, virtual: true)

    belongs_to(:user, User)

    timestamps()
  end

  def changeset(%{} = params) when is_map(params) do
    Ecto.Changeset.cast(%Image{}, params, [:url, :caption, :width, :height])
  end

  def insert(%{} = params) when is_map(params) do
    changeset(params)
    |> Repo.insert!()
  end

  def search(q) do
    query = String.replace(q, " ", " OR ")

    from(img in Image,
      select: [:url, :rank, :id, :caption],
      where: fragment("images MATCH ?", ^query),
      order_by: [asc: :rank]
    )
    |> Repo.all()
  end
end
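The `images MATCH ?` fragment above only works because "images" is itself an FTS5 virtual table (hence the source: :rowid primary key). A hedged migration sketch, assuming a plain FTS5 table with the non-searchable columns marked UNINDEXED:

defmodule App.Repo.Migrations.CreateImagesFts do
  use Ecto.Migration

  def up do
    # FTS5 virtual table: the implicit rowid backs the :id primary key,
    # only url and caption take part in the full-text match
    execute("""
    CREATE VIRTUAL TABLE images USING fts5(
      url,
      caption,
      width UNINDEXED,
      height UNINDEXED,
      user_id UNINDEXED,
      inserted_at UNINDEXED,
      updated_at UNINDEXED
    )
    """)
  end

  def down do
    execute("DROP TABLE images")
  end
end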

#page_live.ex
def handle_event("search", %{"query" => ""}, socket) do
    {:noreply, assign(socket, message: "")}
end

def handle_event("search", %{"query" => q}, socket) do
    l = App.Image.search(q)
    message = "#{length(l)} images found for '#{q}'"
    # just return this info without displaying all the images via their URL
    {:noreply,
     assign(socket,
       message: message,
       fts_images: Enum.map(l, & &1.url),
       search_form: to_form(%{"query" => ""}, id: "query")
     )}
end
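As a quick illustration of App.Image.search/1 above: a space-separated input is expanded into an FTS5 OR query before matching:

iex> String.replace("boat water", " ", " OR ")
"boat OR water"
# the query then runs `WHERE images MATCH 'boat OR water'`, ordered by the FTS5 rank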
ndrean commented 12 months ago

2) Using Semantic search aka vector search.

We will use embeddings and ANN. This is based on SentenceTransformers.

2.1) Produce an embedding from the I2T caption

2.2) Produce an embedding from the S2T (audio transcription)

with Bumblebee.Text.text_embedding.

defmodule App.TextEmbedding do
  def serving do
    embedding_model = "sentence-transformers/paraphrase-MiniLM-L6-v2"
    # model = "intfloat/e5-large"
    {:ok, model_info} = Bumblebee.load_model({:hf, embedding_model})
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf,  embedding_model})

    Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer)
  end
end

#Application.ex
[...]
children = [
  ...
  {Nx.Serving,
       serving: App.TextEmbedding.serving(),
       name: TextEmbedding,
       batch_size: 8,
       batch_timeout: 100
  },
]
...
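Once started under the TextEmbedding name, any process can compute an embedding via Nx.Serving.batched_run/2. A small usage sketch (the sentence is just an example; the 384 dimension comes from the MiniLM model):

%{embedding: embedding} =
  Nx.Serving.batched_run(TextEmbedding, "a boat is docked in the water near a tree")

Nx.shape(embedding)
#=> {384}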

SQLITE_VSS set up:

config :app, App.Repo,
  database: "./priv/sqlite.db",
  pool_size: 5,
  show_sensitive_data_on_connection_error: true,
  load_extensions: [SqliteVss.loadable_path_vector0(), SqliteVss.loadable_path_vss0()]

Not sure yet how to run ANN along with (probably) pre-filtered data (the images belonging to a given user). EDITED: sqlite_vss is a bit obscure. I imagine you save the vector into the db and the extension computes some distance for you.

ndrean commented 12 months ago

[EDITED] πŸš€πŸŽ‰πŸ₯΅

Here I tried HNSWLib, which seems close to Facebook's FAISS: you get a kind of database with an index. However, the documentation is very thin (sic). My interpretation (!): you need to run a GenServer to get the "serving" and instantiate the index. I use the model "sentence-transformers/paraphrase-MiniLM-L6-v2", based on BertModel and already implemented in Bumblebee. Its dimension is 384. Why a transformer? Because it carries context, so closely related sentences should end up close for some distance! A short and very clear intro. As for the distance, it must be a matter of experience which one to use between the ("traditional") Euclidean distance L2, the Manhattan distance L1, and the cosine similarity (not a norm, but the inner product normalised by the L2 norms).

When I upload an image, I compute an embedding of the caption text and add the tensor to the index. When I make an audio transcription, I compute its embedding and run a knn_query against the index with this tensor.

Tests

I uploaded 3 pictures: one with a boat, one of a city, one with a car. I say something into the mic and it finds the right image. Amazing!

input: audio transcription----------
[(app 0.1.0) lib/app_web/live/page_live.ex:216: AppWeb.PageLive.handle_info/2]
text #=> " A vehicle in a parking"
                   ^^^

result #=> %App.Image{
  __meta__: #Ecto.Schema.Metadata<:loaded, "images">,
  id: 3,
  url: "https://s3.eu-west-3.amazonaws.com/dwyl-imgup/test.webp",
  caption: "a silver car parked on the side of a road",
                   ^^^
  width: 960,
  height: 656,
  embedding: [-0.1475222408771515, -0.04520482197403908, 0.05124518647789955, ...],
  user_id: nil,
  user: #Ecto.Association.NotLoaded<association :user is not loaded>,
}

πŸš€

input: audio transcription----------
[(app 0.1.0) lib/app_web/live/page_live.ex:212: AppWeb.PageLive.handle_info/2]
text #=> " A boat next to the beach."
                   ^^^

result #=> %App.Image{
  __meta__: #Ecto.Schema.Metadata<:loaded, "images">,
  id: 1,
  url: "https://s3.eu-west-3.amazonaws.com/dwyl-imgup/test.webp",
  caption: "a boat is docked in the water near a tree",
                  ^^^
  width: 578,
  height: 551,
  embedding: [-0.11860480159521103, 0.07705779373645782, ...],
  user: #Ecto.Association.NotLoaded<association :user is not loaded>,
}

πŸ‘Œ

Amazing, with just a few lines of code!

The result from knn_query has the following form: a label, which is the index position of the result (there can be several results if you set k > 1), and a distance.

{:ok, label, distance} = 
  {:ok,
   #Nx.Tensor<
     u64[1][1]
     EXLA.Backend<host:0, 0.2015035495.2514354192.41223>
     [
       [2]
       ^^^index
     ]
   >,
   #Nx.Tensor<
     f32[1][1]
     EXLA.Backend<host:0, 0.2015035495.2514354192.41224>
     [
       [0.18454807996749878]
     ]
   >}
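To turn the "label" tensor into plain positions, you can simply flatten it, e.g.:

# sketch: extract the integer position(s) from the label tensor
Nx.to_flat_list(label)
#=> [2]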

Code

Launch the embedding model and instantiate the index: read the existing index (saved to a file) or create a new one.

#Application
   children = [..., App.TextEmbedding, ...]

defmodule App.TextEmbedding do
  use GenServer
  # use ENV VAR for the indexes file rather than hardcoded

  def start_link(_) do
    GenServer.start_link(__MODULE__, {}, name: __MODULE__)
  end

  def init(_) do
    # use existing index file or create a new one
    {:ok, index} =
      case File.ls!(Path.expand("priv")) |> Enum.member?("indexes.bin") do
        false ->
          HNSWLib.Index.new(_space = :l2, _dim = 384, _max_elements = 200)

        true ->
          HNSWLib.Index.load_index(:l2, 384, Path.expand("priv/indexes.bin"))
      end

    {:ok, {nil, nil, index}, {:continue, :load}}
  end

  # async load the model to allow other processes to continue
  def handle_continue(:load, {_,_,index}) do
     {:ok, %{model: _model, params: _params} = model_info} =
      Bumblebee.load_model({:hf, "sentence-transformers/paraphrase-MiniLM-L6-v2"})

    {:ok, tokenizer} =
      Bumblebee.load_tokenizer({:hf, "sentence-transformers/paraphrase-MiniLM-L6-v2"})

    {:noreply, {model_info, tokenizer, index}}
  end

  # called in the Liveview "mount" callback
  def serve() do
    GenServer.call(__MODULE__, :serve)
  end

  def handle_call(:serve, _from, {model_info, tokenizer, index} = state) do
    serving = Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer)
    {:reply, {serving, index}, state}
  end
end
#page_live.ex
def mount(_params, _session, socket) do
    {serving, index} = App.TextEmbedding.serve()
    {:ok, assign(socket, serving: serving, index: index, ...)}
end

# def handle_info({ref, result}, %{assigns: assigns} = socket) do
cond do
   # result from the Image-To-Text task: compute the embedding, then run the upload-to-bucket Task
   Map.get(assigns, :I2T_task_ref) == ref ->
        %{results: [%{text: label}]} = result
        # compute the embedding of the caption
        %{embedding: data} = Nx.Serving.run(serving, label)
        # add the embedding to the index
        HNSWLib.Index.add_items(index, data)

        # since the image URL and caption are saved in the db, we also save the new index to a file
        :ok = HNSWLib.Index.save_index(index, Path.expand("priv/indexes.bin"))

        # test check
        #HNSWLib.Index.get_current_count(index)

        # update database
       db_img =
          case assigns.db_img do
            nil ->
              App.Image.insert!(%App.Image{}, %{embedding: Nx.to_flat_list(data)})

            img ->
              App.Image.update!(img, %{embedding: Nx.to_flat_list(data)})
          end
        ...
        ...

   # result from the Speech-To-Text task: compute an embedding and run a knn_query against the index
   is_map(result) && Map.get(result, :chunks) != nil ->
        Path.expand("priv/static/image_uploads/tmp.wav")
        |> File.rm!()

        %{chunks: [%{text: text}]} = result
        %{embedding: data} = Nx.Serving.run(serving, text)
        with %App.Image{} = result <- handle_knn(data, index) do
          {:noreply,
           assign(socket,
             transcription: String.trim(text),
             micro_off: false,
             search_result: result
           )}
        else
        ...

I copied the example from the GitHub repo almost literally:

def handle_knn(data, index) do
    case HNSWLib.Index.get_current_count(index) do
      {:ok, 0} ->
        {:error, "no entries in index"}

      {:ok, _c} ->
        case HNSWLib.Index.knn_query(index, data, k: 1) do
          {:ok, label, _distance} ->
            {:ok, binary_embedding} =
              label[0]
              |> Nx.to_flat_list()
              |> then(&HNSWLib.Index.get_items(index, &1))

            binary_embedding
            |> Enum.map(&Nx.from_binary(&1, :f32))
            |> Nx.stack()
            |> Nx.to_flat_list()
            |> then(&App.Repo.get_by!(App.Image, %{embedding: &1}))

          {:error, msg} ->
            {:error, msg}
        end
    end
  end
ndrean commented 12 months ago

Ten minutes well worth spending to grasp what transformers do: "all about attention".


ndrean commented 12 months ago

From zero to hero! So the branch where speech-and-find works πŸ₯΅: https://github.com/ndrean/image-classifier/tree/upload-save-db

It is based on 2 models: "openai/whisper-tiny" for the speech-to-text transcription, and the transformer model "sentence-transformers/paraphrase-MiniLM-L6-v2" (based on BertModel) to compute the embeddings from text. It finally uses HNSWLib to run the ANN algorithm.

The uploads are saved in a bucket and the server gets back the URL. Note that to run ML and get a caption for an image, you need the binary, so the upload is done server-side. On each upload, we compute the embedding and save it in the database.

I changed from SQLite to Postgres, firstly because you can use arrays of floats with Postgres whereas you need to serialise them with SQLite. The Full-Text-Search has to be rewritten, but that is not the important point here. FTS is more an illustration of the old "primitive, non-semantic" search (it can still be relevant, so it's good to know how to use it).

The ANN sorts by distance. However, not all responses are semantically meaningful: if you ask for the 3 closest embeddings, the algorithm will find them even when the answers are "far". The answer should be improved with some cut-off distance after some kind of normalisation, probably using the :cosine similarity. To be continued with more tests.
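A possible shape for that cut-off, as a sketch (filter_neighbours/3 and the default threshold are hypothetical and would need tuning):

# keep only the neighbours whose distance to the query is under a cut-off
def filter_neighbours(index, query_embedding, cutoff \\ 0.5) do
  {:ok, labels, distances} = HNSWLib.Index.knn_query(index, query_embedding, k: 3)

  Enum.zip(Nx.to_flat_list(labels[0]), Nx.to_flat_list(distances[0]))
  |> Enum.filter(fn {_label, dist} -> dist < cutoff end)
end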

LuchoTurtle commented 12 months ago

Such awesome work and a great read, as always @ndrean! I found myself going back to the link you sent a few weeks back (https://www.youtube.com/watch?v=QdDoFfkVkcw) to freshen up my embedding knowledge.

I was implementing a simple DB with SQLite just to learn (I hadn't had the chance to use it), but because of this I might as well change it to Postgres just so I can save embeddings and make a vector database out of it.

I've always had some trouble figuring out what to do once I have these embeddings: how to do a similarity search. Thanks for showcasing three alternative ways to achieve this!

If you want to push a PR, feel free to do so. Otherwise, I'll check the code and do it whilst crediting you, because this is awesome to have in this small project!

LuchoTurtle commented 12 months ago

Posting the following links, which seem to detail the three methods @ndrean mentioned πŸš€

From what I've seen, because you're using sentence-transformers/all-MiniLM-L6-v2, which was trained using cosine similarity, using cosine similarity for the index will produce the most accurate results.

According to the Pinecone link:

Remember the general principle we mentioned in the introduction of this post when selecting the similarity metric for your index: use the same similarity metric that was used to train your embedding model. Otherwise, you should experiment with various other similarity metrics to see if you can produce even better results (for instance, if you don’t know what similarity metric was used in the embedding mode or if the method the vectors were created with does not have such a metric in the generation process ).

ndrean commented 12 months ago

Ah, thanks @LuchoTurtle! sentence-transformers/all-MiniLM-L6-v2 is trained using cosine. OK! I had not found this.

Just change the definition in the GenServer:

HNSWLib.Index.new(_space = :l2, _dim = 384, _max_elements = 200)

to

HNSWLib.Index.new(_space = :cosine, _dim = 384, _max_elements = 200)

This makes sense. We are looking for something "similar to the caption", and captions are short and descriptive. When you ask for an image, your question is short and descriptive too (normally!). So captions and questions live in a similar world.

Notice I didn't produce any original code but rather completed the puzzle. So:

One answer is a specialised vector database, but I didn't want to add a side container running Elasticsearch, for example. Maybe a vector extension of a database with search capabilities? pg_vector for Postgres, and sqlite_vss for SQLite. I don't have enough knowledge there. So, an external algorithm? Elixir/Nx has an "alpha" answer with HNSWLib, and it is easy to understand how it works: you add every new embedding to an "index database" and then knn_query any input (as an embedding) against this index. Easy enough for me to understand.

So there are quite a few moving parts; it definitely takes a lot of technical knowledge to dive into this world.

Find 3 closest neighbours:

# "index" is the "index database"
# "data" is the input embedding
HNSWLib.Index.knn_query(index, data, k: 3)
                                      ^^^

Why do I save the embeddings in the database? Because the knn_query response is a set of (quasi) embeddings, so I have to query the DB to find the images via their embeddings.

One detail: when you upload an image, you need to give it a unique name. I thought that this name could be the SHA1, but I believe you may prefer another solution.

Lastly, if you guys are interested in this feature, I should rewrite a new branch where I inject the 3 main ingredients: upload, speech-to-text, and the semantic search. Following that step by step should be easier than digging into my code.

ndrean commented 11 months ago

Just an addition I forgot, to save you time: don't save the embedding, but save the current index count at the moment you add the embedding to the Index, since HNSWLib builds it incrementally. You then only need to query the DB for the matching index with the index returned by the search. Furthermore, as I discovered, cosine does not work on raw embeddings but on normalised embeddings. So my first idea of saving the embedding and using it was a bad one: when I gave cosine a try, I realised that working directly with indexes saves space, is easier, and just works. My only concern is concurrent access to the index file if several users upload images at the same time (race condition?). What I am saying might sound like total rubbish, but once you read the HNSWLib README (and there is nothing more to read) and think about it, you will understand immediately. 🀯

ndrean commented 11 months ago

I thought it could be a nice add-on to (try to) visualise the process I came up with (I will write something on dev.to so I don't forget, and cite you as the proper implementation if you agree).

You have images saved in the cloud. You want to find some special ones. How?

You can record an audio clip that describes an image, to find the corresponding images among the ones you uploaded.

How to proceed?

1) First, you upload images into a bucket and get back a URL. You analyse the image with a model to produce a description of it. This is a "captioning" process, or Image-To-Text. You save the URL and the caption (a short descriptive text) into a DB.

2) You record an audio clip of your speech and transcribe it into text with a model. This is Speech-To-Text.

3) For the semantic search, you want to find the images whose caption approximates your audio. To do this, you use embeddings: they map a text into a well-chosen vector space, on which you then run an approximation algorithm. For this, you can use HNSWLib: you incrementally build an Index struct from your captions, and then run a knn_query on this index with the audio transcription as the input.

[diagram: the semantic search pipeline]

Image-To-Text

I used the model "Salesforce/blip-image-captioning-base" with Bumblebee.Vision.image_to_text. This produces a text caption that describes the image.

Speech-To-Text

I used the model "openai/whisper-small" and Bumblebee.Audio.speech_to_text_whisper. To capture the audio, I used the MediaRecorder API. This produces a text that is a transcription of the audio.

Embedding

I used the transformer "sentence-transformers/paraphrase-MiniLM-L6-v2" and Bumblebee.Text.TextEmbedding.text_embedding. You compute the embeddings for each image caption.

Semantic search

I used the Elixir binding for HNSWLib. All the caption embeddings are used to build the HNSWLib.Index. You also compute the embedding of the audio transcription and use it as the input of HNSWLib.Index.knn_query to find its closest neighbour(s) among the set of caption embeddings. This process depends on the metric used. It returns the position(s) (indices) within the Index struct. This is where you need to have saved either the index or the embedding, so you can look up the corresponding image(s).
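Putting the pieces together, a condensed sketch of the two moments (upload time and query time), assuming the servings and the index from the GenServer above are available; ImageClassifier as the captioning serving name, image_input and wav_path are placeholders:

# 1) upload time: caption the image, embed the caption, add it to the index
%{results: [%{text: caption}]} = Nx.Serving.batched_run(ImageClassifier, image_input)
%{embedding: caption_emb} = Nx.Serving.run(serving, caption)
:ok = HNSWLib.Index.add_items(index, caption_emb)
:ok = HNSWLib.Index.save_index(index, Path.expand("priv/indexes.bin"))

# 2) query time: transcribe the audio, embed the transcription, query the index
%{chunks: [%{text: transcript}]} = Nx.Serving.batched_run(Whisper, {:file, wav_path})
%{embedding: query_emb} = Nx.Serving.run(serving, transcript)
{:ok, labels, _distances} = HNSWLib.Index.knn_query(index, query_emb, k: 1)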

ndrean commented 11 months ago

As a reference, I added some comments on the HNSWLib library below. My intention is not to be pedantic, but to clarify when you can look up by index and/or by embedding, and secondly why you need to normalise for cosine in the second case.

The Index is instantiated as in the GenServer above.

HNSWLib

Elixir binding for the hnswlib library.

Currently in development, alpha software.

Usage

Create an Index struct

# working in L2-space
# other possible values are
#  `:ip` (inner product)
#  `:cosine` (cosine similarity)
iex> space = :l2
:l2
# each vector is a 2D-vec
iex> dim = 2
2
# limit the maximum elements to 200
iex> max_elements = 200
200
# create Index
iex> {:ok, index} = HNSWLib.Index.new(space, dim, max_elements)
{:ok,
 %HNSWLib.Index{
   space: :l2,
   dim: 2,
   reference: #Reference<0.2548668725.3381002243.154990>
 }}

Add vectors to the Index struct

The Index struct is built incrementally. Let's add 5 vectors:

iex> data =
  Nx.tensor(
    [
      [42, 42],
      [43, 43],
      [0, 0],
      [200, 200],
      [200, 220]
    ],
    type: :f32
  )
#Nx.Tensor<
  f32[5][2]
  [
    [42.0, 42.0],
    [43.0, 43.0],
    [0.0, 0.0],
    [200.0, 200.0],
    [200.0, 220.0]
  ]
>
# we get the number of elements in the Index struct
iex> HNSWLib.Index.get_current_count(index)
{:ok, 0}
# we add the 5 previous elements to the index struct
iex> HNSWLib.Index.add_items(index, data)
:ok
# we check the current count of the index struct
iex> HNSWLib.Index.get_current_count(index)
{:ok, 5}

Query nearest vector(s) in the index

knn_query returns a tuple {:ok, labels, distances}. The "labels" tensor contains the indices in the Index struct of the k closest elements. The "distances" tensor contains the corresponding distances between these closest elements and the query input.

# query
iex> query = Nx.tensor([1, 2], type: :f32)
#Nx.Tensor<
  f32[2]
  [1.0, 2.0]
>
iex> {:ok, labels, dists} = HNSWLib.Index.knn_query(index, query)
{:ok,
 #Nx.Tensor<
   u64[1][1]
   [
     [2]
   ]
 >,
 #Nx.Tensor<
   f32[1][1]
   [
     [5.0]
   ]
 >}

# we look for the 3 closest neighbours
iex> {:ok, labels, dists} = HNSWLib.Index.knn_query(index, query, k: 3)
{:ok,
 #Nx.Tensor<
   u64[1][3]
   [
     [2, 0, 1]
   ]
 >,
 #Nx.Tensor<
   f32[1][3]
   [
     [5.0, 3281.0, 3445.0]
   ]
 >}

Save an Index to file

iex> HNSWLib.Index.save_index(index, "my_index.bin")
:ok

Load an Index from file

iex> {:ok, saved_index} = HNSWLib.Index.load_index(space, dim, "my_index.bin")
{:ok,
 %HNSWLib.Index{
   space: :l2,
   dim: 2,
   reference: #Reference<0.2105700569.2629697564.236704>
 }}
iex> HNSWLib.Index.get_current_count(saved_index)
{:ok, 5}
iex> {:ok, data} = HNSWLib.Index.get_items(saved_index, [2, 0, 1])
{:ok,
 [
   <<0, 0, 0, 0, 0, 0, 0, 0>>,
   <<0, 0, 40, 66, 0, 0, 40, 66>>,
   <<0, 0, 44, 66, 0, 0, 44, 66>>
 ]}
iex> tensors = Nx.stack(Enum.map(data, fn d -> Nx.from_binary(d, :f32) end))
#Nx.Tensor<
  f32[3][2]
  [
    [0.0, 0.0],
    [42.0, 42.0],
    [43.0, 43.0]
  ]
>

How to use knn_query

If you add a single vector to the Index struct at a time, you can save the current index count. With the response of knn_query, you can then look up the closest elements.

:ok  = HNSWLib.Index.add_items(index, normed_data)
{:ok, idx}  =  HNSWLib.Index.get_current_count(index)
App.Repo.insert(%Data{}, idx: idx)
:ok =  HNSWLib.Index.save_index(index, saved_index)

If you add multiple vectors to the Index struct at once, you won't get each individual index, so you can only look up by embedding. In that case, use get_items as above and transform the binaries into embeddings for your look-up.

When you endow the vector space with the cosine distance, you need to normalise the embeddings (see the note below).

Example

Suppose you have an Nx.Serving that runs a model. You get embeddings, save them and build your Index struct with them.

# you get an embedding
%{embedding: data} =
    Nx.Serving.run(my_serving(), input)
# you normalise the embedding
data =
    Nx.divide(data, Nx.LinAlg.norm(data))
# you save the embedding for the look up
{:ok, _input} =
    App.Input.save(%Input{}, %{embedding: data})
# you build your Index struct
:ok = HNSWLib.Index.add_items(index, data)
HNSWLib.Index.save_index(index, "my_index.bin")

You run a knn_query with a given embedding (from the same serving) for the closest element.

# you normalise your query data
%{embedding: query_data} =
    Nx.Serving.run(my_serving(), query_input)

query_data =
    Nx.divide(query_data, Nx.LinAlg.norm(query_data))
{:ok, labels, _d} =
    HNSWLib.Index.knn_query(index, query_data, k: 1)

{:ok, data} = HNSWLib.Index.get_items(index, Nx.to_flat_list(labels[0]))

t =
    data
    |> Enum.map(fn d -> Nx.from_binary(d, :f32) end)
    |> Nx.stack()

App.Input.get_by(%{embedding: t})

Installation

If available in Hex, the package can be installed by adding hnswlib to your list of dependencies in mix.exs:

def deps do
  [
    {:hnswlib, "~> 0.1.0"}
  ]
end

Documentation can be generated with ExDoc and published on HexDocs. Once published, the docs can be found at https://hexdocs.pm/hnswlib.

Notes on vector spaces

A vector space of embeddings can be equipped with a (Euclidean) inner product. If $u=(u_1,\dots,u_n)$ and $v=(v_1,\dots,v_n)$ are two embeddings given by their coordinates, the (Euclidean) inner product is defined as:

$< u,v >=u_1v_1+\cdots+u_nv_n$

This inner product induces an euclidean norm:

$||u|| = \sqrt{< u,u >} = \sqrt{u_1^2+\cdots+u_n^2}$

Let $u_v$ be the perpendicular projection of $u$ on $v$. Then:

$< u, v > = < u_v,v > = ||u||\cdot ||v|| \cos\widehat{u,v}$

The value below is known as the cosine similarity.

$< \frac{u}{||u||},\frac{v}{||v||} > = \cos\widehat{u,v}$.

Remark that the norm of the new embedding $\frac1{||u||}u$ is 1. We say that the embedding is $L_2$-normalised.

The previous formula shows that the inner product of normalised (aka unit) embeddings is the cosine of the angle between these "normalised" embeddings.

Source: https://en.wikipedia.org/wiki/Cosine_similarity

Note that this is not a distance.

The norm in turn induces a distance: $d(u,v) = ||u-v||$

By definition, $||u-v||^2 = < u-v,u-v >$, so by expanding we obtain: $||u-v||^2 = ||u||^2+||v||^2-2< u,v >$

Consider now two normalised vectors. We have: $\frac12||u-v||^2=1-\cos\widehat{u,v} = d_c(u,v)$

This is commonly known as the "cosine distance" when the embeddings are normalised. It ranges from 0 to 2. Note that it is not a true distance metric.

Finally, note that since we are dealing with finite-dimensional vector spaces, all the norms are equivalent (in a precise mathematical sense). This means that the limit points are always the same. However, the values of the distances can be quite different, and a clustering process can give significantly different results. The first hint for which norm to choose is to take the norm used to train the embedding model.
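A quick numerical check of the identity $\frac12||u-v||^2 = 1-\cos\widehat{u,v}$ for normalised vectors, with Nx (illustrative values only):

u = Nx.tensor([3.0, 4.0]) |> then(&Nx.divide(&1, Nx.LinAlg.norm(&1)))
v = Nx.tensor([1.0, 0.0]) |> then(&Nx.divide(&1, Nx.LinAlg.norm(&1)))

cos = Nx.dot(u, v)
#=> 0.6

d = Nx.subtract(u, v)
Nx.divide(Nx.sum(Nx.multiply(d, d)), 2)
#=> 0.4, i.e. 1 - 0.6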

LuchoTurtle commented 8 months ago

Closing with the merging of #47