Documentation issues

pkpro commented 6 months ago

Describe the bug

I'm referring to the documentation at the following URL: https://epsilla-inc.gitbook.io/epsilladb/vector-database

Missing documentation on querying existing databases
Missing documentation on querying existing tables
Missing documentation on querying fields of existing tables
Missing documentation on filtering syntax
Missing documentation on indexing (see additional context).

Additional context

There is some rudimentary documentation on indexing, but some key points are missing and others are unclear:

a) There is no information on how to create a table with an index on an embedding field of dataType VECTOR_FLOAT when the embedding is not created by a "model" but is provided as part of the data during insert.

My use case: inserting billions of language sentences (STRINGs) together with their embeddings, and later querying them with an embedding vector to retrieve a sentence.

b) It is not clear whether the field name "Embedding" is a keyword or whether the embedding vector column can have an arbitrary name.

c) It is also unclear whether an externally built embedding that is inserted into the table along with the data will be indexed automatically (by its name "Embedding" or by its type "VECTOR_FLOAT").

d) Is it possible to index any dataType other than STRING? From the documentation:

"When creating tables, you can define indices to let Epsilla automatically create embeddings for the STRING fields"

And then later on:

"Then you can insert records in their raw format and let Epsilla handle the embedding"

This is followed by an example that inserts text data together with its embeddings, even though the "Embedding" column is not defined in the table (in the previous code snippet) and even though Epsilla promises to create the embeddings automatically.
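For points a) and b), the example script later in this thread declares a caller-supplied embedding column named "vector" rather than "Embedding", which suggests the name is not a fixed keyword. A minimal sketch of such a schema follows; the helper `make_table_fields` is hypothetical (not part of pyepsilla), the field dictionaries mirror that later script, and whether every column name is accepted is an assumption rather than documented behavior.

```python
def make_table_fields(vector_field="vector", dimensions=768, metric="COSINE"):
    """Build a table_fields list with a caller-supplied embedding column.

    Hypothetical helper: the dictionary keys follow the create_table call
    used in the example script in this thread.
    """
    return [
        {"name": "id", "dataType": "INT", "primaryKey": True},
        {"name": "sentence", "dataType": "STRING"},
        # The embedding column is filled by the caller at insert time.
        {"name": vector_field, "dataType": "VECTOR_FLOAT",
         "dimensions": dimensions, "metricType": metric},
    ]


fields = make_table_fields(vector_field="my_embedding")
```

The resulting list would be passed as `table_fields` to `create_table`, and the same column name would then be used as `query_field` when querying.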

pkpro commented 6 months ago

One more missing point: how to query the number of records in a table?

pkpro commented 6 months ago

Just to be clear, the following points are all about metadata (the list of databases, tables, and fields of the tables):

Missing documentation on querying existing databases
Missing documentation on querying existing tables
Missing documentation on querying fields of existing tables

richard-epsilla commented 6 months ago

Thank you so much for identifying the missing pieces of our documentation/functionality. We are on it.

pkpro commented 6 months ago

You may also state in your documentation and in the description of your database that inserts take constant time (at least that is what I've experienced with 4.7M records inserted into a single database). It would also be an advantage to state that index creation is not required at all, and that index creation is first and foremost a call to a model that creates the embeddings.

You may also use the following example in any form. It uses an external model to create embeddings, places them into the table, and then queries the data:

import os
import sys
import time
import orjson
import argparse
import pprint
from pyepsilla import vectordb
from sentence_transformers import SentenceTransformer

serial = 25000001
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2', device='cuda:0')

parser = argparse.ArgumentParser()
parser.add_argument('--host', type=str, required=True, help='epsilla database host')
parser.add_argument('--port', type=int, required=True, help='epsilla database port')
parser.add_argument('--dbname', type=str, required=True, help='epsilla database name')
# argparse's type=bool treats any non-empty string (even "False") as True, so use plain flags instead.
parser.add_argument('--insert', action='store_true', help='Insert the example sentences')
parser.add_argument('--create', action='store_true', help='Create the sentences table')
parser.add_argument('--threshold', type=float, required=False, default=0.06, help='Similarity threshold')

args = parser.parse_args()

with open("sentences.json", "r") as json_file:
    data = orjson.loads(json_file.read())
    sentences=[item['sentence'] for item in data]
    languages=[item['language'] for item in data]
    stime=time.time()
    embeddings = model.encode(sentences, batch_size=len(sentences))
    etime=time.time()
    print(f"Embeddings were generated in {etime-stime}s")

    epsilla_client=vectordb.Client(host=args.host, port=args.port)
    epsilla_client.load_db(db_name=args.dbname, db_path="/data/epsilla/")
    epsilla_client.use_db(db_name=args.dbname)

    if args.create:
      status_code, response = epsilla_client.create_table(
        table_name="sentences",
        table_fields=[
            {"name": "id", "dataType": "INT", "primaryKey": True},
            {"name": "sentence", "dataType": "STRING"},
            {"name": "language", "dataType": "STRING"},
            {"name": "vector", "dataType": "VECTOR_FLOAT", "dimensions": 768, "metricType": "COSINE"}
        ]
      )
      print(f"Table creation Status Code: {status_code}, response: {response}")
      if status_code not in (200, 409):
        sys.exit(1)

    if args.insert:
        # Number the records from serial+1 onward; a distinct loop name avoids shadowing the global `serial`.
        data_to_insert = [
          {'id': record_id, 'sentence': sentence, 'language': language, 'vector': embedding.tolist()}
          for record_id, (sentence, language, embedding) in enumerate(zip(sentences, languages, embeddings), start=serial + 1)
        ]

        status_code, response = epsilla_client.insert(
          table_name="sentences",
          records=data_to_insert
        )

        print(f"Insert status: {status_code}, response: {response}")

    stime=time.time()
    status_code, response = epsilla_client.query(
      table_name="sentences",
      query_field="vector",
      response_fields=["id", "language", "sentence"],
      query_vector=embeddings[0].tolist(),
      limit=10,
      with_distance=True
    )
    etime=time.time()

    print(f"Query status: {status_code}")
    if status_code == 200:
        # The small negative lower bound accounts for floating-point precision error:
        # the distance for an identical embedding is close to 0, but it might not be exactly 0 and may well be slightly negative.
        records=[record for record in response['result'] if -0.00001 < record['@distance'] < args.threshold]
        pp = pprint.PrettyPrinter(indent=2, width=120, depth=None, compact=False)
        pp.pprint(records)
    else:
        print(f"Epsilla Error: {response}")
    print(f"Epsilla responded to query in {etime-stime}s")
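The script above sends all records in a single insert call. For larger datasets, and to check the constant-time insert observation for yourself, the records could be sent in batches while timing each call. The following is a minimal sketch under stated assumptions: `insert_in_batches` and its `batch_size` default are hypothetical, and `insert_fn` stands in for whatever performs the actual insert.

```python
import time


def insert_in_batches(insert_fn, records, batch_size=10_000):
    """Call insert_fn on consecutive slices of records.

    Returns the wall-clock seconds spent in each call, so the per-batch
    cost can be compared as the table grows.
    """
    timings = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        t0 = time.perf_counter()
        insert_fn(batch)
        timings.append(time.perf_counter() - t0)
    return timings


# Stand-in for a real insert: collect the records in a local list.
stored = []
timings = insert_in_batches(stored.extend, [{"id": i} for i in range(25)], batch_size=10)
print(f"{len(timings)} batches, {len(stored)} records")
```

With the client from the script, `insert_fn` could be `lambda batch: epsilla_client.insert(table_name="sentences", records=batch)`, which reuses the same `insert` call shown above.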

The above code is to be used with data like:

[
  { "sentence": "- Two beautiful race cars are about to start. Which one will win, Bob?\n - Of course, the red one! Everyone knows that red cars are the fastest, Alice!", "language" : "en" },
  { "sentence": "- Два красивых гоночных автомобиля собираются начать гонку. Какой победит, Боб?\n - Конечно, красный! Все знают, что красные машины самые быстрые, Элис!", "language" : "ru" },
  { "sentence": "- Zwei wunderschöne Rennwagen stehen kurz vor dem Start. Welcher wird gewinnen, Bob?\n - Natürlich der rote! Jeder weiß, dass rote Autos die schnellsten sind, Alice!", "language" : "de" }
]

Running the script on this data produces output like:

Embeddings were generated in 0.23432421684265137s
[INFO] Connected to localhost:8888 successfully.
Query status: 200
[ { '@distance': -1.1920928955078125e-07,
    'id': 25000002,
    'language': 'en',
    'sentence': '- Two beautiful race cars are about to start. Which one will win, Bob?\n'
                ' - Of course, the red one! Everyone knows that red cars are the fastest, Alice!'},
  { '@distance': 0.04510009288787842,
    'id': 25000004,
    'language': 'de',
    'sentence': '- Zwei wunderschöne Rennwagen stehen kurz vor dem Start. Welcher wird gewinnen, Bob?\n'
                ' - Natürlich der rote! Jeder weiß, dass rote Autos die schnellsten sind, Alice!'},
  { '@distance': 0.05582815408706665,
    'id': 25000003,
    'language': 'ru',
    'sentence': '- Два красивых гоночных автомобиля собираются начать гонку. Какой победит, Боб?\n'
                ' - Конечно, красный! Все знают, что красные машины самые быстрые, Элис!'}]
Epsilla responded to query in 0.011300325393676758s
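The first hit in the output has a slightly negative `@distance` (-1.1920928955078125e-07, which equals -2**-23, the single-precision machine epsilon), which is consistent with rounding error on an exact match. A minimal sketch of the filter from the script as a standalone function follows; the name `filter_by_distance` and the `eps` parameter are illustrative, not part of pyepsilla, and the 0.05 threshold is chosen only for the demonstration.

```python
def filter_by_distance(records, threshold, eps=1e-5):
    """Keep records whose '@distance' lies in the open interval (-eps, threshold).

    The small negative lower bound tolerates rounding error on exact matches
    while still rejecting clearly invalid negative distances.
    """
    return [r for r in records if -eps < r["@distance"] < threshold]


# Distances and ids taken from the output above.
hits = [
    {"@distance": -1.1920928955078125e-07, "id": 25000002},
    {"@distance": 0.04510009288787842, "id": 25000004},
    {"@distance": 0.05582815408706665, "id": 25000003},
]
close = filter_by_distance(hits, threshold=0.05)
print([r["id"] for r in close])
```

With a threshold of 0.05 the English and German sentences pass and the Russian one is dropped; the script's default of 0.06 keeps all three.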

You have an amazing product, guys. I just stumbled upon it by chance and I'm really glad I found your project. Amazing performance and usability. I hope your project gets the attention it really deserves.

epsilla-cloud / vectordb

Documentation issues #134