LangStream / langstream

LangStream. Event-Driven Developer Platform for Building and Running LLM AI Apps. Powered by Kubernetes and Kafka.
https://langstream.ai
Apache License 2.0
393 stars 29 forks source link

PGVector as vector db #718

Closed Filo01 closed 1 year ago

Filo01 commented 1 year ago

I'm trying to use PGVector to store the embeddings produced by this crawler

I've added this in the crawler.yaml:

assets:
  - name: "documents-table"
    asset-type: "jdbc-table"
    creation-mode: create-if-not-exists
    config:
      table-name: "documents"
      datasource: "JdbcDatasource"
      create-statements:
        - |
          CREATE EXTENSION IF NOT EXISTS vector;
          CREATE TABLE documents (
          filename TEXT,
          chunk_id int,
          num_tokens int,
          lang TEXT,
          text TEXT,
          embeddings_vector vector,
          PRIMARY KEY (filename, chunk_id));

and modified this

- name: "Write"
    type: "vector-db-sink"
    input: chunks-topic
    configuration:
      datasource: "JdbcDatasource"
      table-name: "documents"
      fields:
        - name: "filename"
          expression: "value.filename"
          primary-key: true
        - name: "chunk_id"
          expression: "fn:toInt(value.chunk_id)"
          primary-key: true
        - name: "embeddings_vector"
          expression: "fn:toListOfFloat(value.embeddings_vector)"
        - name: "lang"
          expression: "value.language"
        - name: "text"
          expression: "value.text"
        - name: "num_tokens"
          expression: "value.chunk_num_tokens"

but I'm having this error when doing the INSERT ai.langstream.runtime.agent.AgentRunner$PermanentFailureException: org.postgresql.util.PSQLException: Can't infer the SQL type to use for an instance of java.util.ArrayList. Use setObject() with an explicit Types value to specify the type to use.

I also tried adding this in configuration.yaml:

  dependencies:
   ...
    - name: "PGVector"
      url: "https://repo1.maven.org/maven2/com/pgvector/pgvector/0.1.3/pgvector-0.1.3.jar"
      sha512sum: "84c37b191e20730a143963108aa11f5c4c28831b8c2637c1d639d2aaaa44f4803461e5b68c353ea51ba8abddde837796ebad65572aa2ac95c0fcbfd76cbf73d1"
      type: "java-library"

And creating a PGVector via EL like this

        - name: "embeddings_vector"
          expression: "new com.pgvector.PGvector(fn:toListOfFloat(value.embeddings_vector))"```

but it's probably invalid as EL and I'm not sure how to proceed from there

Filo01 commented 1 year ago

I solved it by avoiding the expression language completely by doing it this way instead:

  - name: "Write"
    type: "query"
    input: chunks-topic
    configuration:
      datasource: "JdbcDatasource"
      mode: "execute"
      query: |
        INSERT INTO documents(filename,chunk_id,num_tokens,lang,text,embeddings_vector) values ( ? , ? , ? , ? , ? , CAST( ? as vector ) );
      output-field: "value.command_result"
      fields:
        - "value.filename"
        - "fn:toInt(value.chunk_id)"
        - "fn:toInt(value.chunk_num_tokens)"
        - "value.language"
        - "value.text"
        - "fn:toString(fn:toListOfFloat(value.embeddings_vector))"
eolivelli commented 1 year ago

This a full solution with PG https://github.com/LangStream/langstream/tree/main/examples/applications/query-postgresql-chat-history