epeters3 / gospel-search


Simplify Architecture #27

Open · epeters3 opened this issue 8 months ago

epeters3 commented 8 months ago

By refactoring to perform incremental updates instead of whole-system re-ingests, we can eliminate the need for an intermediate datastore (i.e. MongoDB) and ingest directly from the Gospel Library site into the search engine. Also, by refactoring to use a vector database instead of ElasticSearch, we can eliminate the need for a re-ranking service. This would simplify the Data Transformation Pipeline to:

```mermaid
sequenceDiagram
    actor Operator
    Operator->>Worker: PUT /ingest
    Vector DB->>Worker: current state
    Note over Worker: determine which pages <br/> haven't been ingested yet
    Worker->>Gospel Library: requests
    Gospel Library->>Worker: web pages
    Note over Worker: extract segments and embed
    Worker->>Vector DB: new segments and embeddings
```
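
A minimal sketch of the incremental-update step, assuming segments are stored in a Chroma collection with the source page URL in each segment's metadata; the collection name and the `fetch_page`, `extract_segments`, and `embed` helpers are hypothetical:

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("segments")

def already_ingested(url: str) -> bool:
    # "Current state" check: any segment from this page already stored?
    return len(collection.get(where={"url": url}, limit=1)["ids"]) > 0

def ingest(candidate_urls: list[str]) -> None:
    for url in candidate_urls:
        if already_ingested(url):
            continue  # page was ingested on a previous run; skip it
        segments = extract_segments(fetch_page(url))  # hypothetical helpers
        collection.add(
            ids=[f"{url}#{i}" for i in range(len(segments))],
            documents=segments,
            embeddings=[embed(s) for s in segments],
            metadatas=[{"url": url} for _ in segments],
        )
```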

It would also simplify the front-end application stack to:

```mermaid
sequenceDiagram
    actor User
    User->>Proxy Server: GET /
    Proxy Server->>User: client app
    User->>Proxy Server: GET /api/search
    Proxy Server->>Vector DB: search query
    Vector DB->>Proxy Server: top-k segments
    Proxy Server->>User: search results
```
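
The Proxy Server -> Vector DB hop in that diagram reduces to a single vector query. A minimal sketch, assuming the same hypothetical `segments` collection and Chroma's built-in query API:

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("segments")

def search(query: str, k: int = 10) -> list[str]:
    # Chroma embeds the query text and returns the k nearest segments;
    # no separate re-ranking service is needed.
    results = collection.query(query_texts=[query], n_results=k)
    return results["documents"][0]  # top-k segment texts for the one query
```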

Whichever Vector DB solution we choose must meet these requirements:

epeters3 commented 2 months ago

I've refactored the worker to use Chroma DB. I left off trying to get the documents to index into Chroma successfully in gospel_search/chroma/import_segments.py. I'm assuming the documents are the wrong type: it looks like they are already an array of batches, where each document is a talk or chapter and is a dictionary with a single key, segments, whose value is an array of the segment objects.
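
If that's right, the fix is flattening the nested structure into the flat, parallel lists Chroma's `add` expects. A sketch, assuming each batch element is a dict like `{"segments": [...]}` and each segment object carries `id` and `text` fields (both field names are assumptions):

```python
# Flatten batches -> documents -> segments into parallel lists for Chroma.
def flatten_batches(batches):
    ids, texts = [], []
    for batch in batches:          # the data is already an array of batches
        for document in batch:     # each document is a talk or chapter
            for segment in document["segments"]:
                ids.append(segment["id"])
                texts.append(segment["text"])
    return ids, texts

# `batches` would come from the existing export step in import_segments.py.
ids, texts = flatten_batches(batches)
collection.add(ids=ids, documents=texts)
```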

epeters3 commented 2 months ago

I've migrated to use Chroma DB. Latency is now 0.9 seconds, which is not ideal, but it's very nice to have the embeddings persisted across start-ups now. I still need to remove MongoDB from the source code and refactor the Proxy Server to use Chroma instead of ElasticSearch.
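
The persistence win comes from pointing Chroma at a directory instead of using the in-memory client; a minimal sketch (the path is an assumption, any writable directory works):

```python
import chromadb

# An in-memory client loses all embeddings on restart:
# client = chromadb.Client()

# A persistent client writes to disk, so embeddings survive start-ups:
client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("segments")
```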

epeters3 commented 2 months ago

I've begun migrating the API to be Python-based instead of Next.js-based. The Python API now serves the built Next.js bundle. Left off migrating the Next.js API methods to Python.
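
A minimal sketch of serving the built bundle from Python, assuming a FastAPI app (the framework the back-end ends up on, per the later comments) and a static export in ./out; both the directory and the route layout are assumptions:

```python
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

app = FastAPI()

# API routes get registered here, before the static mount,
# so /api/* is matched ahead of the catch-all.

# html=True makes / serve out/index.html like a regular web server.
app.mount("/", StaticFiles(directory="out", html=True), name="client")
```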

epeters3 commented 2 months ago

Finished migrating the /api/search route over to the new Python back-end. Chroma DB latency is now very good (the embedding model runs on the GPU and is shared across all requests). Left off migrating my OpenAI flow to use Langchain in the Python backend.
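
A sketch of the shared-model setup behind that latency win, assuming a sentence-transformers embedding model; the model name, collection name, and default k are assumptions:

```python
import chromadb
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Load the embedding model once at startup, on the GPU, and reuse it for
# every request instead of paying the model-load cost per request.
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
collection = chromadb.PersistentClient(path="./chroma").get_or_create_collection("segments")

@app.get("/api/search")
def search(q: str, k: int = 10):
    embedding = model.encode(q).tolist()  # reuse the shared GPU model
    results = collection.query(query_embeddings=[embedding], n_results=k)
    return {"results": results["documents"][0]}
```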

epeters3 commented 2 months ago

I've finished migrating from MongoDB + ElasticSearch + Next.js API to Chroma and a FastAPI-powered back-end. Also, I'm now using Langchain instead of the OpenAI SDK.
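
For reference, the shape of the Langchain swap, as a hedged sketch: the model name and prompt are placeholders, and the import paths depend on the installed Langchain version (older releases expose ChatOpenAI from langchain.chat_models instead).

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided segments."),
    ("human", "Segments:\n{segments}\n\nQuestion: {question}"),
])

# Pipe the prompt into the model; invoke() replaces the raw SDK call.
chain = prompt | llm
answer = chain.invoke({"segments": "...", "question": "..."}).content
```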