epeters3 / gospel-search


Simplify Architecture #27

Open · epeters3 opened this issue 8 months ago

epeters3 commented 8 months ago

By refactoring to perform incremental updates instead of whole-system re-ingests, we can eliminate the need for an intermediate datastore (i.e. MongoDB) and ingest directly from the Gospel Library site into the search engine. Also, by refactoring to use a vector database instead of ElasticSearch, we can eliminate the need for a re-ranking service. This would simplify the Data Transformation Pipeline to:

```mermaid
sequenceDiagram
    actor Operator
    Operator->>Worker: PUT /ingest
    Vector DB->>Worker: current state
    Note over Worker: determine which pages <br/> haven't been ingested yet
    Worker->>Gospel Library: requests
    Gospel Library->>Worker: web pages
    Note over Worker: extract segments and embed
    Worker->>Vector DB: new segments and embeddings
```
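
A minimal sketch of the incremental-update step, assuming segments are stored in a Chroma collection with the source page URL in each segment's metadata; the collection name and the `fetch_page`, `extract_segments`, and `embed` helpers are hypothetical:

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("segments")

def already_ingested(url: str) -> bool:
    # "Current state" check: any segment from this page already stored?
    return len(collection.get(where={"url": url}, limit=1)["ids"]) > 0

def ingest(candidate_urls: list[str]) -> None:
    for url in candidate_urls:
        if already_ingested(url):
            continue  # page was ingested on a previous run; skip it
        segments = extract_segments(fetch_page(url))  # hypothetical helpers
        collection.add(
            ids=[f"{url}#{i}" for i in range(len(segments))],
            documents=segments,
            embeddings=[embed(s) for s in segments],
            metadatas=[{"url": url} for _ in segments],
        )
```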

It would also simplify the front-end application stack to:

```mermaid
sequenceDiagram
    actor User
    User->>Proxy Server: GET /
    Proxy Server->>User: client app
    User->>Proxy Server: GET /api/search
    Proxy Server->>Vector DB: search query
    Vector DB->>Proxy Server: top-k segments
    Proxy Server->>User: search results
```
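
The Proxy Server -> Vector DB hop in that diagram reduces to a single vector query. A minimal sketch, assuming the same hypothetical `segments` collection and Chroma's built-in query API:

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("segments")

def search(query: str, k: int = 10) -> list[str]:
    # Chroma embeds the query text and returns the k nearest segments;
    # no separate re-ranking service is needed.
    results = collection.query(query_texts=[query], n_results=k)
    return results["documents"][0]  # top-k segment texts for the one query
```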

Whichever Vector DB solution we choose must meet these requirements:

epeters3 commented 2 months ago

I've refactored the worker to use Chroma DB. I left off trying to get the documents to index into Chroma successfully in gospel_search/chroma/import_segments.py. I'm assuming the documents are the wrong type: it looks like they are already an array of batches, where each document is a talk or chapter and is a dictionary with a single key, segments, whose value is an array of the segment objects.
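
If that's right, the fix is flattening the nested structure into the flat, parallel lists Chroma's `add` expects. A sketch, assuming each batch element is a dict like `{"segments": [...]}` and each segment object carries `id` and `text` fields (both field names are assumptions):

```python
# Flatten batches -> documents -> segments into parallel lists for Chroma.
def flatten_batches(batches):
    ids, texts = [], []
    for batch in batches:          # the data is already an array of batches
        for document in batch:     # each document is a talk or chapter
            for segment in document["segments"]:
                ids.append(segment["id"])
                texts.append(segment["text"])
    return ids, texts

# `batches` would come from the existing export step in import_segments.py.
ids, texts = flatten_batches(batches)
collection.add(ids=ids, documents=texts)
```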

epeters3 commented 2 months ago

I've migrated to use Chroma DB. Latency is now 0.9 seconds, which is not ideal, but it's very nice to have the embeddings persisted across start-ups now. I still need to remove MongoDB from the source code and refactor the Proxy Server to use Chroma instead of ElasticSearch.
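
The persistence win comes from pointing Chroma at a directory instead of using the in-memory client; a minimal sketch (the path is an assumption, any writable directory works):

```python
import chromadb

# An in-memory client loses all embeddings on restart:
# client = chromadb.Client()

# A persistent client writes to disk, so embeddings survive start-ups:
client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("segments")
```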

epeters3 commented 2 months ago

I've begun migrating the API to be Python-based instead of Next.js-based. The Python API now serves the built Next.js bundle. Left off migrating the Next.js API methods to Python.
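
A minimal sketch of serving the built bundle from Python, assuming a FastAPI app (the framework the back-end ends up on, per the later comments) and a static export in ./out; both the directory and the route layout are assumptions:

```python
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

app = FastAPI()

# API routes get registered here, before the static mount,
# so /api/* is matched ahead of the catch-all.

# html=True makes / serve out/index.html like a regular web server.
app.mount("/", StaticFiles(directory="out", html=True), name="client")
```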

epeters3 commented 2 months ago

Finished migrating the /api/search route over to the new Python back-end. Chroma DB latency is now very good (the embedding model runs on the GPU and is shared across all requests). Left off migrating my OpenAI flow to use Langchain in the Python backend.
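
A sketch of the shared-model setup behind that latency win, assuming a sentence-transformers embedding model; the model name, collection name, and default k are assumptions:

```python
import chromadb
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Load the embedding model once at startup, on the GPU, and reuse it for
# every request instead of paying the model-load cost per request.
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
collection = chromadb.PersistentClient(path="./chroma").get_or_create_collection("segments")

@app.get("/api/search")
def search(q: str, k: int = 10):
    embedding = model.encode(q).tolist()  # reuse the shared GPU model
    results = collection.query(query_embeddings=[embedding], n_results=k)
    return {"results": results["documents"][0]}
```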

epeters3 commented 2 months ago

I've finished migrating from MongoDB + ElasticSearch + Next.js API to Chroma and a FastAPI-powered back-end. Also, I'm now using Langchain instead of the OpenAI SDK.
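
For reference, the shape of the Langchain swap, as a hedged sketch: the model name and prompt are placeholders, and the import paths depend on the installed Langchain version (older releases expose ChatOpenAI from langchain.chat_models instead).

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided segments."),
    ("human", "Segments:\n{segments}\n\nQuestion: {question}"),
])

# Pipe the prompt into the model; invoke() replaces the raw SDK call.
chain = prompt | llm
answer = chain.invoke({"segments": "...", "question": "..."}).content
```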