Overview
The core details are:
The previous search tool was extremely naive and relied entirely on String.contains.
There was no spell correction, nor an easy way to add it.
Results came back in an arbitrary order and weren't easily scored (read: we weren't showing the most relevant post first).
We had little flexibility to make some fields 'nice to have' rather than 'required'; having that option lets us fine-tune searching.
A search engine ('SE' for short, the abbreviation used in the codebase) is a much more appropriate tool for searching and scoring documents. This change migrates the searching, scoring, and ranking logic to OpenSearch, which returns an ordered list of SearchItem instances; we then use that list to return an ordered list of PostItem instances to the user.
Why OpenSearch?
OpenSearch is an open-source search engine forked from Elasticsearch, so it is more approachable than something like Solr for developers already familiar with ES. Although it's not the most straightforward app to run from Docker, it's not too painful to set up locally.
We could use any search engine here, as we're not doing anything complex enough to strain the boundaries of one particular tool.
How are we using OpenSearch?
The existing GET /posts?... URL stays the same, but instead of querying the DB with a whole bunch of boolean filters, we now query the SE with a combination of fields (details available on request). The SE returns the matching documents ordered from most to least relevant, and we return PostItems to the user in that same order.
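As a rough illustration of what "a combination of fields" could look like, here is a minimal sketch of a query body using the bool query from the OpenSearch query DSL. The field names (body, title, tags), the boost value, and the build_search_query helper are assumptions made for this example; they are not the fields or code used in this change.

```python
from typing import Any, Dict, Optional


def build_search_query(
    text: str, required_tag: Optional[str] = None, size: int = 20
) -> Dict[str, Any]:
    """Build a query-DSL request body for a hypothetical post index."""
    bool_query: Dict[str, Any] = {
        # 'Required': a document must match the body text to be returned.
        # fuzziness "AUTO" tolerates small typos in the search terms.
        "must": [{"match": {"body": {"query": text, "fuzziness": "AUTO"}}}],
        # 'Nice to have': a title match is optional but raises the score.
        "should": [{"match": {"title": {"query": text, "boost": 2.0}}}],
    }
    if required_tag is not None:
        # Filters narrow the result set without contributing to the score.
        # Assumes 'tags' is mapped as a keyword field.
        bool_query["filter"] = [{"term": {"tags": required_tag}}]
    return {"size": size, "query": {"bool": bool_query}}
```

The split between must and should is what gives the 'required' vs. 'nice to have' flexibility, and the relevance score produced by the SE determines the order of the results.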
Why a SE and a DB at the same time?
The DB is a persistent data store for posts, stored more-or-less exactly as the user uploaded their content, plus extra metadata for us (e.g. createdAt, updatedAt). The SE is a tool that transforms data into its most easily searched form, often by doing compute work up front and storing the results on disk (trading disk space for search speed).
While we could use OpenSearch as a persistent database, we'd need to maintain very clear boundaries between user data and the indexed fields used for searching: one of the main benefits of a search engine is its ability to manipulate data into fields for faster or more powerful searching. Storing these derived fields alongside the core user data has the potential to get a bit messy, and we'd be indexing data (such as reportCount) that would never be searched.
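The boundary between stored data and searchable data can be sketched as a simple projection. The field lists below are assumptions for illustration (only createdAt, updatedAt, and reportCount are named in this description), and to_search_item is a hypothetical helper, not necessarily how the codebase builds a SearchItem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostItem:
    """What the DB stores: user content plus our own metadata."""
    id: str
    title: str
    body: str
    created_at: str
    updated_at: str
    report_count: int


@dataclass(frozen=True)
class SearchItem:
    """What the SE indexes: only the fields we actually search on."""
    id: str
    title: str
    body: str


def to_search_item(post: PostItem) -> SearchItem:
    # Metadata such as report_count is deliberately left out of the index.
    return SearchItem(id=post.id, title=post.title, body=post.body)
```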
Code flow
For anything that doesn't involve searching, the code stays exactly the same:
For a search request the API hits the SE, the SE returns a list of PostItem IDs to the API, and the API loads those documents from the DB:
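The search flow described above can be sketched as follows. The two callables stand in for the SE and DB clients and are assumptions for this example, as is the choice to silently skip IDs that the DB no longer has. The important step is re-applying the SE's order, since a bulk DB load generally does not preserve it.

```python
from typing import Callable, Dict, Iterable, List


def search_posts(
    query: str,
    search_ids: Callable[[str], List[str]],
    load_posts: Callable[[List[str]], Iterable[Dict]],
) -> List[Dict]:
    """Return posts from the DB in the relevance order given by the SE."""
    # 1. The SE returns post IDs, most relevant first.
    ranked_ids = search_ids(query)
    # 2. The DB loads the full posts, in whatever order it likes.
    posts_by_id = {post["id"]: post for post in load_posts(ranked_ids)}
    # 3. Re-apply the SE's ranking, skipping IDs missing from the DB.
    return [posts_by_id[i] for i in ranked_ids if i in posts_by_id]
```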
Creating, updating and deleting a post basically duplicates the existing flow for the DB onto the SE:
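A minimal sketch of that duplicated write path, using two in-memory dictionaries as stand-ins for the DB and the SE index. PostWriter and its method names are hypothetical, and the indexed fields (title, body) are assumptions for the example.

```python
from typing import Dict


class PostWriter:
    """Applies every write to the DB first, then mirrors it onto the SE."""

    def __init__(self) -> None:
        self.db: Dict[str, dict] = {}     # stand-in for the database
        self.index: Dict[str, dict] = {}  # stand-in for the SE index

    def save(self, post: dict) -> None:
        """Create or update a post in both stores."""
        # The DB keeps the full post as uploaded.
        self.db[post["id"]] = dict(post)
        # The SE only receives the searchable fields.
        self.index[post["id"]] = {"title": post["title"], "body": post["body"]}

    def delete(self, post_id: str) -> None:
        """Remove a post from both stores; unknown IDs are ignored."""
        self.db.pop(post_id, None)
        self.index.pop(post_id, None)
```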