LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org

Investigate dedicated full-text search with e.g. Meilisearch or Elasticsearch #3713

Closed poVoq closed 1 year ago

poVoq commented 1 year ago

Requirements

Is your proposal related to a problem?

The search in Lemmy could be much improved by utilizing an existing search middleware that indexes in memory. This is common practice in other Fediverse software and, as an optional feature, would be useful for bigger instances.

Describe the solution you'd like.

Elasticsearch is the most commonly supported option, but the OpenSearch fork or the alternative Meilisearch would also work well.

Meilisearch already has an easy-to-use Rust integration: https://github.com/meilisearch/meilisearch-rust

Describe alternatives you've considered.

n/a

Additional context

No response

phiresky commented 1 year ago

Probably better to start by using PostgreSQL FTS and only move to more complex solutions if that actually turns out not to be enough.
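For context, a minimal sketch of what plain Postgres FTS looks like, assuming Lemmy's comment table with a content text column and no dedicated index yet:

-- ad-hoc full-text match over comment bodies
SELECT id, content
FROM comment
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'federation search')
LIMIT 20;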

marsara9 commented 1 year ago

For the short term there's always my project: https://www.search-lemmy.com, which uses PostgreSQL FTS just like @phiresky suggests.

It only supports posts and communities at the moment, but more will come in the future.

Long term, @phiresky, feel free to steal the queries that I have in my project. In theory Lemmy would just need an additional calculated column on each object to store the tsvector data, and then you'd query against that.
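As a rough illustration of that idea against Lemmy's own schema (the table and column names here are assumptions), the calculated column plus index could look something like:

-- generated tsvector column on the post table, kept up to date automatically
ALTER TABLE post
    ADD COLUMN search_vec tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english', name || ' ' || coalesce(body, ''))
    ) STORED;

-- GIN index so searches against the stored vector stay fast
CREATE INDEX idx_post_search ON post USING GIN (search_vec);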

poVoq commented 1 year ago

The main Postgres database already seems to be a bottleneck, and search is a database-heavy operation.

Offloading that to a separate system would benefit larger instances a lot, I think, and Meilisearch comes with quite a few additional benefits like fuzzy search and spelling-error correction.

marsara9 commented 1 year ago

@poVoq if you think this is best kept as a separate service, can you raise an issue on https://www.github.com/marsara9/lemmy-search and I can explore whether Meilisearch might be a better fit?

I assume that's what you're suggesting with "a separate system"?

phiresky commented 1 year ago

"The main postgres database seems to be already a bottleneck"

The database is not a bottleneck currently - poorly written and redundant queries are a bottleneck. We're far from the scalability limits of PostgreSQL.

poVoq commented 1 year ago

@marsara9 no, I meant a separate system interfacing with Lemmy directly, similar to how Pict-rs is used for media, which is what my original proposal is about.

marsara9 commented 1 year ago

Gotcha. Technically the project I'm working on can run alongside Lemmy itself. I'd just need to redo the crawler to use the existing DB rather than relying on API calls to fetch the data.

Assuming you use the existing DB, again you'd just need a new column on each object type to store the tsvector data that you ultimately query against. Query times this way are also sub-second. You can test that on my website; just make sure not to apply any filters (those slow the query down a bit).

The other catch is that this can only be a local search. If a remote fetch is required, the query time goes up substantially.

phiresky commented 1 year ago

Probably don't need new columns at all for basic search; just an index ON to_tsvector(body) is enough.

marsara9 commented 1 year ago

I tried that in my project and performance was abysmal. Once I added a computed column, I was back to less than a second for all queries (those that didn't involve one of my custom filters).

phiresky commented 1 year ago

That doesn't really make sense; performance should be the same for searches. The index stores the computed value just as if you had added a computed column; only INSERTs and UPDATEs will be slower. Did you check the query plans? Maybe your expression differed between the index and what you actually searched for.

The docs also mention that you must pass two arguments to to_tsvector for it to work: "Notice that the 2-argument version of to_tsvector is used. Only text search functions that specify a configuration name can be used in expression indexes" (Section 11.7).

phiresky commented 1 year ago

Just tried it with:

create index on comment using gin(to_tsvector('english', content));
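-- expression index using the 2-argument form of to_tsvector, which is required for expression indexes
-- the query's WHERE clause must repeat the exact same expression for the planner to use this index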
explain analyze select * from comment where to_tsvector('english', content) @@ to_tsquery('foo & bar');

Full-text searches take 0.5 ms on a production instance with 700k comments.

marsara9 commented 1 year ago

Yeah, I probably had something misconfigured the first time I tried setting it up.

But I'm currently using

ALTER TABLE posts
    ADD COLUMN IF NOT EXISTS com_search TSVECTOR
    GENERATED ALWAYS AS (
        to_tsvector('english', "name") || ' ' || to_tsvector('english', coalesce(body, ''))
    ) STORED;

and

CREATE INDEX IF NOT EXISTS idx_search ON posts USING GIN(com_search);

And it's working beautifully, so no reason to mess with it at the moment.
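For completeness, a read query against that generated column might look roughly like this (the ranking and limit are illustrative and not taken from the project):

SELECT id, "name"
FROM posts
WHERE com_search @@ websearch_to_tsquery('english', 'lemmy federation')
ORDER BY ts_rank(com_search, websearch_to_tsquery('english', 'lemmy federation')) DESC
LIMIT 25;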

ancientmarinerdev commented 1 year ago

This is a critical point that I made on a PR and probably should have raised as an issue. IMHO this becomes critical in the medium to long term, once the current inefficiencies and bottlenecks have been tightened up. I've copied a few of my comments across.

https://github.com/LemmyNet/lemmy/pull/3719#issuecomment-1652546959

"Should search directly search DB? Shouldn't something like elasticsearch be considered so it can sit in front of/alongside the persistence? It might open more options for caching also. If some common queries can be cached, it could be reduced, even if it is a very short lived cache. The more elaborate the search, the less likely there is to be an hit, but splitting queries out or any technology that can help support that might just cut a significant load from the DB which will help massively with scalability. Someone will inevitably try expensive queries to kill servers. Being defensive here could be important.

There is probably some investigation that could go into this such as types of queries, replication of data etc. Community names could probably be quite an easy cache as it's a commonly undertaken activity."

https://github.com/LemmyNet/lemmy/pull/3719#issuecomment-1652683080

"It would add complexity, that is something I am not going to disagree, but I don't think they add unjustifiable complexity. Most websites running at country scale will use a Lucene, Solr, or more recently Elasticsearch for covering search. Taking load off the database is critical, because all extra concurrent load makes all transactions slower until it gets to the point it buckles. Even if it doesn't buckle, the delayed reaction times impact on users and their experience and eventually those doubts start to build up about whether this is good or reliable.

I suggested the following because from what I have seen, at scale these technologies are favoured. I don't know any large website that allows database searching without any form of caching."

"Most top websites when accessing immutable content, will try to cache first to cut load. If a query is run more than 96 times a day, a 15 min cache is going to provide a reduction in the amount of work, assuming they are evenly divided. They are returning a result rather than doing the same computations again. Yes, the data can maybe be stale, but who needs the data to be that real time for search. Even an hour cache is hardly an issue from a search perspective.

In tech, it's important to use the best tool for the job, it isn't always advisable to stick with simple stacks when the demands are greater. The last few weekends, there has been bad actors really testing the limits of Lemmy, and they seem quite motivated. By allowing search straight onto the DB, you're putting the DB in the hands of bad actors which is a very risky move. So far, it's not going smoothly. They're going to keep probing and poking where there is weaknesses."
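To make the community-name example concrete: even without Elasticsearch, that particular case could be cached inside Postgres itself, for instance with a periodically refreshed materialized view (purely a sketch; the column names and refresh cadence are assumptions):

-- snapshot of community names/titles with a precomputed search vector
CREATE MATERIALIZED VIEW community_search AS
SELECT id,
       name,
       title,
       to_tsvector('english', name || ' ' || coalesce(title, '')) AS search_vec
FROM community;

-- a unique index is required for REFRESH ... CONCURRENTLY
CREATE UNIQUE INDEX idx_community_search_id ON community_search (id);
CREATE INDEX idx_community_search_vec ON community_search USING GIN (search_vec);

-- refreshed on a schedule (e.g. every 15 minutes) by a background job
REFRESH MATERIALIZED VIEW CONCURRENTLY community_search;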

WayneSheppard commented 1 year ago

Is there any evidence that searching is more than 1% of the load on the DB? 1000 searches per day is nothing compared to 1000 inserts per second from federation. IMO, a better idea of the performance impact would help prioritize this request.

Nutomic commented 1 year ago

This is unnecessary for now.

codenyte commented 5 months ago

Perhaps this could be implemented as a plugin, using the new plugin API (#4695).