TryGhost / Ghost

Independent technology for modern publishing, memberships, subscriptions and newsletters.
https://ghost.org

[Discussion] Solving Search in Ghost? #5321

Closed ErisDS closed 5 years ago

ErisDS commented 9 years ago

The issue to add Search to the Posts API is literally the oldest open issue in this repository. It’s a specialist subject, so we’ve always hoped someone with specialist knowledge would come forward to help us solve it, but unfortunately that hasn’t happened.

To try to increase the visibility of the problem, we set up a page advertising contributor roles - detailing that we were looking for someone to help add search to Ghost. That didn’t work either :(

Unfortunately, adding search is a particularly complex challenge. This is another attempt to get engagement from the wider OSS community on how to do this - we’d really appreciate the opinions and input of anyone who knows their search-fu.


Use cases

There are two key use cases for search inside of Ghost itself - one being in the admin panel, to find a post, tag, user or perhaps even a setting that you want to change. This would make a significant improvement to the usability of the admin interface.

The other is adding search to the frontend of a blog, so that it becomes easier to find content. The main things that need to be searchable are post titles and content, tag names and descriptions, user names, and probably meta titles and descriptions for the different resources. There are plenty of use cases where other things would also need to be searchable, so this will need to be extensible, but these are the key fields.

Approaches

The best way to make both of these use cases possible, as well as making the search feature available to any apps or other extensions of Ghost in future, is to make it available through the API.

There are two general approaches to solving search in Ghost that I am aware of - the first is to use the full-text search (FTS) features of SQL databases, and the second is to use a third party search tool of some description.

The upside to FTS is that, if we get it working well, it could work for all Ghost installs without any need to install additional (and likely complex) dependencies. The downside is that getting FTS to work for all three of sqlite, MySQL and PG is a sizeable challenge & the status of FTS in knex is unknown - but perhaps it can be done as a bookshelf/knex plugin, in a way that will benefit the wider community as well?
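
To illustrate the scale of that challenge, here's a rough, purely hypothetical sketch of how differently the raw FTS query looks on each database via knex - the table and column names are assumptions, not real Ghost code:

```js
// Purely illustrative - shows how the raw FTS query differs per database.
// Table/column names and the config shape are assumptions.
var config = require('../config'),
    knex = require('knex')(config.database);

function searchPosts(term) {
    switch (config.database.client) {
    case 'sqlite3':
        // needs a separate FTS virtual table, e.g.
        // CREATE VIRTUAL TABLE posts_fts USING fts4(title, markdown)
        return knex.raw('SELECT rowid FROM posts_fts WHERE posts_fts MATCH ?', [term]);
    case 'mysql':
        // needs a FULLTEXT index on (title, markdown)
        return knex.raw('SELECT id FROM posts WHERE MATCH (title, markdown) AGAINST (?)', [term]);
    case 'pg':
        // tsvector/tsquery; an expression index would be needed to make this fast
        return knex.raw(
            "SELECT id FROM posts WHERE to_tsvector('english', title || ' ' || markdown) @@ plainto_tsquery('english', ?)",
            [term]
        );
    }
}
```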

The upside to using a 3rd party tool is that there are many modern and advanced systems that we could leverage to make an exceptional search feature. The downside is that most of them require additional complex dependencies* to be installed, which means they aren’t viable options for Ghost core.

One of the tools that has been recommended is lunr.js, and there is a plugin for Ghost themes which combines lunr & the RSS feed to make search possible on the frontend. Supposedly lunr also works in node, not just the browser, but I’ve not found much information about it.
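
From what I can tell, basic usage in node would look roughly like this - the field names are just examples, and the exact builder API depends on the lunr version:

```js
var lunr = require('lunr');

// Illustrative only - build an in-memory index over some posts (lunr 0.x-style API).
var index = lunr(function () {
    this.ref('id');
    this.field('title', { boost: 10 });
    this.field('markdown');
});

// 'posts' is whatever collection of post objects we already have to hand
posts.forEach(function (post) {
    index.add({ id: post.id, title: post.title, markdown: post.markdown });
});

// Returns an array like [{ ref: '1', score: 0.42 }] - refs map back to post ids.
var results = index.search('publishing platform');
```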

Other tools that are popular for search include lucene and elasticsearch and there are also numerous modules for node built on top of leveldb and redis databases. There are likely solutions out there that I haven’t heard about, so I’d be interested to hear ideas from others.

* complex dependencies include binaries or external services - Ghost is currently installable on almost any platform without too much fiddling with compilers etc, and we want to keep it that way!

Moving forward

The ideal solution, in my mind, looks something like:

As part of Zelda, we want to introduce a search bar to the admin interface which provides auto-complete style results for posts, tags and users. This should be possible using our existing API, some minor modifications (to allow us to fetch certain fields) and Ember. It’s not ideal, but I think something incredibly rudimentary is better than nothing at all! This will be spec’d in a separate issue.

Continuing this discussion

I’m really, really looking for the community to get involved and share their ideas here. This is a relatively specialist subject and I know there are tonnes of developers out there that have a great deal more experience implementing search than I do.

I’d like to hear thoughts, ideas and experiences on FTS - will we be able to make it work for all 3 databases without having to write too much custom code? Could a bookshelf plugin work? Is it an absolute nightmare not worth pursuing?

What about lunr? Is it a viable server-side option? Does it have too many requirements or not work for large data sets, or is it the perfect basic option provided it can easily be overridden with elastic search or something else?

Is there another solution that isn’t mentioned here? Remember we can’t have leveldb or redis or other complex dependencies in core and it has to work across sqlite3, mysql and postgres.

And what about the API? What would a good search API look like? Who out there has a great RESTful search API that we should take inspiration from? If you know someone you think could answer any of these questions, please link them here and ask them to get involved!

Thanks :)

riyadhalnur commented 9 years ago

I haven't seen any viable option out there yet except lunr. Norch looks promising, but the documentation needs more work. Norch follows almost the same syntax as lunr, so it should be fairly easy for anyone to get started with. My 2 cents

ErisDS commented 9 years ago

@riyadhalnur norch fails the requirement of not needing to install complex dependencies as it is based on levelDB.

CWSpear commented 9 years ago

@nichcurtis is a search master. He says Sphinx fits your bill and has an npm module, but he hasn't used it with node before.

Pinging him to try and get him involved!

olivernn commented 9 years ago

Lunr should work just as well in node as it does in the browser. It doesn't have any dependencies, so should be straightforward for ghost users to set up.

I don't know the scale of the indexing/searching you are trying to achieve but lunr can scale to fairly large corpus sizes, though it is never going to compete with the likes of Solr, ElasticSearch etc. I'm more than happy to try and help with any load testing you might want/need to do though.

Let me know if I can be of any help in making the decision, even if it is to say that lunr isn't the right fit.

CWSpear commented 9 years ago

Lunr's scalability would be directly related to memory, correct? Isn't it all stored in memory when it does the searches? 100MB worth of data is a decent sized blog, but not unfeasible, and since this app doesn't normally have a huge need of memory (I'm running my blog on a 500 MB VM), it could definitely be a problem, no?

ErisDS commented 9 years ago

I imagine that for very large blogs, the memory footprint could get problematic, but whether or not this is a serious issue for 'most blogs' would depend very much on the relationship between amount of text & the size of lunr's footprint. @olivernn some sort of testing around lunr.js memory footprints for an average blog would be really helpful here I think - I'm sure we could muster up some anonymised stats on what an average blog looks like if needed.

In terms of solving this elegantly, an idea we're working towards at the moment is to start delivering 'internal apps' - where bits of new functionality are delivered as an 'app' which is automatically installed and either enabled or disabled. Search could be delivered this way, as an app which is always present but disabled by default.

The benefit to this is that not everyone wants or needs search for their blog - so if you're happy without it, you can leave it disabled. If you want it you can enable it, and if it's too memory intensive on your blog you can (in theory) swap it out for the 'ElasticSearch' app instead. There are a lot of puzzle pieces missing for apps still, but it's worth knowing this is the way we intend to move forward.

olivernn commented 9 years ago

@CWSpear lunr is an in memory search index, and its memory usage is going to be related to the number and size of documents being indexed, so memory use may be an issue for larger blogs.

@ErisDS I can put together some benchmarks of lunr's memory footprint if you can provide some representative sample data. It will at least give you a data point to help you make the decision.

I don't know Ghost at all but from what I see it is using SQLite as a datastore? Doesn't SQLite have a full text search index that could be used here? Or am I completely misunderstanding the architecture of Ghost.

halfdan commented 9 years ago

@olivernn Ghost uses SQLite3 by default, but also supports PostgreSQL and MySQL. As @ErisDS described we would need a solution that can cover all three databases.

kowsheek commented 9 years ago

:+1: on the ElasticSearch suggestion. It has a huge community which is super active. Lots of hosting support as well. ES would work nicely for other languages as well.

Ideally, ES should come packaged, but give the option to use a built-in node or one that's started elsewhere (or none at all).

Ditto what @leonkyr & @ErisDS say - especially since PostgreSQL has a pretty awesome FTS, the searching shouldn't be limited to ES, but it could be the default.

A good idea would be to implement a microservice for search that can (re)index documents and search them. Internally it could use ES or FTS. ES supports text highlighting and a lot of other things.
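
For example, a search request body with highlighting is as simple as this - the index name and field are made up:

```js
// Illustrative request body for POST /ghost/_search on an Elasticsearch node;
// the 'ghost' index and 'markdown' field are assumptions.
var body = {
    query: {
        match: { markdown: 'publishing platform' }
    },
    highlight: {
        fields: { markdown: {} }
    }
};
```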

ErisDS commented 9 years ago

> it's platform independent (you only need Java)

> Ideally, ES should come packaged

ElasticSearch would be awesome, but I'm really struggling to see how we could use it in Ghost in a way that didn't make it impossible to install?

CWSpear commented 9 years ago

@ErisDS I think your only option is Lunr (or something similar to it), because everything else at the very least requires a persistence layer (e.g. https://github.com/tj/reds) or a backing process (e.g. elasticsearch) (or both). Or they are specific to a particular database (https://dev.mysql.com/doc/refman/5.0/en/fulltext-natural-language.html).

So we need something that wraps around the different databases you support/plan on supporting (SQLite, Postgres, MySQL) with a common API (i.e. if Knex ends up getting full support for it...), or something that puts everything in memory like Lunr.

riyadhalnur commented 9 years ago

As good as ElasticSearch may be, IMHO I don't see it as a possible solution for Ghost. Having to install Java to run a small application is a big hassle, among other things.

Why couldn't we just use the indexing capabilities of the DBs Ghost supports and write a high-level API for it, so developers and users (there are some who will) can use it however they see fit? In other words, a polyglot framework. It's a lot of work, but the payoffs will be immense.

This post talks about how lunr.js can be integrated into a node app: http://matthewdaly.co.uk/blog/2015/04/18/how-i-added-search-to-my-site-with-lunr-dot-js/ - it looks a bit basic, but I guess for basic search in blogs it'll work just fine.

ErisDS commented 9 years ago

I agree - Java cannot be a dependency for Ghost.

I also agree about a high level API - the way I imagine it is that Ghost would offer a /api/search?q= endpoint from its API, but how that search gets performed would be configurable. Ideally, the underlying structure would have a default that is built into Ghost, and then be able to plug in ES or Solr or whatever takes your fancy using an app.
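
To make that concrete, here's a very rough sketch of how I imagine the pluggable layer fitting together - none of these module or function names exist yet, they're purely illustrative:

```js
// core/server/api/search.js - purely illustrative, nothing here exists yet.
var config = require('../config'),
    // default to the built-in provider; config or an app could swap in
    // 'elasticsearch', 'solr', etc.
    provider = require('../search/' + (config.search.provider || 'builtin'));

// Handles GET /api/search?q=...
function search(options) {
    return provider.query(options.q).then(function (results) {
        return { search: results };
    });
}

module.exports = search;
```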

What we need (and the intention of looking for a Search Lead) is to figure out how that will work. Firstly, which is the default? I think either lunr or FTS could be the default, but it needs someone familiar with implementing this sort of thing to do a bit of a test and spec it out. Secondly, how does the default work? Thirdly, what do the fine details of the API endpoint look like, and finally, how can an app hook in and override the search functionality whilst keeping the API the same?

riyadhalnur commented 9 years ago

ES and Lunr.js use JSON stores. I would see the system as generating JSON stores which an app can use (hook into) to power the API.

nagug commented 9 years ago

Just a wild thought... could the raw queries of knex be used to perform FTS? There would be some overhead in forming the query depending on the db used.

Not sure an app approach would help... an app approach might also introduce security issues. Thanks

codeincarnate commented 9 years ago

This thread has died down, but I wanted to hop in because it's crucial that Ghost gets this right. I'm going to talk about why, and then outline an approach that Ghost can take to avoid this.

I've implemented search on many Drupal based sites and it's easy to mess up. Way back, Drupal devs decided to implement a search solution on top of MySql and it was plagued with problems. The search results were ineffective and performance was terrible. It became common knowledge to simply disable it altogether. It was so bad that it was effectively unusable and the community adopted the use of Solr in droves.

This is definitely undesirable, and I believe this outlines the following requirements:

Backends

In regards to backends, there are a number that are available as a default choice:

Lunr.js is nice in that it has a relatively active community. However, Lunr has a couple of downsides. First, it's purely Javascript, which could negatively affect performance. Second, Lunr stores its index by serializing it to JSON. This would likely be slow, but also makes managing the index tricky.

Full Text Search (FTS) features of various databases could definitely be used and would likely have fairly solid performance. The downside of this is that it would be relatively difficult to code and maintain. It would effectively have to be implemented for MySql, Postgres, and Sqlite. Another more subtle difference is that this also means that the database begins storing metadata of the blog in addition to the blog data directly. This can make migrations and other tasks trickier since there's a portion of the database that could effectively be regenerated at any time.

Finally, search-index and levi. Both are implemented on top of LevelUp, which is a LevelDB interface. At first glance this would appear to introduce a node-gyp dependency; however, LevelUp has pluggable persistence. This means that a solution built on top of LevelUp could use something like MemDOWN by default, and switch to something like LevelDOWN as performance needs increase.
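
A minimal sketch of that swap, using the levelup API of the time (the location string is arbitrary):

```js
var levelup = require('levelup');
var memdown = require('memdown');

// Pure in-memory store - no native binaries to build.
var db = levelup('ghost-search', { db: memdown });

// For bigger installs the same code could point at the on-disk backend instead:
// var leveldown = require('leveldown');
// var db = levelup('content/data/ghost-search', { db: leveldown });
```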

It could also be possible to do some work to use node-pre-gyp for LevelDown so that there's no need for users to build binaries at all. In fact, this is already the case for Sqlite support in Ghost! This would also have the added advantage that all index data would be separate from the blog data itself.

Personally, I feel the LevelUp backed path is the way to go. It's an approach that allows for very straight-forward setup, is likely to have solid performance, is pretty full-featured, and avoids maintaining separate search backends (in the SQL FTS case).

Edit: Another nice feature of Levi is that it returns search results in streams which allows for all node stream related tools to be used for this.

Indexing

Thanks to bookshelf, it should be straightforward to hook into events that are occurring at the model layer. This is already being done and could be extended to meet indexing needs. There would likely need to be extra work to do index clearing and re-indexing as necessary.
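
Roughly speaking, the hook-up could look like this - searchIndex is an imaginary module and this isn't existing Ghost code, just a sketch of the bookshelf events involved:

```js
// Hypothetical sketch - searchIndex is imaginary, ghostBookshelf is Ghost's bookshelf instance.
var Post = ghostBookshelf.Model.extend({
    tableName: 'posts',

    initialize: function () {
        ghostBookshelf.Model.prototype.initialize.apply(this, arguments);

        // 'saved' fires for both create and update
        this.on('saved', function (model) {
            searchIndex.upsert(model.toJSON());
        });

        this.on('destroyed', function (model) {
            searchIndex.remove(model.get('uuid'));
        });
    }
});
```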

Pluggable Backends

I'm not aware of everything that goes into Ghost apps, but it would be a good avenue for other backends. I'm sure this would be important for Ghost Pro and larger blogs so that a solution like Solr or Elastic can be used in that case.

I'm part way through doing a very barebones implementation of the above. Regardless of direction, I'd love to hear some feedback and thoughts.

codeincarnate commented 9 years ago

An update to my last comment, I have a super, super basic implementation here. This doesn't have pluggable backends or anything, just some indexing and the ability to pass in queries at the moment.

CWSpear commented 9 years ago

@codeincarnate looks like that's a private repo (or maybe just a bad link?) as I got a 404 when trying to visit it.

codeincarnate commented 9 years ago

@CWSpear Missing an "s" at the end, fixed now!

gergelyke commented 9 years ago

@ErisDS any update on this?

codeincarnate commented 9 years ago

@ErisDS would love to have some feedback. My business partner is getting married this weekend, but after that I'm going to have more time to really take a crack at this.

sethbrasile commented 9 years ago

@codeincarnate if you get into this, feel free to throw some tasks my way if you'd like.

codeincarnate commented 9 years ago

@sethbrasile That's a very generous offer and I'll definitely do so. Hoping to get some feedback from @ErisDS before taking a deep dive into this.

ErisDS commented 9 years ago

@codeincarnate thanks for the enormous writeup, and apologies for taking a while to digest it.

It seems to me that, providing that we can have a non-binary/in-memory dependency as the default, this approach looks like it may be a great solution.

Having a node-pre-gyp wrapper for the binary version is very desirable, even if it's not the default. I would recommend that we avoid having another default dependency which requires a binary, even with node-pre-gyp, because it does still add overhead both to users installing and to developers (switching between node versions, which is becoming commonplace during testing, is harder with binary deps).

I'd be interested to see some tests to show at roughly what point in-memory search becomes untenable. E.g. how many 500 word posts is the limit, assuming a DO $5 droplet / 512MB ram? This could help us to provide useful documentation on how and when to upgrade to 'advanced' search, as well as perhaps even add warnings inside Ghost itself.
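
Even something as crude as this (entirely hypothetical helper, synthetic data) would give us a first data point:

```js
var lunr = require('lunr');

// Hypothetical memory probe: index N synthetic ~500-word posts and compare
// heap usage before and after. Run with --expose-gc for cleaner numbers.
function fakeBody() {
    return new Array(101).join('lorem ipsum dolor sit amet '); // ~500 words of filler
}

function measure(postCount) {
    var before = process.memoryUsage().heapUsed;

    var index = lunr(function () {
        this.ref('id');
        this.field('title');
        this.field('body');
    });

    for (var i = 0; i < postCount; i += 1) {
        index.add({ id: i, title: 'Post ' + i, body: fakeBody() });
    }

    if (global.gc) { global.gc(); }
    var after = process.memoryUsage().heapUsed;
    console.log(postCount + ' posts -> ~' + Math.round((after - before) / 1048576) + ' MB');
}

[100, 1000, 5000].forEach(measure);
```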

Not sure if you've seen this, but here's the documentation for how to configure a custom storage module, which is a similar 'pluggable' extension to Ghost: https://github.com/TryGhost/Ghost/wiki/Using-a-custom-storage-module. I'd imagine the extra config for search would be similar - an npm install & a config.js update. This could be done in a more managed way maybe using the app platform in future, but for the time being this approach works.

The only major question I'm left with atm is why levi over search-index?

jberkus commented 9 years ago

PostgreSQL Expert, JS neophyte here, just putting in:

For Postgres, at least, full text search can be added in ways which are invisible to the user, and which work fine with backup/restore. Further, if you only support simple boolean operations (i.e. AND/OR words) and don't support more complex operators, it's trivially easy to write a library which would convert user-obvious syntax into the syntax supported by each FTS engine. The one user-visible operation would be requiring the user to choose a language, as FTS dictionaries are language-specific, and the dictionary would become an optional installation requirement.
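
For reference, the 'invisible' setup boils down to a one-off migration along these lines - column names are assumed, and it's shown via knex.raw to keep it in Ghost's terms:

```js
// Hypothetical migration sketch: a tsvector column kept current by a trigger,
// plus a GIN index, so queries and backup/restore behave exactly as before.
knex.raw("ALTER TABLE posts ADD COLUMN search_vector tsvector")
    .then(function () {
        return knex.raw("CREATE INDEX posts_search_idx ON posts USING gin(search_vector)");
    })
    .then(function () {
        return knex.raw(
            "CREATE TRIGGER posts_search_update BEFORE INSERT OR UPDATE ON posts " +
            "FOR EACH ROW EXECUTE PROCEDURE " +
            "tsvector_update_trigger(search_vector, 'pg_catalog.english', title, markdown)"
        );
    });
```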

MySQL is more complicated, especially if you need to support core MySQL and not require MariaDB or Percona. Core MySQL, AFAIK, does not support FTS for InnoDB tables, forcing you to make a rather substantial user-visible change which affects reliability. This is why prior to Maria/Percona, most MySQL users used Sphinx search instead.

jberkus commented 9 years ago

The main reason to support SQL-based FTS would be in order to support large websites with several GB of text to search. Does anyone even use Ghost for that?

codeincarnate commented 9 years ago

@jberkus Awesome rundown in terms of Postgres. For the most part, many people are likely to be running Ghost on a small blog, and there's probably not going to be more than 100MB of text or so - even that is probably an over-estimation.

From the perspective of ghost, it can run on top of MySql, Postgres, or Sqlite. In terms of search, MySql is likely the one to hold it back. More than that however, dealing with a wide array of potential database configurations is not ideal for a default search function across small blogs.

As I mentioned above, this is the point of having search be a completely pluggable interface. If you're running Ghost on top of Postgres it makes a lot of sense to also have search in Postgres as well. For a service like Ghost Pro, running search on top of Elastic or Solr makes sense. For the easy case, something that is zero configuration and no worries is definitely for the best.

jberkus commented 9 years ago

Yeah, seems like a good approach. If you make it pluggable, I might write the PostgreSQL backend.

codeincarnate commented 9 years ago

@ErisDS thanks for the feedback!

I'm going to look into the image storage and how all that works. For reference, where is the code that implements that?

As far as levi vs search-index, I suppose I don't have a strong pull towards one or the other. I'd mostly be interested in testing both and seeing how they fare, both in terms of performance and in terms of search accuracy. They're quite similar, so hopefully some action would spur the community towards adopting a solid option.

At the moment, search-index doesn't have a license, so that will need to be clarified at least. Both of them also explicitly require LevelDOWN, so that will need to be changed as well.

I'm going to start breaking out some issues and moving forward with this.

CWSpear commented 9 years ago

I believe there's been talk of dropping Postgres support, by the way.

Also, I don't think it's been discussed, and it may not be viable for this use-case, but I think it's definitely worth mentioning: there are search-as-a-service options available, such as Algolia. While there are some obvious cons, a major pro would be that (if you had an account) it'd work out of the box on a vast array of systems running Ghost (since all of those difficulties are abstracted away by being taken care of off-site).

jberkus commented 9 years ago

@CWSpear

Yes, that's why I got involved.

CWSpear commented 9 years ago

I mean dropping Postgres altogether. No (official) support for Postgres anything for Ghost.

jberkus commented 9 years ago

Yes, I know, see #5878.

halfdan commented 9 years ago

@CWSpear The discussion is about dropping official support and leaving it up to the community to keep psql supported.

CWSpear commented 9 years ago

Alright, just making sure we were all aware =)

jberkus commented 9 years ago

<--- the community (plus, y'know, some other folks)

mike182uk commented 9 years ago

Hey guys, just thought I'd chime in with something I had been playing about with: https://github.com/mike182uk/Ghost/pull/1 (raised a PR on my own fork, just to make it easier to see what is going on and what changes I've made).

I have basically taken the same approach discussed above with having pluggable backends (strategies). My example shows how posts can be searched using Lunr as a strategy.

You can search for a post by adding the search param to the querystring

http://<base_url>/ghost/api/v0.1/posts/?search=<query>

The index is updated whenever you add, edit or delete a post.

The search strategy is configurable via the config, so this should make it easier to extend and add other search strategies (e.g. Algolia).

This is just a working PoC and would need more work (indexers etc.) but it may be of help to somebody :smiley:

codeincarnate commented 9 years ago

@mike182uk cool! I've taken a look at the code and it's broadly similar to what I've been doing. I'm not sure how good of an idea it is to call it a strategy, and there is definitely more that needs to be done in terms of loading the index on startup for instance. I'll look at it more in depth over the weekend.

jberkus commented 9 years ago

What will the search queries look like?

mike182uk commented 9 years ago

@jberkus with the approach I've taken, the main query is altered to use WHERE uuid IN (<uuid>, <uuid>, <uuid>). This means that an external system (e.g. lunr, Solr, Algolia) would return the UUIDs of documents that matched the query, and these are then passed through to the query that retrieves the data from the db.
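
i.e. the strategy only ever returns refs - the actual fetch stays in bookshelf/knex, roughly like this (hypothetical names, simplified):

```js
// Simplified / hypothetical: the strategy returns matching UUIDs,
// the normal DB query does the rest.
strategy.search(query).then(function (uuids) {
    return Post.forge().query(function (qb) {
        qb.whereIn('uuid', uuids);
    }).fetchAll();
});
```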

@codeincarnate I called them strategies as it's the strategy pattern that is being used

jberkus commented 9 years ago

@mike182uk I meant "what does the search query text look like"? Rather than the follow-up query.

jberkus commented 9 years ago

I'm thinking in terms of how I would write a search plugin, so I'm curious what the search syntax looks like.

mike182uk commented 9 years ago

@jberkus ah, I'm not sure - I don't think it's been mentioned here. My implementation just takes a string from the query string (http://<base_url>/ghost/api/v0.1/posts/?search=<query>)

@codeincarnate I've added an indexer to my implementation now if you want to check it out. This means that it will be up to the indexer to know how to index models. In the case of my Lunr.js example, when the indexModelCollection method of the indexer is run, the indexer could load a disk-saved Lunr.js index to save having to re-index everything when Ghost is loaded. I would imagine this sort of thing would be different between strategies.
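
For the lunr strategy that disk round-trip is pretty small - something like this (the file path is arbitrary, and 'index' is whatever the indexer has built):

```js
var fs = require('fs');
var lunr = require('lunr');
var indexPath = 'content/data/search-index.json'; // arbitrary location

// After indexing: persist the in-memory lunr index ('index') to disk.
fs.writeFileSync(indexPath, JSON.stringify(index));

// On startup: rehydrate it instead of re-indexing every post.
var restored = lunr.Index.load(JSON.parse(fs.readFileSync(indexPath, 'utf8')));
```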

How does everyone feel about the approach I've taken? I would love to get some feedback (@erisds maybe?). I'd be happy to put more time into this if I'm going in the right direction :smile:

If anybody wants to test out my approach, just fork my fork, or git checkout https://github.com/mike182uk/Ghost/pull/1. Once you have started ghost, the post models will be indexed and you can query them via the API:

http http://<base_url>/ghost/api/v0.1/posts/?search=<search_query> 'Authorization:Bearer <auth_token>'

Adding new posts, updating or deleting existing posts will update the index.

jberkus commented 9 years ago

Right, I'm asking: what goes in place of the placeholder search_query? What does that look like?

ErisDS commented 9 years ago

@codeincarnate code for storage modules lives here: https://github.com/TryGhost/Ghost/tree/master/core/server/storage. Issues that are worth reading for some context include #4600 & #2852 - I like the idea of trying out both levi and search-index to see which is best.

@mike182uk awesome that you have a similar implementation, I have taken a very quick look and it seems like all the pieces are plugging in in roughly the way I'd imagined. I'm interested to know why you picked lunr?

@jberkus I recommend taking a look at GQL: #5604 - which is used for filtering - for some inspiration for what a search query might look like. I'd imagine we'd want to do things a bit differently for search - but it's the only even related spec that exists in Ghost atm. In order to determine the actual syntax, we'd need to determine what the basic set of queries that most 'strategies' can support is, as well as consider how we could extend this out from posts to include more resources, and go from there to come up with a plan.

One big question I guess is what to do with the custom 'strategies' & how to manage configuring them - should it be exactly the same as 'storage' for now, e.g. use /content/search/? We don't have enough pieces to do this via an app just yet.

Do any of these modules which use memory do any sort of reporting on how much memory they are using? I'm wondering about what we can do in terms of managing the potentially enormous memory usage jump we're going to get.

jberkus commented 9 years ago

The GQL thing is cool, I hadn't seen that.

However, it doesn't define a syntax for full-text searches. Personally, I like to keep things simple, so I'd vote for a syntax which accepts:

 "brunch breakfast eggs"

which would be equivalent to:

 "brunch and breakfast and eggs"

REST searches would use this syntax:

 "brunch&breakfast&eggs"

Also support basic OR searches:

 "brunch or breakfast or eggs"
 "brunch|breakfast|eggs"

I don't think we want to support AND/OR groups, etc. At least not right away.
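
The conversion itself is tiny - a sketch, targeting Postgres tsquery operators here; other engines would just swap the glue strings:

```js
// Hypothetical converter from the user-facing syntax above to engine syntax.
// Words are ANDed by default; 'or' / '|' produce OR groups. No grouping/negation.
function toEngineQuery(input, ops) {
    ops = ops || { and: ' & ', or: ' | ' }; // Postgres tsquery operators

    return input
        .split(/\s+or\s+|\|/i)              // break into OR groups
        .map(function (group) {
            return group.trim().split(/\s+|&/).filter(Boolean).join(ops.and);
        })
        .join(ops.or);
}

toEngineQuery('brunch breakfast eggs');       // 'brunch & breakfast & eggs'
toEngineQuery('brunch or breakfast or eggs'); // 'brunch | breakfast | eggs'
```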

jpwynn commented 8 years ago

I'd like to make a Modest Proposal (in the Jonathan Swift tradition), acting on the assumption there's still no meaningful progress on this issue. It is just so unbelievably bad to not have any search, not even a dumb keyword lookup, that you might consider something like the following:

Let "making search better" become the oldest open issue if circumstances and complications dictate, because it's a lot better than having no search.

IMO, you are letting 'ideal' be the mortal enemy of 'adequate for most bloggers'.

jberkus commented 8 years ago

FWIW, "ILIKE" should be database-independent, at least for the two supported and one quasi-supported database.

mike182uk commented 8 years ago

I've been away and not had chance to look at this recently. 0.7.2 made some changes that have required me to change my implementation slightly. I have changed how I am filtering posts to use the new filter functionality - basically you still pass a search query along the query string when making the http request, but this now gets converted into a filter. If there is already an existing filter, it is appended on to this. I've also made it so the indexer configuration comes from the main config now. You can see my implementation here (feedback is welcome!)

@ErisDS as for why I chose lunr - it's by far the easiest solution to integrate and has the least dependencies. The main concern with lunr is the memory footprint, as the index is stored in memory. It's a bit hazy at what point that really becomes a problem (i.e. for the average blog I don't see this being a problem).

I was trying to tackle the problem with the more pragmatic approach of 'something is better than nothing' :) With the approach I have taken, it should be easy to write other strategies for other technologies without having to change the underlying implementation (it should just be a config change).

mike182uk commented 8 years ago

I've done a bit more work on my PoC (https://github.com/mike182uk/Ghost/pull/1) and added a strategy for using search-index instead of lunr (search-index is the one that uses leveldb). npm install installed everything necessary to get this working. If you want to test this out, once you have pulled in my latest changes, in config.js change search.strategy to search-index. The db is stored at content/data/ghost-search-index. This verifies that the solution is easy to extend for other search technologies :)
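
For reference, the relevant bit of config.js looks something like this - search.strategy is the key the PoC reads; the surrounding shape is illustrative:

```js
// config.js (per-environment section) - illustrative shape.
search: {
    strategy: 'search-index' // or 'lunr'; the index db lives at content/data/ghost-search-index
}
```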

I've changed my implementation slightly to now use promises when returning search results, and the search query parsing is now done as part of the pipeline.