Add full text search - Githubissues

KyleMaas commented 3 years ago

Adds full text search functionality to the upper right search box.

The way this works is that when the app is started, it creates the MiniSearch object but waits to index. A live query is run to watch for new messages and index them. But since indexing is a low-priority task, indexing of existing messages runs in batches of 1000. Old message indexing starts 3 seconds after page load, and then processes a batch of 1000 messages every 5 seconds.
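For readers following along, the batch schedule described above could be sketched roughly as follows. This is not the code from the pull request: `indexInBatches`, `indexBatch`, and the `schedule` option are hypothetical names chosen for illustration, and only the numbers (1000 messages, 3 seconds, 5 seconds) come from the description.

```javascript
// Hypothetical sketch: none of these names come from the pull request.
const BATCH_SIZE = 1000;        // messages per batch, as described above
const START_DELAY_MS = 3000;    // wait after page load before the first batch
const BATCH_INTERVAL_MS = 5000; // pause between batches

// Split `messages` into batches and hand each one to `indexBatch`,
// spacing the batches out with `schedule` (setTimeout by default).
function indexInBatches(messages, indexBatch, opts = {}) {
  const {
    batchSize = BATCH_SIZE,
    startDelay = START_DELAY_MS,
    interval = BATCH_INTERVAL_MS,
    schedule = setTimeout,
  } = opts;
  let offset = 0;

  function step() {
    const batch = messages.slice(offset, offset + batchSize);
    if (batch.length === 0) return; // nothing left to index
    indexBatch(batch);
    offset += batch.length;
    // Only schedule another round if there are messages remaining.
    if (offset < messages.length) schedule(step, interval);
  }

  schedule(step, startDelay);
}
```

In a real app `indexBatch` would presumably pass the batch to the search index; here it is left as a callback so the scheduling logic stands on its own.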

Fixes #164.

KyleMaas commented 3 years ago

Thinking if this works, I could probably also use MiniSearch to index mentions, which would make it so I could do blob searches (#124) that way instead of needing ssb-meme. And the fact that this is a low-priority background task doesn't matter, because blob search would be similarly low-priority.

arj03 commented 3 years ago

Please don't take this the wrong way. I'm trying to be constructive, but this feedback might sound a bit hard.

There are a few problems with this approach:

It indexes everything on every load, since there is no persistence
indexExistingPosts will index the same post again if you get new data while it is indexing
There doesn't seem to be a way to know how far along the indexing is. So if you get 0 results you don't know why

If you want to do something like a background task then maybe look into the https://gitlab.com/staltz/too-hot module. The persistence is really important, I'm not keen on adding stuff that slows down the app.

KyleMaas commented 3 years ago

I'll see what I can do. "Progress" without controversy is generally progress without thought, so I appreciate the feedback even if it's negative.

Most of the full text search systems I'm finding do support persistence. But what I really don't want to do is an index where it re-persists the entire full text search state every time it processes a message. I've had SD cards in Raspberry Pi machines ruined due to careless programs doing too many small writes/overwrites, and we don't need to be doing that to mobile flash storage. So...I'll see if I can find something that works.

KyleMaas commented 3 years ago

So, the more I look into solutions like what you're describing, the more problems I'm having with this. Introducing an index of everything in the database (the traditional persistent way to do this) introduces processing delays on initial sync as it's lexing the messages. I don't want to slow that down for something that's likely infrequently used. My goal here was not to parse every message in the database to search them, but only the recent ones - in this pull request, I stopped the search after a number of messages were processed. This is a low-priority feature which would be infrequently used. Valuable to have when needed, but not something to risk degrading a new user's experience for. Most of the solutions I'm coming up with for persistence either require periodic writing to disk (risking losing your place and either missing items or reprocessing them if the browser was refreshed), or writing to disk for every message, or periodically reprocessing the latest messages for a new index. Most of the full text searches have very slow removal rates, so keeping a sliding window of the latest X messages would be very expensive. And it means the persistent index storage wastes storage space and puts us closer to hitting browser limits on storage, particularly for a full text search where the index after lexing could potentially be huge. So I'd rather not persist a full text search index of everything and sliding window looks untenable. This, to me, is a very bad solution to accommodate a rare search.

So, as a possible alternate solution, I just posted a commit that makes it so it indexes on demand and changes it to an asynchronous search instead of synchronous, so it can display that it's searching while it's doing so. Initial index of the latest messages results in a delay of about 8 seconds for me, which means the first search is a little slow. But then searches after that are very fast. It keeps track of the most recent message it has indexed so that it does not index past it again, so there should be no reindexing. And for something as infrequently-used as I expect this to be, I don't have a problem with the first search taking a few seconds.
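A rough sketch of the on-demand idea, for illustration only: index lazily when a search is run, remember the newest message already indexed, and only index what came in after it. The toy in-memory index below stands in for MiniSearch, and `OnDemandSearch`, `fetchNewerThan`, and the message fields (`id`, `timestamp`, `text`) are hypothetical names, not taken from the commit.

```javascript
// Hypothetical sketch: a toy in-memory index stands in for MiniSearch, and
// `fetchNewerThan` stands in for whatever database query the app really uses.
class OnDemandSearch {
  constructor(fetchNewerThan) {
    this.fetchNewerThan = fetchNewerThan; // async (timestamp) => newer messages
    this.lastIndexed = 0;                 // timestamp of the newest indexed message
    this.terms = new Map();               // word -> Set of message ids
    this.indexedCount = 0;                // how many messages have been indexed
  }

  // Index only messages newer than the high-water mark, so nothing is
  // indexed twice. Assumes timestamps are strictly increasing.
  async catchUp() {
    const fresh = await this.fetchNewerThan(this.lastIndexed);
    for (const msg of fresh) {
      const words = msg.text.toLowerCase().split(/\W+/).filter(Boolean);
      for (const word of words) {
        if (!this.terms.has(word)) this.terms.set(word, new Set());
        this.terms.get(word).add(msg.id);
      }
      this.lastIndexed = Math.max(this.lastIndexed, msg.timestamp);
      this.indexedCount += 1;
    }
  }

  // Asynchronous search: the caller can show a "searching" state while awaiting.
  async search(query) {
    await this.catchUp();
    const hits = this.terms.get(query.toLowerCase());
    return hits ? [...hits] : [];
  }
}
```

The first call to `search` pays the full indexing cost; later calls only process whatever `fetchNewerThan` returns past the stored timestamp, which matches the slow-first-search, fast-afterwards behaviour described above.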

Would something more like this work?

arj03 commented 3 years ago

Cool. I like this latest approach much better. It would be good if it could display that it is only searching within the latest 10k messages. Also, if I read this correctly, it will only do something when you use it, right?

KyleMaas commented 3 years ago

I would agree. This is a better way to do it than my original method.

Correct. Only indexes messages when you actually run a search. And then it keeps the index around (in RAM - not wasting precious disk storage) for the next time you search. Future searches then only index additional messages which have come in since the most recent indexing operation.

I'll go ahead and add that number-of-messages info in a minute.

KyleMaas commented 3 years ago

That adds a message explaining the search, allows the user to configure how far back to index, and fixes a number of bugs.

arj03 commented 3 years ago

Thanks :)

arj03 commented 3 years ago

Been testing this. It's pretty cool to see results with a very intuitive search. It seems like it uses quite a bit of resources, though. On my desktop computer it freezes the window for like 10 seconds while it searches. I tried changing it down to search 5000 messages and only show 50 results, but it's still more or less the same.

arj03 commented 3 years ago

Also, the main dist JS file is now 6.9 MB. We should try and see if we can get that down a bit again. Would be nice, I think.

arj03 / ssb-browser-demo

Add full text search #208