WorldBrain / Memex

Browser extension to curate, annotate, and discuss the most valuable content and ideas on the web. As individuals, teams and communities.

https://worldbrain.io

4.42k stars, 338 forks

Search epic #760

Open jonathan-s opened 5 years ago

jonathan-s commented 5 years ago

List of issues related to search.

[ ] Sorting by relevance #616
[ ] Not clearing when applying filters #605
[ ] Fuzzy matching / improving accuracy of search #595
[ ] Incorrect time filter for search #562

I would say improving the accuracy of search is really important. As an example, I've got this Wikipedia article in the db.

https://en.wikipedia.org/wiki/B_Corporation_(certification)

In Memex I search for "weighting search" and that article shows up. It has three mentions of "weighting" and no mention of "search" (does it pick up "search wikipedia"?). Either way, "weighting search" implies AND, not OR, so the article shouldn't show up in the first place.
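The AND behaviour described above can be sketched with a tiny inverted index. This is only an illustration in Python, not how Memex implements search; `tokenize`, `build_index`, `search_and` and the two sample documents are made up for the example.

```python
from collections import defaultdict
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index: dict[str, set[str]], query: str) -> set[str]:
    """Return only the documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    # One posting set per term; a term that was never indexed yields an empty set.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical documents: the first mentions "weighting" but never "search".
docs = {
    "b-corp": "The weighting of each section varies; weighting depends on company size.",
    "ranking": "Weighting search results by relevance.",
}
index = build_index(docs)
print(search_and(index, "weighting search"))
```

With AND semantics the first document drops out of the result because one of the two terms is missing from it, which is the behaviour asked for here.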

blackforestboi commented 5 years ago

Thanks @jonathan-s for collecting those search improvements and bundling them here.

Looking at the source code, among the theoretically visible text it picked up "jump to search".

We could indeed work on better ranking. We already collect some interaction metadata like visit frequency, time stayed and scroll %, and could already use that to improve some things.

Further, we are working on getting storex, the underlying storage layer, into the local system so we can make use of more sophisticated search tech, maybe not even based on JS.
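As a rough idea of how those three signals could feed into ranking, here is a small Python sketch. The field names, the caps (50 visits, 600 seconds) and the weights are assumptions made up for illustration, not Memex's actual data model or tuning.

```python
from dataclasses import dataclass
import math


@dataclass
class PageStats:
    """Hypothetical per-page interaction metadata (field names are assumptions)."""
    visits: int             # how often the page was visited
    seconds_stayed: float   # total time spent on the page
    scroll_fraction: float  # deepest scroll position, 0.0 to 1.0


def interaction_score(stats: PageStats) -> float:
    """Combine the three signals into a single score between 0 and 1."""
    # Log-scale counts and durations so a few very heavy pages do not dominate.
    visit_part = min(math.log1p(stats.visits) / math.log1p(50), 1.0)
    time_part = min(math.log1p(stats.seconds_stayed) / math.log1p(600), 1.0)
    scroll_part = max(0.0, min(stats.scroll_fraction, 1.0))
    # Arbitrary illustrative weights; they would need tuning on real data.
    return 0.4 * visit_part + 0.3 * time_part + 0.3 * scroll_part


def rerank(matches: dict[str, PageStats]) -> list[str]:
    """Order pages that already matched the query from most to least engaged."""
    return sorted(matches, key=lambda url: interaction_score(matches[url]), reverse=True)


matches = {
    "skimmed": PageStats(visits=1, seconds_stayed=5, scroll_fraction=0.1),
    "studied": PageStats(visits=20, seconds_stayed=300, scroll_fraction=0.9),
}
print(rerank(matches))
```

A score like this would only reorder pages that already matched the text query; it does not decide whether a page matches.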

Do you have any ideas for quick improvements we could add that would drastically improve the results for your search queries?

jonathan-s commented 5 years ago

I haven't taken a deep dive into the code. But I wouldn't be too surprised if you could drastically improve accuracy by implementing some weighting in the search. A keyword that is mentioned more often could, for instance, signify that it is a more relevant term for the document.

Also, it does not seem to exclude documents that don't contain a keyword when you search for "term1 term2". I.e. document 1 contains both terms, document 2 contains only one of the terms. Therefore document 2 should be excluded.
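One simple form of such weighting is term frequency normalised by document length. The Python sketch below is only an illustration of that idea under made-up names and sample documents; it is not taken from the Memex code.

```python
from collections import Counter
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequency_score(text: str, query: str) -> float:
    """Share of the document's words that are query terms."""
    words = tokenize(text)
    if not words:
        return 0.0
    counts = Counter(words)
    # Counter returns 0 for terms that never occur in the document.
    return sum(counts[term] for term in set(tokenize(query))) / len(words)


def rank(docs: dict[str, str], query: str) -> list[str]:
    """Return matching document ids, most relevant first."""
    scored = {doc_id: term_frequency_score(text, query) for doc_id, text in docs.items()}
    matching = [doc_id for doc_id, score in scored.items() if score > 0]
    return sorted(matching, key=lambda doc_id: scored[doc_id], reverse=True)


# Hypothetical documents for illustration.
docs = {
    "farm": "cow cow milk cow farm",
    "thread": "a long thread about startup funding with one cow",
    "other": "nothing relevant here",
}
print(rank(docs, "cow"))
```

A document where the term makes up a large share of the text ranks above one that mentions it once in passing. More established schemes such as TF-IDF or BM25 refine the same idea.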

blackforestboi commented 5 years ago

"Also, it does not seem to exclude documents that don't contain a..."

It should, though. In cases like the one in the OP, it's because the keyword is in the HTML but hidden in the initial view. Such terms get indexed too and then cause some noise. Or do you have other instances where that is not the case?
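To illustrate the kind of noise meant here, the Python sketch below extracts text while skipping elements that are hidden through the `hidden` attribute or an inline `display: none`, plus script and style content. It is a simplified illustration that assumes well-formed markup. It cannot see content hidden through CSS classes or stylesheets, which would need a rendered DOM. The class name and sample HTML are made up.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "noscript", "template"}
# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not inside a skipped or inline-hidden element."""

    def __init__(self) -> None:
        super().__init__()
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth > 0:
            self.hidden_depth += 1  # nested element inside a hidden subtree
            return
        attr_map = dict(attrs)
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        if tag in SKIPPED_TAGS or "hidden" in attr_map or "display:none" in style:
            self.hidden_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text a reader would plausibly see, joined by single spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = ('<body><a style="display: none">Jump to search</a>'
          '<p>The weighting of each <b>section</b> varies.</p>'
          '<script>var cow = 1;</script></body>')
print(visible_text(sample))
```

Indexing only the output of something like `visible_text` would keep terms such as "search" in a hidden skip link out of the index for this sample.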

jonathan-s commented 5 years ago

Another example of weird indexing, which then seems to get reflected in search: https://twitter.com/41Strange/status/1070040073482067969. For some reason "cow" is being indexed for that link. I can't even find it in the source.

On the same theme: https://news.ycombinator.com/item?id=18128477 has lots of content. The word "cow" occurs exactly once in this text. Perhaps it should not be indexed? Either way, not all words on that page carry equal weight. "cow" is certainly not at the top of the list of words I would associate with that page.
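One possible reading of "perhaps it should not be indexed" is to prune rare terms at index time on long pages. The Python sketch below shows that idea only; the function name and both thresholds are arbitrary assumptions, and it ignores stop words, which a real indexer would also have to handle.

```python
from collections import Counter
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def terms_to_index(text: str, min_count: int = 2, short_doc_words: int = 50) -> set[str]:
    """Pick the terms worth indexing for one page.

    Short pages keep every term, since almost every word occurs only once there.
    Longer pages keep only terms that occur at least `min_count` times.
    """
    words = tokenize(text)
    counts = Counter(words)
    if len(words) < short_doc_words:
        return set(counts)
    return {term for term, n in counts.items() if n >= min_count}


# Hypothetical long page: two recurring topics and a single stray "cow".
long_page = "startup funding " * 30 + "cow"
print(terms_to_index(long_page))
print(terms_to_index("cow in a field"))
```

The trade-off is recall: a page can no longer be found through a word it mentions only once, so keeping every term and down-weighting rare ones at query time may be the safer option.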

cktang88 commented 4 years ago

Just wondering, have there been any updates to the ranking/searching/indexing accuracy?

blackforestboi commented 4 years ago

No. We actually decided recently to pivot on this feature a bit, to postpone the need for structural improvements. Search is quite a heavy lift for the application and makes many other things more difficult: backup & sync size/performance/reliability, search performance, running costs.

We realised that search is not the main value proposition with which we can really move the needle for the company to become profitable/sustainable. That is sharing/collaborating and integrating into existing workflows with tools like Roam, Notion, Evernote etc. Search will serve that purpose and will still exist, but in its current form it has consumed too much of our resources over the past 2 years to be viable as the main value proposition.

Our decision, therefore, was to limit search, for the time being and until we have the resources, to pages that have been actively bookmarked, tagged, listed or annotated.
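The rule described above amounts to a simple eligibility check per page. A minimal Python sketch, assuming a hypothetical `Page` record whose field names are invented for illustration and do not reflect Memex's storage schema:

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical page record (field names are assumptions)."""
    url: str
    bookmarked: bool = False
    tags: set[str] = field(default_factory=set)
    lists: set[str] = field(default_factory=set)
    annotation_count: int = 0


def is_searchable(page: Page) -> bool:
    """A page is indexed only if the user actively did something with it."""
    return (
        page.bookmarked
        or bool(page.tags)
        or bool(page.lists)
        or page.annotation_count > 0
    )


def searchable_pages(pages: list[Page]) -> list[Page]:
    """Filter the visit history down to the pages that should be searchable."""
    return [page for page in pages if is_searchable(page)]


history = [
    Page("https://example.com/visited-only"),
    Page("https://example.com/tagged", tags={"search"}),
    Page("https://example.com/annotated", annotation_count=2),
]
print([page.url for page in searchable_pages(history)])
```

Pages that were merely visited fall out of the index, which is what reduces the backup, sync and storage load mentioned in this comment.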