StractOrg / stract

web search done right
https://stract.com
GNU Affero General Public License v3.0
2.13k stars, 48 forks

Repeated results from the same site. #139

Open eliasp opened 7 months ago

eliasp commented 7 months ago

When searching for keywords that appear in the documentation of a versioned library, application, or framework, I often see the same result over and over again, where the only difference is the documentation version. For example, see the screenshot below of a search for "kcl" "dict" "schema": [image]

The hard part might be detecting those as "same results, but only the most recent version/latest is relevant", and I lack the knowledge to suggest what to do about this implementation-wise. From a UX point of view, it would probably make sense to hide those duplicates behind a "Show more similar results" fold-out or similar.

Might be related to #51

eliasp commented 7 months ago

I just discovered the "Copycats removal" Optic, which helps somewhat here, but it also removes the original result for the latest version of the documentation and shows a completely different set of results instead.

mikkeldenker commented 7 months ago

There is currently some soft deduplication based on the URL, title and body. Essentially, if a result's title is very similar to the title of a result higher in the list, the lower result gets deprioritized a bit. I think that if we had more results in the index matching your search terms, it would look a bit better, as I am pretty sure the older versions would be deprioritized based on their title and body similarity with the top result. I agree it would probably be a good idea to hide very high similarity results behind some kind of button at the end of the search results.
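To make the title-based part of that idea concrete, here is a minimal sketch in Python. It is not Stract's actual implementation: the token-set Jaccard similarity, the `threshold` and `penalty` values, and the function names are all assumptions chosen for illustration, and it ignores the URL and body signals entirely.

```python
import re


def title_tokens(title):
    """Lowercased alphanumeric word tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def soft_dedup(results, threshold=0.8, penalty=0.5):
    """Deprioritize results whose title closely matches a higher-ranked one.

    `results` is a list of (title, score) pairs sorted by score, best first.
    `threshold` and `penalty` are hypothetical tuning knobs.
    """
    seen = []      # token sets of every result higher in the list
    rescored = []
    for title, score in results:
        tokens = title_tokens(title)
        if any(jaccard(tokens, prev) >= threshold for prev in seen):
            score *= penalty  # soft: the result is demoted, not removed
        seen.append(tokens)
        rescored.append((title, score))
    # sorted() is stable, so ties keep their original order
    return sorted(rescored, key=lambda r: r[1], reverse=True)


results = [("Schema | KCL", 10.0), ("Schema | KCL", 9.0), ("Dict tutorial", 8.0)]
reranked = soft_dedup(results)
```

In this toy input the second "Schema | KCL" entry is demoted below the unrelated result rather than dropped, which is what makes the deduplication "soft".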

Detecting which documentation page points to the latest version is a very interesting problem. Right now, the ranking would probably rely fully on the harmonic centrality values to figure it out, but we might need to write some custom logic here. I don't know yet what the best way to implement it would be.
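For readers unfamiliar with the term, the harmonic centrality of a page is the sum of 1/d(u, v) over every other page u that can reach it, where d(u, v) is the shortest link distance from u to the page v. The sketch below computes it with a breadth-first search over incoming links on a tiny made-up graph. The page names are hypothetical, and this is a textbook version for illustration, not how Stract computes it at web scale.

```python
from collections import deque


def harmonic_centrality(links):
    """Harmonic centrality for each node of a directed link graph.

    `links` maps a page to the list of pages it links to.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    # Reverse the edges so we can walk from a page back to its referrers.
    incoming = {n: [] for n in nodes}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    centrality = {}
    for target in nodes:
        dist = {target: 0}
        queue = deque([target])
        total = 0.0
        while queue:
            node = queue.popleft()
            for pred in incoming[node]:
                if pred not in dist:
                    dist[pred] = dist[node] + 1
                    total += 1.0 / dist[pred]  # closer referrers count more
                    queue.append(pred)
        centrality[target] = total
    return centrality


# Hypothetical graph: the latest docs are linked from the home page, which is
# itself linked from the blog; the old docs are linked only from an old post.
links = {
    "blog": ["home"],
    "home": ["docs/latest"],
    "old-post": ["docs/v0.7"],
}
scores = harmonic_centrality(links)
```

Here `docs/latest` scores 1.5 and `docs/v0.7` scores 1.0, so the better-linked version wins. Centrality only separates versions when the link structure differs, which is why custom logic may still be needed.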

Can you elaborate a bit on the optic problem? The "copycats removal" optic doesn't seem to remove the results from kcl-lang.io for me.