askorama / orama

🌌 Fast, dependency-free, full-text and vector search engine with typo tolerance, filters, facets, stemming, and more. Works with any JavaScript runtime, browser, server, service!
https://docs.askorama.ai

Extend Crawler queries by a custom "data-orama" attribute #722

Open fabiobiondi opened 1 month ago

fabiobiondi commented 1 month ago

Problem Description

We are trying the Crawler and we noticed that our Next.js 14 site is not being indexed.

The problem is probably that we have many nested components that render text inside <div> elements instead of <p> elements. I realize this is not ideal in terms of accessibility and semantics, but we have this need.

Looking at the source code (general-purpose.ts) we realized that the contents of the <div>s are totally ignored.

https://github.com/askorama/crawly/blob/2892e473775a408495d07a0dea016ec23a85d362/src/general-purpose.ts#L34-L51

In fact, @gioboa and I ran a test in which we modified your function to add <div>s to the query, but that also indexed noise and DOM elements that are not useful. So it doesn't seem like a decent solution.

Proposed Solution

We thought an interesting idea might be to let users decide what content to index outside of your rules.

A very simple hypothetical solution could be to add a data-orama attribute to the elements of the site that should be indexed, and to extend the crawler so that it also queries those elements.

<div data-orama> content </div>

I think it might be a simple, clean and powerful way to extend it.
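As a rough illustration of the idea, here is a minimal sketch of how an opt-in attribute could be merged into the crawler's element query. Everything in it is an assumption made for illustration: the DEFAULT_SELECTORS list, the buildContentSelector helper, and the customSelectors parameter are hypothetical names, not the actual contents or API of general-purpose.ts.

```typescript
// Hypothetical default selectors. The real list lives in general-purpose.ts
// and may differ from this one.
const DEFAULT_SELECTORS: readonly string[] = ["h1", "h2", "h3", "p", "li"];

// The opt-in attribute proposed in this issue.
const OPT_IN_ATTRIBUTE = "data-orama";

// Builds one CSS selector list that matches the default elements, any element
// carrying the opt-in attribute, and any extra user-supplied selectors.
// Hypothetical helper, not part of the crawler.
function buildContentSelector(
  defaults: readonly string[] = DEFAULT_SELECTORS,
  optInAttribute: string = OPT_IN_ATTRIBUTE,
  customSelectors: readonly string[] = [],
): string {
  const all = [...defaults, `[${optInAttribute}]`, ...customSelectors];
  return all
    .map((selector) => selector.trim())
    .filter((selector) => selector.length > 0)
    // Drop duplicates while keeping the first occurrence of each selector.
    .filter((selector, index, list) => list.indexOf(selector) === index)
    .join(", ");
}
```

In a browser or headless-browser context, the resulting string could be passed to document.querySelectorAll, so only <div>s explicitly marked with data-orama would be picked up, rather than every <div> on the page.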

What do you think?

Alternatives

Another future solution could be to allow users to completely customize the crawler function.

Additional Context

No response

SaraVieira commented 3 weeks ago

hey!

This is a great idea! I opened a PR on the repo to add custom selectors, and I will ping here once it is merged and these options are also available on the website.

gioboa commented 3 weeks ago

Thanks @SaraVieira