external search index

SoniEx2 commented 1 year ago

we'd like to be able to get an external search index instead of tying it entirely to the publ database. because we don't like crawlers and we want to share the index with other ppl who don't like crawlers so we can all benefit.

fluffy-critter commented 1 year ago

I am super unclear on how that would work. Could you propose a technical solution? In particular, how do you propose that the search indices be shared, especially in a way which doesn’t require essentially providing an entire site mirror and in a way which preserves privacy on restricted content?

I’m not sure if this would belong in Publ directly, rather than being some sort of indie search protocol people can submit their stuff to. For example, https://indieweb-search.jamesg.blog/ (which uses a crawler but it’s restricted to IndieWeb sites in some way that is unclear to me).

SoniEx2 commented 1 year ago

we don't have a spec, we only really have a goal. and obviously the exported index should not include restricted content.

but we extremely do not want crawlers. why can't we submit the processed search index (ngrams? trigrams?) and let the neighbors (as we're calling them - as in "neighborhood search", tho we wouldn't be opposed to "webring search" either) figure out how to pull that into their own search boxes?
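
To make the "submit the processed index and let neighbors pull it" idea concrete, here is a minimal sketch in Python. Nothing in it is a spec or part of Publ: the entry dictionaries, the `restricted` flag, the JSON layout, and every function name are hypothetical, chosen only for illustration. It builds a character-trigram index from public entries, exports it as JSON, and merges a neighbor's export into a local index.

```python
import json
from collections import defaultdict


def trigrams(text):
    """Return the set of lowercase character trigrams in text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def build_index(entries):
    """Map each trigram to the sorted URLs of public entries containing it."""
    index = defaultdict(set)
    for entry in entries:
        if entry.get("restricted"):
            continue  # restricted content never reaches the exported index
        for gram in trigrams(entry["title"] + " " + entry["text"]):
            index[gram].add(entry["url"])
    return {gram: sorted(urls) for gram, urls in index.items()}


def export_index(index):
    """Serialize the index to JSON (hypothetical format, not a standard)."""
    return json.dumps({"version": 1, "trigrams": index}, sort_keys=True)


def merge_index(local, exported_json):
    """Merge a neighbor's exported index into a copy of the local one."""
    remote = json.loads(exported_json)["trigrams"]
    merged = {gram: set(urls) for gram, urls in local.items()}
    for gram, urls in remote.items():
        merged.setdefault(gram, set()).update(urls)
    return {gram: sorted(urls) for gram, urls in merged.items()}


def search(index, query):
    """Return URLs that contain every trigram of the query."""
    grams = trigrams(query)
    if not grams:
        return []
    results = None
    for gram in grams:
        urls = set(index.get(gram, ()))
        results = urls if results is None else results & urls
    return sorted(results)
```

A trigram match only yields candidate pages, since the trigrams can occur in a page without the full query appearing in it, and the puller has to trust that the exported index is truthful. The export also reveals which trigrams occur on which public page, so it is lighter than a site mirror but not free of content.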

crawlers are a waste of time and energy. sure you don't know whether the index is truthful but that's only a problem if you don't trust the website. which is only an issue if you're operating google. this does not apply to allowing friends to pull your index.

crawlers are the bitcoin of search. they should not have been tolerated for as long as they have. thankfully that has been changing lately, especially with the advent of large-scale data appropriation engines ("AI").

TL;DR: we have come to hate crawlers. we don't have a solution but we'd love to experiment with possible solutions. the exported index should not include restricted content tho.

fluffy-critter commented 1 year ago

At present, Publ uses whoosh for its search indexing, and I don't think that the data it exports would be portable between systems. I also feel like this might be outside the scope of Publ. If there's ever a project which allows people to submit their search data directly for inclusion into a wider index, it would be possible to make some sort of mechanism for Publ to export into that, but at present I'm just not clear on why this would be something that belongs within Publ itself.

fluffy-critter commented 1 year ago

Closing per the discussion at https://chat.indieweb.org/dev/2023-08-21#t1692638758764400 - while the idea of having independent mutually-hosted search has a lot of merit, I don't think it's the sort of thing that belongs in Publ.

PlaidWeb / Publ

external search index #544