biolds / sosse

Selenium Open Source Search Engine & crawler
GNU Affero General Public License v3.0

Feature Request - Exclude Documents from Search Results #5

Open JGtHb opened 3 weeks ago

JGtHb commented 3 weeks ago

As a user, I sometimes need to crawl pages that are required to map an entire site, but the indexed documents are not relevant for searching and should be excluded from search results.

Below are two initial approach ideas:

Search Parameter Defaults

  1. In the Administration screen, a new button titled 'Default Search Params' could be added. This could link to a new page that allowed users to specify parameters enabled by default for all searches.
  2. Optionally, users could set per-user default search parameters. I work in a single-user environment and am not familiar with how multi-user setups work. Admins would need to set the default params for unauthenticated searches, if those are allowed.
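
As a rough illustration of the precedence this idea implies, here is a minimal Python sketch. It assumes three layers (admin-defined site-wide defaults, optional per-user defaults, and the parameters of the individual search), with later layers overriding earlier ones. `SearchDefaults`, `effective_params`, and the parameter names `exclude_url` and `lang` are hypothetical and not part of SOSSE.

```python
from dataclasses import dataclass, field


@dataclass
class SearchDefaults:
    """Hypothetical container for default search parameters (not SOSSE's API)."""

    # Admin-defined defaults, also used for unauthenticated searches.
    site_wide: dict = field(default_factory=dict)
    # Optional per-user defaults, keyed by username.
    per_user: dict = field(default_factory=dict)

    def effective_params(self, request_params, user=None):
        """Merge defaults with the parameters of a single search.

        Precedence, lowest to highest: site-wide, per-user, request.
        """
        merged = dict(self.site_wide)
        if user is not None:
            merged.update(self.per_user.get(user, {}))
        # Parameters typed into the search form win, so a user can
        # temporarily bring an excluded page back into the results.
        merged.update(request_params)
        return merged
```

With this ordering, overriding a default for one search only requires passing a different value for the same parameter in that request.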

Advantage: Simple to implement, and lets users quickly adjust parameters when searching if they want to temporarily add a page to search results. Disadvantage: Time-consuming to exclude new pages from search results or to do ad-hoc page exclusions.

Document-Level Setting

  1. In the Crawl Policies > Main page, a box for 'Exclude from Search Regex' could be added. In this box, admins could specify a regex that, when matched, would mark the document as 'Excluded from Search' when it is crawled.
  2. In the Documents > [Selected Document] > Main page, a field titled 'Excluded from Search' would be displayed with a true/false value and a button to toggle the saved value.
  3. In the Documents page, users could select existing documents and initiate an 'Exclude from Search' action. This would mark all selected documents as excluded from search, and the documents would not be returned when searching. A filter button on the right-hand side of the screen would allow users to quickly see documents included or excluded from search results for all users.
  4. Optionally, authenticated admins could see a button inline with search results (next to 'Cached') called 'Exclude from Search' to quickly remove a document from future searches.
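
The core of this approach can be sketched in plain Python as a per-document flag that is set at crawl time and filtered on at search time. This is only an illustration under stated assumptions: the regex is assumed to be matched against the document URL, the substring match stands in for the real full-text search, and `Document`, `CrawlPolicy`, `excluded_from_search`, and `search` are hypothetical names, not SOSSE's actual models.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    # Hypothetical 'Excluded from Search' flag; can be toggled manually.
    excluded_from_search: bool = False


@dataclass
class CrawlPolicy:
    # Hypothetical 'Exclude from Search Regex' field; empty disables it.
    exclude_from_search_regex: str = ""

    def mark(self, doc):
        """Flag the document at crawl time if its URL matches the regex."""
        pattern = self.exclude_from_search_regex
        if pattern and re.search(pattern, doc.url):
            doc.excluded_from_search = True
        return doc


def search(documents, query):
    """Return matching documents, skipping those flagged as excluded.

    The substring test on the URL is a stand-in for real full-text search.
    """
    return [
        d for d in documents
        if not d.excluded_from_search and query in d.url
    ]
```

Because the flag is stored on the document rather than recomputed from the regex, toggling it by hand (items 2 to 4) works for documents that were crawled before the regex was configured.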

Advantage: Allows easy removal of documents that have already been crawled, and configuration of future exclusions when setting up a crawl job. Disadvantage: More time-consuming to re-add a page to search results if it was incorrectly excluded. Likely more complex to implement.

biolds commented 3 weeks ago

Thanks for the high-quality ticket and suggestions. I like the document-level approach: it's feature-rich and not too hard to implement. I'll look into it when I find the time.