KjetilIN / rs-search-engine

A custom search engine built with Rust. It parses HTML files and utilizes TF-IDF scoring to rank document relevance based on search queries. The project includes a Rust-based backend server and vanilla HTML/CSS for the web frontend.
MIT License

Rust Search Engine

Created by Kjetil Indrehus



This project is a custom search engine built with Rust, designed specifically for books from gutenberg.org. It parses downloaded HTML files and leverages TF-IDF (Term Frequency-Inverse Document Frequency) scoring to rank the relevance of documents, providing accurate and efficient search results. The search engine is composed of two main components: the backend server and the frontend interface. The backend server, implemented in Rust, processes the HTML files, tokenizes their content, and calculates TF-IDF scores to determine the relevance of documents for a given search term. The frontend interface lets users enter search queries and view the ranked results.
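To make the scoring concrete: for a term t and document d, tf(t, d) is the number of times t occurs in d divided by the document's length, idf(t) down-weights terms that appear in many of the N indexed documents, and the final score is their product. The following is a minimal, illustrative Rust sketch, not the project's actual code; it uses a smoothed idf variant, ln((1 + N) / (1 + df)) + 1, to avoid division by zero:

fn tokenize(text: &str) -> Vec<String> {
    // Lowercase the text and split on any non-alphanumeric character.
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

// Term frequency: occurrences of `term` in the document, normalized by length.
fn tf(term: &str, doc: &[String]) -> f64 {
    let count = doc.iter().filter(|t| t.as_str() == term).count();
    count as f64 / doc.len() as f64
}

// Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
fn idf(term: &str, docs: &[Vec<String>]) -> f64 {
    let df = docs.iter().filter(|d| d.iter().any(|t| t.as_str() == term)).count();
    ((1.0 + docs.len() as f64) / (1.0 + df as f64)).ln() + 1.0
}

fn main() {
    let corpus = [
        "The calculus of logic",
        "Passages from the life of a philosopher",
        "Logic and the life of reason",
    ];
    let docs: Vec<Vec<String>> = corpus.iter().map(|text| tokenize(text)).collect();

    // Score every document for the query term "logic"; higher is more relevant.
    for (i, doc) in docs.iter().enumerate() {
        let score = tf("logic", doc) * idf("logic", &docs);
        println!("doc {i}: {score:.5}");
    }
}

The indexing commands described below precompute statistics like these, so answering a query only requires scoring against the stored index.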


Key Features

  - TF-IDF scoring of documents against search terms
  - HTML parsing and tokenization of downloaded pages
  - Rust backend server exposing a simple search API
  - Vanilla HTML/CSS frontend for entering queries and viewing ranked results
  - Docker Compose setup for running the full application

Run Locally

To run the web server:

cargo run serve

To index the HTML files and save the index to a file:

cargo run index file

To load and view the indexed files:

cargo run load

NOTE: Set the domain variable in ./frontend/script.js to 0.0.0.0:8080.

Run with Docker

Running the application with Docker is simple:

docker compose up --build 

(add the -d option to run detached)


Read more about Docker Compose at https://docs.docker.com/compose/.

NOTE: Set the domain variable in ./frontend/script.js to localhost:8080.

Setup

There are two options for setting up the project.

  1. Use my files as documents (easiest)
  2. Setup your own search engine files

1. Use my files as documents

  1. Extract the pages directory locally:
    tar -xvf ./cache/pages.tar.gz -C .
  2. Re-index the documents
    cargo run parse file 
  3. Start the HTTP server locally
    cargo run serve 

2. Setup your own search engine files

  1. Create a list of URLs that you want to index. Each URL must point to an HTML file from the www.gutenberg.org website. Store one entry per line in ./cache/urls.txt, with the URL and title separated by a semicolon. For example:
    https://www.gutenberg.org/cache/epub/57532/pg57532-images.html ; Passages from the Life of a Philosopher
    https://www.gutenberg.org/cache/epub/69512/pg69512-images.html ; The calculus of logic
    https://www.gutenberg.org/cache/epub/55280/pg55280-images.html ; An Enquiry into the Life and Legend of Michael Scot
    ....
  2. Create a pages directory
    mkdir -p ./pages/
  3. Download each HTML file and name it file<INDEX>.html, where INDEX is the line number of the corresponding URL in ./cache/urls.txt. Store each file in the ./pages/ directory (a download sketch is shown after this list).
  4. Re-index the documents
    cargo run parse file 
  5. Start the HTTP server locally
    cargo run serve 
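
Step 3 can be scripted. Here is a minimal sketch (illustrative only, not part of the project) that assumes the reqwest crate with its blocking feature and 1-based line numbers in ./cache/urls.txt:

// Assumes reqwest = { version = "0.12", features = ["blocking"] } in Cargo.toml.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    std::fs::create_dir_all("./pages")?;
    let urls = std::fs::read_to_string("./cache/urls.txt")?;
    for (i, line) in urls.lines().enumerate() {
        // Each line is "<url> ; <title>"; only the URL part is needed here.
        let url = line.split(';').next().unwrap_or("").trim();
        if url.is_empty() {
            continue;
        }
        let body = reqwest::blocking::get(url)?.text()?;
        // Line numbers are 1-based, so the first URL becomes file1.html.
        std::fs::write(format!("./pages/file{}.html", i + 1), body)?;
    }
    Ok(())
}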

Search API

Searching is done by sending a POST request to the backend, with the search query as plain text in the request body.

Endpoint

POST /api/search
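
For example, with the server running on localhost:8080 (as in the Docker setup above), a query could be sent like this:

curl -X POST -H "Content-Type: text/plain" -d "calculus" http://localhost:8080/api/search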

Response

The following is a sample JSON response:


{
    "results": [
        {
            "url": "https://example.com",
            "title": "Example Domain",
            "tf_idf_score": 0.00234 
        },
        {
            "url": "https://anotherexample.com",
            "title": "Another Example",
            "tf_idf_score": 0.00234 
        }
    ]
}
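
A Rust client could deserialize this shape with serde and serde_json. In the sketch below, the struct names are illustrative and only the field names come from the sample above (assumes serde with its derive feature):

use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct SearchResponse {
    results: Vec<SearchResult>,
}

#[derive(Deserialize, Debug)]
struct SearchResult {
    url: String,
    title: String,
    tf_idf_score: f64,
}

fn main() -> serde_json::Result<()> {
    let body = r#"{"results":[{"url":"https://example.com","title":"Example Domain","tf_idf_score":0.00234}]}"#;
    let response: SearchResponse = serde_json::from_str(body)?;
    // Each result carries the url, title, and tf_idf_score from the response.
    println!("{:#?}", response.results);
    Ok(())
}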

Resources

Term Frequency–Inverse Document Frequency (tf-idf)
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/
