Sloth Search - A Google-like Search Engine Clone
Sloth Search is a project that aims to recreate a Google-style search engine, covering crawling, indexing, and serving results through a user-friendly front-end interface. The project consists of three main components: Client, Search, and Server.
Check out the video for a full explanation here.
Project Structure
The project is divided into the following folders:
- Client: Contains the front-end code, providing a user interface similar to Google search, where users can enter queries and view search results.
- Search: Contains the core components of Sloth Search, which replicate the three main parts of Google:
  - Crawling: The web crawler that collects links and content from the web.
  - Indexing: Processing and storing the content collected by the crawler for efficient searching.
  - Serving (PageRank): Ranking and serving search results based on relevance and the PageRank algorithm.
- Server: Contains the search API used to handle client requests and provide search results.
Installation and Setup
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd sloth-search
  ```

- Install the necessary Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Client Setup
  - The client contains the HTML, CSS, and JavaScript code for the front-end.
  - Open the `index.html` file in your browser, or use a static file server to serve the client code locally (see the example below).
  - You can also use a live-server extension in your editor.
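  One convenient way to serve the client locally, assuming Python 3.7+ (for the `--directory` flag), is the standard library's built-in static file server; the folder name `Client` is taken from the project structure above:

  ```bash
  # Serve the Client folder at http://localhost:8000
  python -m http.server 8000 --directory Client
  ```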
- Search Setup
  - Run the search component you want directly with Python:

    ```bash
    python search/<path to file you want to run>
    ```
How It Works
- Crawling
  - The crawler starts with a set of seed URLs and collects links and content from the web.
  - It respects `robots.txt` to avoid being blocked and to ensure ethical crawling.
  - Parsed data is stored in a format ready for indexing. A minimal sketch of this loop follows.
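  As a rough illustration of this stage (not the project's actual crawler code), here is a standard-library sketch of a breadth-first crawl that checks `robots.txt` before each fetch; the bot name, page limit, and regex-based link extraction are simplifications made up for the example:

  ```python
  # Minimal crawling sketch (illustrative, not the project's actual crawler):
  # breadth-first crawl from seed URLs, checking robots.txt before each fetch.
  from collections import deque
  from urllib import robotparser
  from urllib.parse import urljoin, urlparse
  from urllib.request import urlopen
  import re

  USER_AGENT = "SlothSearchBot"  # hypothetical bot name for the example

  def allowed(url):
      """Consult the site's robots.txt before fetching the URL."""
      root = "{0.scheme}://{0.netloc}".format(urlparse(url))
      parser = robotparser.RobotFileParser(root + "/robots.txt")
      try:
          parser.read()
      except OSError:
          return False  # robots.txt unreachable: err on the side of not crawling
      return parser.can_fetch(USER_AGENT, url)

  def crawl(seed_urls, max_pages=10):
      """Return {url: html} for up to max_pages pages reachable from the seeds."""
      seen = set(seed_urls)
      queue = deque(seed_urls)
      pages = {}
      while queue and len(pages) < max_pages:
          url = queue.popleft()
          if not allowed(url):
              continue
          try:
              html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
          except OSError:
              continue
          pages[url] = html  # raw content, handed to the indexing step later
          # Naive regex link extraction keeps the sketch dependency-free.
          for href in re.findall(r'href="([^"#]+)"', html):
              link = urljoin(url, href)
              if link.startswith("http") and link not in seen:
                  seen.add(link)
                  queue.append(link)
      return pages
  ```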
- Indexing
  - The indexing module processes the crawled pages.
  - The content is tokenized, cleaned, stemmed, and stripped of stop words using the NLTK library.
  - The resulting indexed data is saved for use by the search API. A sketch of this step follows.
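  Here is a minimal sketch of that pipeline, assuming NLTK is installed via `requirements.txt`; the function and variable names are made up for the example:

  ```python
  # Minimal indexing sketch (illustrative, not the project's exact pipeline):
  # lowercase and tokenize each page, drop stop words, stem with NLTK, and
  # build an inverted index mapping each stemmed term to the URLs containing it.
  from collections import defaultdict

  import nltk
  from nltk.corpus import stopwords
  from nltk.stem import PorterStemmer
  from nltk.tokenize import word_tokenize

  nltk.download("punkt", quiet=True)      # tokenizer models
  nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK releases
  nltk.download("stopwords", quiet=True)  # stop-word lists

  STOP_WORDS = set(stopwords.words("english"))
  stem = PorterStemmer().stem

  def build_inverted_index(pages):
      """pages: {url: text} -> inverted index {stemmed term: set of urls}."""
      inverted = defaultdict(set)
      for url, text in pages.items():
          for token in word_tokenize(text.lower()):
              if token.isalpha() and token not in STOP_WORDS:
                  inverted[stem(token)].add(url)
      return inverted

  # Example:
  # index = build_inverted_index({"http://example.com": "Sloths search slowly."})
  # index["sloth"] -> {"http://example.com"}
  ```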
- Serving and PageRank
  - The PageRank algorithm ranks pages by importance, as measured by the links pointing to them.
  - When a user submits a query through the client, the server combines the indexed data with PageRank scores to return the most relevant pages. A sketch of the iterative ranking step follows.
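  The following is a minimal sketch of the iterative update described above, not the project's exact implementation; the damping factor of 0.85 and the iteration count are conventional defaults chosen for the example:

  ```python
  # Minimal iterative PageRank sketch (illustrative): given a link graph
  # {url: [urls it links to]}, repeatedly redistribute rank with damping.
  def pagerank(links, damping=0.85, iterations=50):
      pages = list(links)
      n = len(pages)
      ranks = {page: 1.0 / n for page in pages}  # start with a uniform rank
      for _ in range(iterations):
          new_ranks = {}
          for page in pages:
              # Each page that links to `page` shares its rank equally
              # among its outgoing links.
              incoming = sum(
                  ranks[other] / len(links[other])
                  for other in pages
                  if page in links[other]
              )
              new_ranks[page] = (1 - damping) / n + damping * incoming
          ranks = new_ranks
      return ranks

  # Example: b and c both link to a, so a ends up with the highest rank.
  # pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
  ```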
Important Notes
- Respecting Websites: The crawler respects `robots.txt` rules. Please make sure not to overload any websites.
- PageRank Algorithm: The PageRank implementation uses an iterative approach to rank pages based on their link structure.
- Data Storage: The crawler and indexer use CSV files for data storage (`advanced_pagerank_inverted_index.csv` and `advanced_pagerank.csv`). Make sure these files are writable during execution. A sketch of loading the index at query time follows.
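  If you want to inspect or reuse these files, a minimal loader might look like the sketch below; note that the column layout it assumes (term, space-separated URLs) is hypothetical and should be checked against the indexer's actual output:

  ```python
  # Sketch of loading the inverted-index CSV at query time. The two-column
  # layout assumed here (term, space-separated list of URLs) is hypothetical;
  # check the files the indexer actually writes before relying on it.
  import csv

  def load_inverted_index(path="advanced_pagerank_inverted_index.csv"):
      index = {}
      with open(path, newline="", encoding="utf-8") as f:
          for row in csv.reader(f):
              if len(row) >= 2:
                  term, urls = row[0], row[1]
                  index[term] = urls.split()
      return index
  ```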
Contributing
Contributions are welcome! If you'd like to contribute to the development of Sloth Search, feel free to fork the repository, make changes, and submit a pull request.
License
This project is open-source and available under the MIT License.
If you have any questions or suggestions, feel free to contact me.
Happy Searching with Sloth Search! 🦥🔍