Sloth Search - A Google-like Search Engine Clone
Sloth Search is a project that aims to recreate a Google-style search engine, covering crawling, indexing, and serving results through a user-friendly front-end interface. The project consists of three main components: Client, Search, and Server.
Check out the video for a full explanation here.
Project Structure
The project is divided into the following folders:
- Client: Contains the front-end code, providing a user interface similar to Google search, where users can enter queries and view search results.
- Search: Contains the core components of Sloth Search, which replicate the three main parts of Google:
  - Crawling: The web crawler that collects links and content from the web.
  - Indexing: Processing and storing the content collected by the crawler for efficient searching.
  - Serving (PageRank): Ranking and serving search results based on relevance and the PageRank algorithm.
- Server: Contains the search API used to handle client requests and provide search results.
Installation and Setup
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd sloth-search
  ```

- Install the necessary Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Client Setup
  - The client contains the HTML, CSS, and JavaScript code for the front-end.
  - Open the `index.html` file in your browser, or use a static file server to serve the client code locally (see the example below).
  - You can also use a live-server extension in your editor.
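  One convenient way to serve the client locally, assuming Python 3.7+ (for the `--directory` flag), is the standard library's built-in static file server; the folder name `Client` is taken from the project structure above:

  ```bash
  # Serve the Client folder at http://localhost:8000
  python -m http.server 8000 --directory Client
  ```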
- Search Setup
  - Run the search component you want directly with Python:

    ```bash
    python search/<path to file you want to run>
    ```
How It Works
- Crawling
  - The crawler starts with a set of seed URLs and collects links and content from the web.
  - It respects `robots.txt` to avoid being blocked and to ensure ethical crawling.
  - Parsed data is stored in a format ready for indexing. A minimal sketch of this loop follows.
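  As a rough illustration of this stage (not the project's actual crawler code), here is a standard-library sketch of a breadth-first crawl that checks `robots.txt` before each fetch; the bot name, page limit, and regex-based link extraction are simplifications made up for the example:

  ```python
  # Minimal crawling sketch (illustrative, not the project's actual crawler):
  # breadth-first crawl from seed URLs, checking robots.txt before each fetch.
  from collections import deque
  from urllib import robotparser
  from urllib.parse import urljoin, urlparse
  from urllib.request import urlopen
  import re

  USER_AGENT = "SlothSearchBot"  # hypothetical bot name for the example

  def allowed(url):
      """Consult the site's robots.txt before fetching the URL."""
      root = "{0.scheme}://{0.netloc}".format(urlparse(url))
      parser = robotparser.RobotFileParser(root + "/robots.txt")
      try:
          parser.read()
      except OSError:
          return False  # robots.txt unreachable: err on the side of not crawling
      return parser.can_fetch(USER_AGENT, url)

  def crawl(seed_urls, max_pages=10):
      """Return {url: html} for up to max_pages pages reachable from the seeds."""
      seen = set(seed_urls)
      queue = deque(seed_urls)
      pages = {}
      while queue and len(pages) < max_pages:
          url = queue.popleft()
          if not allowed(url):
              continue
          try:
              html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
          except OSError:
              continue
          pages[url] = html  # raw content, handed to the indexing step later
          # Naive regex link extraction keeps the sketch dependency-free.
          for href in re.findall(r'href="([^"#]+)"', html):
              link = urljoin(url, href)
              if link.startswith("http") and link not in seen:
                  seen.add(link)
                  queue.append(link)
      return pages
  ```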
- Indexing
  - The indexing module processes the crawled pages.
  - The content is tokenized, cleaned, stemmed, and stripped of stop words using the NLTK library.
  - The resulting indexed data is saved for use by the search API. A sketch of this step follows.
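  Here is a minimal sketch of that pipeline, assuming NLTK is installed via `requirements.txt`; the function and variable names are made up for the example:

  ```python
  # Minimal indexing sketch (illustrative, not the project's exact pipeline):
  # lowercase and tokenize each page, drop stop words, stem with NLTK, and
  # build an inverted index mapping each stemmed term to the URLs containing it.
  from collections import defaultdict

  import nltk
  from nltk.corpus import stopwords
  from nltk.stem import PorterStemmer
  from nltk.tokenize import word_tokenize

  nltk.download("punkt", quiet=True)      # tokenizer models
  nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK releases
  nltk.download("stopwords", quiet=True)  # stop-word lists

  STOP_WORDS = set(stopwords.words("english"))
  stem = PorterStemmer().stem

  def build_inverted_index(pages):
      """pages: {url: text} -> inverted index {stemmed term: set of urls}."""
      inverted = defaultdict(set)
      for url, text in pages.items():
          for token in word_tokenize(text.lower()):
              if token.isalpha() and token not in STOP_WORDS:
                  inverted[stem(token)].add(url)
      return inverted

  # Example:
  # index = build_inverted_index({"http://example.com": "Sloths search slowly."})
  # index["sloth"] -> {"http://example.com"}
  ```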
- Serving and PageRank
  - The PageRank algorithm ranks pages by importance, as measured by the links pointing to them.
  - When a user submits a query through the client, the server combines the indexed data with PageRank scores to return the most relevant pages. A sketch of the iterative ranking step follows.
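  The following is a minimal sketch of the iterative update described above, not the project's exact implementation; the damping factor of 0.85 and the iteration count are conventional defaults chosen for the example:

  ```python
  # Minimal iterative PageRank sketch (illustrative): given a link graph
  # {url: [urls it links to]}, repeatedly redistribute rank with damping.
  def pagerank(links, damping=0.85, iterations=50):
      pages = list(links)
      n = len(pages)
      ranks = {page: 1.0 / n for page in pages}  # start with a uniform rank
      for _ in range(iterations):
          new_ranks = {}
          for page in pages:
              # Each page that links to `page` shares its rank equally
              # among its outgoing links.
              incoming = sum(
                  ranks[other] / len(links[other])
                  for other in pages
                  if page in links[other]
              )
              new_ranks[page] = (1 - damping) / n + damping * incoming
          ranks = new_ranks
      return ranks

  # Example: b and c both link to a, so a ends up with the highest rank.
  # pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
  ```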
Important Notes
- Respecting Websites: The crawler respects `robots.txt` rules. Please make sure not to overload any websites.
- PageRank Algorithm: The PageRank implementation uses an iterative approach to rank pages based on their link structure.
- Data Storage: The crawler and indexer use CSV files for data storage (`advanced_pagerank_inverted_index.csv` and `advanced_pagerank.csv`). Make sure these files are writable during execution. A sketch of loading the index at query time follows.
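  If you want to inspect or reuse these files, a minimal loader might look like the sketch below; note that the column layout it assumes (term, space-separated URLs) is hypothetical and should be checked against the indexer's actual output:

  ```python
  # Sketch of loading the inverted-index CSV at query time. The two-column
  # layout assumed here (term, space-separated list of URLs) is hypothetical;
  # check the files the indexer actually writes before relying on it.
  import csv

  def load_inverted_index(path="advanced_pagerank_inverted_index.csv"):
      index = {}
      with open(path, newline="", encoding="utf-8") as f:
          for row in csv.reader(f):
              if len(row) >= 2:
                  term, urls = row[0], row[1]
                  index[term] = urls.split()
      return index
  ```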
Contributing
Contributions are welcome! If you'd like to contribute to the development of Sloth Search, feel free to fork the repository, make changes, and submit a pull request.
License
This project is open-source and available under the MIT License.
If you have any questions or suggestions, feel free to contact me.
Happy Searching with Sloth Search! 🦥🔍