Future work section: the Simplecrawler currently stores the whole header of the web request, even though only the URL is needed to determine the similarity of two web pages; storing only a hash of the URL would suffice.
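As a minimal sketch of that idea (all names here are hypothetical, not the Simplecrawler's actual API): keep a visited set keyed by a URL hash instead of the full request header.

```python
import hashlib

# Visited set keyed by URL hash; needs far less memory than full headers.
visited = set()

def url_fingerprint(url: str) -> str:
    """Return a compact fingerprint of a URL.

    Comparing these hashes is enough to decide whether two
    pages refer to the same URL, without keeping the header.
    """
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def seen_before(url: str) -> bool:
    """Record the URL's hash and report whether it was already known."""
    fp = url_fingerprint(url)
    if fp in visited:
        return True
    visited.add(fp)
    return False
```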
Show the effect of the number of threads and nodes on the number of pages crawled.
Describe the hardware and OS used for testing.
Crawlers
Testing page
Stack Overflow
Uni ranking
Douglas
Flaconi
Check that the number of crawled pages is correct
Measure efficiency by comparing the links and documents found
Check coverage against the known set of links
Show with a bar chart how many documents each thread fetched
Scalability: show how easily the crawler scales.
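The per-thread bar chart needs a document count per worker; a minimal sketch (hypothetical worker function, not the actual crawler code) of tallying fetches per thread:

```python
import threading
from collections import Counter

# Tally of documents fetched per thread; the counts can later
# be rendered as a bar chart (e.g. with matplotlib's bar()).
fetch_counts = Counter()
lock = threading.Lock()

def worker(name: str, documents: list) -> None:
    # Stand-in for the real fetch loop; here we only count documents.
    for _doc in documents:
        with lock:
            fetch_counts[name] += 1

threads = [
    threading.Thread(target=worker, args=(f"thread-{i}", ["a", "b", "c"]))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# fetch_counts now maps thread name -> number of documents fetched
```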
Different languages
Coverage: the percentage of relevant pages that the crawler can discover and download from the web.
Freshness: the degree to which the crawler can keep up with the changes and updates of the web pages.
Quality: the relevance and importance of the pages that the crawler selects for downloading.
Scalability: the ability of the crawler to handle large-scale and distributed crawling tasks efficiently and robustly.
Politeness: the extent to which the crawler respects the rules and policies of the web servers and avoids overloading them.
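Of these metrics, coverage is the one checked directly against the known links set above; a minimal sketch (hypothetical function name) of how it could be computed:

```python
def coverage(downloaded: set, relevant: set) -> float:
    """Fraction of the known relevant pages the crawler downloaded."""
    if not relevant:
        return 0.0
    return len(downloaded & relevant) / len(relevant)
```

For example, downloading two out of four known relevant pages gives a coverage of 0.5.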
To measure these metrics, one can use various methods such as:
Benchmarks: using a predefined set of web pages or domains as a reference for evaluating the crawler’s performance.
Simulations: using a synthetic or sampled web graph to model the structure and dynamics of the web and test the crawler’s behavior.
Experiments: running the crawler on a real or partial web and collecting data on its actions and outcomes.
Indexers
Check movies
Fuzzy search
UI