Future work section: the Simplecrawler currently stores the whole header of the web request, even though only the URL is needed to determine the similarity of two web pages; storing only a hash of the URL would suffice.
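As a minimal sketch of that idea (all names here are hypothetical, not the Simplecrawler's actual API): keep a visited set keyed by a URL hash instead of the full request header.

```python
import hashlib

# Visited set keyed by URL hash; needs far less memory than full headers.
visited = set()

def url_fingerprint(url: str) -> str:
    """Return a compact fingerprint of a URL.

    Comparing these hashes is enough to decide whether two
    pages refer to the same URL, without keeping the header.
    """
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def seen_before(url: str) -> bool:
    """Record the URL's hash and report whether it was already known."""
    fp = url_fingerprint(url)
    if fp in visited:
        return True
    visited.add(fp)
    return False
```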
Show the effect of the number of threads and nodes on the number of pages crawled.
Describe the hardware and OS used for testing.
Crawlers
Testing page
Stack Overflow
Uni ranking
Douglas
Flaconi
Check that the number of crawled pages is correct
Measure efficiency by comparing the links and documents found
Check coverage against the known set of links
Show with a bar chart how many documents each thread fetched
Scalability: show how easily the crawler scales.
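The per-thread bar chart needs a document count per worker; a minimal sketch (hypothetical worker function, not the actual crawler code) of tallying fetches per thread:

```python
import threading
from collections import Counter

# Tally of documents fetched per thread; the counts can later
# be rendered as a bar chart (e.g. with matplotlib's bar()).
fetch_counts = Counter()
lock = threading.Lock()

def worker(name: str, documents: list) -> None:
    # Stand-in for the real fetch loop; here we only count documents.
    for _doc in documents:
        with lock:
            fetch_counts[name] += 1

threads = [
    threading.Thread(target=worker, args=(f"thread-{i}", ["a", "b", "c"]))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# fetch_counts now maps thread name -> number of documents fetched
```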
Different languages
Coverage: the percentage of relevant pages that the crawler can discover and download from the web.
Freshness: the degree to which the crawler can keep up with the changes and updates of the web pages.
Quality: the relevance and importance of the pages that the crawler selects for downloading.
Scalability: the ability of the crawler to handle large-scale and distributed crawling tasks efficiently and robustly.
Politeness: the extent to which the crawler respects the rules and policies of the web servers and avoids overloading them.
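Of these metrics, coverage is the one checked directly against the known links set above; a minimal sketch (hypothetical function name) of how it could be computed:

```python
def coverage(downloaded: set, relevant: set) -> float:
    """Fraction of the known relevant pages the crawler downloaded."""
    if not relevant:
        return 0.0
    return len(downloaded & relevant) / len(relevant)
```

For example, downloading two out of four known relevant pages gives a coverage of 0.5.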
To measure these metrics, one can use various methods such as:
Benchmarks: using a predefined set of web pages or domains as a reference for evaluating the crawler’s performance.
Simulations: using a synthetic or sampled web graph to model the structure and dynamics of the web and test the crawler’s behavior.
Experiments: running the crawler on a real or partial web and collecting data on its actions and outcomes.
Indexers
Check movies
Fuzzy search
UI