Roshdy23 / Playmaker

Playmaker is a crawler-based search engine that demonstrates the main features of a search engine (web crawling, indexing, and ranking) along with a friendly user interface for interacting with it.

Crawler Stops Working Without Throwing an Exception #13

Closed: AbdallahSalah003 closed this issue 2 months ago

AbdallahSalah003 commented 2 months ago

When I set the maxDepth variable to 100 and update Seed.txt with the following list:

https://www.goal.com/en
https://www.sofascore.com/
https://www.marca.com/en/?intcmp=BOTONPORTADA&s_kw=portada
https://www.thesun.co.uk/sport/football/
https://www.realmadrid.com/en-US
https://www.manutd.com/
https://www.liverpoolfc.com/
https://www.bbc.com/sport/football
https://www.skysports.com/
https://www.bleacherreport.com/world-football
https://www.football365.com/
https://www.fourfourtwo.com/
https://www.theguardian.com/football
https://fcbayern.com/en
https://www.mancity.com/
https://www.fcbarcelona.com/en/
https://www.fifa.com
https://www.uefa.com
https://www.premierleague.com
https://www.laliga.com
https://www.bundesliga.com
https://www.transfermarkt.com
https://www.ligue1.com

The crawler stops at a random point each time without throwing an exception. It logs "Error in robots.txt:" followed by a message to the console, then stops working without logging anything else. I waited around 15 minutes and nothing was logged to the console, and no exception was thrown. I changed Seed.txt and it seems to work fine, so I think some websites make the crawler stall or stop doing its job, and it is related to their robots.txt files.
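My guess is that the robots.txt fetch for some of these hosts blocks forever because no timeout is set on the connection. The following is only a rough sketch of the idea, assuming the crawler is written in Java and fetches robots.txt over plain HttpURLConnection; the class name, method name, and timeout values are hypothetical, not the project's actual code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RobotsFetcher {

    // Hypothetical helper: fetch robots.txt with hard timeouts so that a
    // non-responsive host fails fast instead of silently blocking the crawl.
    public static String fetchRobots(String host) {
        try {
            URL url = new URL("https://" + host + "/robots.txt");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(5_000);  // give up connecting after 5 seconds
            conn.setReadTimeout(5_000);     // give up after 5 seconds without data
            conn.setRequestProperty("User-Agent", "PlaymakerCrawler");
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            // Log and fall back to a default policy (e.g. skip the site)
            // instead of hanging with no further output.
            System.err.println("Error in robots.txt for " + host + ": " + e.getMessage());
            return "";
        }
    }
}
```

With timeouts like these, a site with an unreachable or slow robots.txt would produce a logged error and the crawl would move on to the next URL instead of freezing.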

Also, when I increase the maxDepth variable to 100, the crawler stops working without throwing any exception.

Another thing to consider (I don't know if it's already handled) is that there must be stop conditions on crawling. When the crawler starts crawling Goal.com, it crawls without stopping and none of the other links in Seed.txt get crawled, so I had to stop it manually to prevent all crawled websites from coming from the same original link (Goal.com).

Our goal is to crawl around 6000 pages, so we need to solve this issue so that Seed.txt can be as large as possible. A sketch of one possible stop condition is shown below.
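For the stop condition, one option is a breadth-first frontier that enforces both a global page budget and a per-domain cap, so that a single seed like Goal.com cannot monopolize the crawl. This is just a sketch of the idea, not the project's actual code; the class name and the limits (6000 total, 300 per domain) are illustrative:

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class Frontier {
    private static final int MAX_PAGES = 6000;            // overall crawl budget
    private static final int MAX_PAGES_PER_DOMAIN = 300;  // fairness cap per site

    private final Queue<String> queue = new ArrayDeque<>();     // FIFO => breadth-first
    private final Set<String> seen = new HashSet<>();           // avoid re-queuing URLs
    private final Map<String, Integer> perDomain = new HashMap<>();
    private int crawled = 0;

    public boolean hasNext() {
        return crawled < MAX_PAGES && !queue.isEmpty();
    }

    public String next() {
        crawled++;
        return queue.poll();
    }

    public void add(String url) {
        String domain;
        try {
            domain = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return; // malformed URL, skip it
        }
        if (domain == null || seen.contains(url)) return;
        int count = perDomain.getOrDefault(domain, 0);
        if (count >= MAX_PAGES_PER_DOMAIN) return; // don't let one site dominate
        perDomain.put(domain, count + 1);
        seen.add(url);
        queue.add(url);
    }
}
```

Processing the frontier in FIFO order (instead of following Goal.com links depth-first) would also ensure every seed URL gets visited before the crawler goes deep into any single site.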

I may have misunderstood something, so if anyone has a comment, please let me know.

Thank you.

Roshdy23 commented 2 months ago

Thanks for your cooperation. Issue fixed.