Crawler leaves its original domain and goes into other websites. It should only use links in the same domain or its subdomains. Not outside websites.

ekkyarmandi / text-scraping

HTML text scraping


Crawler leaves its original domain and goes into other websites. It should only use links in the same domain or its subdomains. Not outside websites. #2

Closed (why-not closed this issue 2 years ago)

why-not commented 2 years ago

With url = "https://email.gov.in/", the crawler gives the following output, which shows the issue:

Next URL: https://email.gov.in#
Words: 2583
Next URL: https://email.gov.in/videos/docs/How-To-Use-Kavach.pdf
Words: 2801
Next URL: https://eforms.nic.in/update-mobile
Words: 3210
Next URL: https://apps.apple.com/in/app/kavach-authentication/id1227301621
Words: 3679
Next URL: https://play.google.com/store/apps/details?id=com.gov.in&hl=en_IN&gl=US

ekkyarmandi commented 2 years ago

Fixed. I have added a self.allowed variable that filters the collected links, so that only links within "/" are queued to be crawled next.
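
For readers who hit the same problem, here is a minimal sketch of one way to keep a crawler on its starting host and that host's subdomains. It is not the code from the repository's fix: the function names `same_site` and `filter_links` are hypothetical, and the sketch assumes links are compared by hostname using Python's standard `urllib.parse`.

```python
from urllib.parse import urljoin, urlparse


def same_site(link, start_url):
    """Return True if link points at the start host or one of its subdomains.

    Hypothetical helper, not taken from the repository.
    """
    start_host = (urlparse(start_url).hostname or "").lower()
    # Resolve relative links (e.g. "/docs/a.pdf") against the start URL first.
    link_host = (urlparse(urljoin(start_url, link)).hostname or "").lower()
    if not start_host or not link_host:
        # Links without a host (mailto:, javascript:, ...) are rejected.
        return False
    # The leading dot prevents "notemail.gov.in" from matching "email.gov.in".
    return link_host == start_host or link_host.endswith("." + start_host)


def filter_links(links, start_url):
    """Keep only same-site links, returned as absolute URLs."""
    return [urljoin(start_url, link) for link in links if same_site(link, start_url)]
```

With `start_url = "https://email.gov.in/"`, this check would keep `https://email.gov.in/videos/docs/How-To-Use-Kavach.pdf` and reject the `eforms.nic.in`, `apps.apple.com`, and `play.google.com` links from the output above.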