reuning closed 3 years ago
Thanks for your contribution 👌.
I'm curious: what do you use this tool for?
I've reworked the code a bit so that I can grab all the external links from a set of websites (a few thousand of them). I'm an academic working on a project that looks at the networks political websites form (which sites they link to in common). I'm also downloading the HTML for each website so that I have a record of the content.
There might be better ways to do it, but your code was a reasonable starting point.
This might be a non-problem, as I am basically using this code to grab all the webpages of a site and am not concerned with a sitemap. But WordPress commonly uses query strings in its URLs to access different pages, e.g. `example.com/?p=1`. The current code skips all of these because their path is just '/'. This change also checks the query part of the parsed URL.
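For anyone reading along, here is a rough sketch of the idea, not the actual patch. It assumes the crawler tracks visited pages in a set; `page_key`, `is_new_page`, and `seen` are hypothetical names used only for illustration. The point is that the path and the query string together identify a page, so `/?p=1` and `/?p=2` are no longer both collapsed to '/'.

```python
from urllib.parse import urlparse


def page_key(url):
    """Return a (path, query) pair identifying a page within its site.

    Hypothetical helper: an empty path is normalised to "/" so that
    "http://example.com" and "http://example.com/" map to the same key.
    """
    parsed = urlparse(url)
    path = parsed.path or "/"
    return (path, parsed.query)


def is_new_page(url, seen):
    """Record the page in `seen` and report whether it was new.

    `seen` is an assumed set of (path, query) keys kept by the crawler.
    """
    key = page_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True


seen = set()
first = is_new_page("http://example.com/?p=1", seen)   # new page
second = is_new_page("http://example.com/?p=2", seen)  # same path, different query
repeat = is_new_page("http://example.com/?p=1", seen)  # already visited
```

One trade-off with this approach: query strings that differ only in tracking parameters or parameter order would be treated as separate pages, so a real crawler might want to normalise the query before comparing.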