Fixes skipping pages accessed with ?p=

c4software / python-sitemap

Mini website crawler to make sitemap from a website.

GNU General Public License v3.0

362 stars, 110 forks

Fixes skipping pages accessed with ?p= #70

Closed (reuning closed this 3 years ago)

reuning commented 3 years ago

This might be a non-problem, as I am basically using this code to grab all the webpages of a site and am not concerned with the sitemap itself. But WordPress commonly uses query strings in its URLs to address different pages, e.g. example.com/?p=1. The current code skips all of these because their path is just '/'. This change checks the query part of the parsed URL as well.
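To illustrate the issue described above, here is a minimal sketch using only the standard library. It is not the code from this PR or from python-sitemap; the helper names `page_key` and `should_skip` are hypothetical and exist only for this example. It shows that `urlparse` reports the same path for every `?p=` URL, and how including the query in the comparison key keeps those pages distinct.

```python
from urllib.parse import urlparse


def page_key(url: str) -> str:
    """Return a key that identifies a page by path AND query string.

    Hypothetical helper for illustration, not part of python-sitemap.
    """
    parsed = urlparse(url)
    path = parsed.path or "/"
    # Comparing on the path alone would make /?p=1 and /?p=2 look identical.
    return f"{path}?{parsed.query}" if parsed.query else path


def should_skip(url: str, seen: set) -> bool:
    """Return True if this page was already visited; otherwise record it."""
    key = page_key(url)
    if key in seen:
        return True
    seen.add(key)
    return False


if __name__ == "__main__":
    seen = set()
    for url in ("http://example.com/?p=1",
                "http://example.com/?p=2",
                "http://example.com/?p=1"):
        print(url, "skip" if should_skip(url, seen) else "crawl")
```

Note that the URLs need a scheme (`http://`) for `urlparse` to split host and path as expected; a bare `example.com/?p=1` would be parsed with the host as part of the path.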

c4software commented 3 years ago

Thanks for your contribution 👌.

I'm curious: what do you use this tool for?

reuning commented 3 years ago

I've reworked the code a bit so that I can grab all the external links from a set of websites (a few thousand of them). I'm an academic working on a project that looks at the networks political websites form (which sites they link to in common). I am also downloading the HTML for the websites so that I have a record of the content.

There might be better ways to do it, but your code was a reasonable starting point.
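For readers with a similar goal, the external-link collection described above could be sketched roughly as follows. This is an assumption about the general approach, not reuning's actual rework; `LinkCollector` and `external_links` are hypothetical names, and "external" is taken here to mean any http(s) link whose host differs from the page's host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_links(html: str, base_url: str) -> set:
    """Return absolute http(s) links that point to a different host."""
    base_host = urlparse(base_url).netloc.lower()
    collector = LinkCollector()
    collector.feed(html)
    found = set()
    for href in collector.hrefs:
        # Resolve relative links such as "/about" or "?p=2" against the page.
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc.lower() != base_host:
            found.add(absolute)
    return found
```

The exact host comparison treats subdomains such as `www.example.com` and `example.com` as different sites; a real study would probably need to normalize those first.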