c4software / python-sitemap

Mini website crawler to make sitemap from a website.
GNU General Public License v3.0

Fixed handling of relative URLs #56

Closed mnlipp closed 5 years ago

mnlipp commented 5 years ago

Currently, relative URLs aren't handled correctly, and the fix touches several places. First, relative links have to be resolved against the URL of the crawled page (crawler.py:268). Second, clean_link is wrong: it doesn't handle "../../.." correctly (it collapses it to "./././"). Third, links cannot be cleaned immediately when parsed, so I removed that call to clean_link.
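To illustrate the first point, here is a minimal sketch of resolving a link against the URL of the page it was found on, using the standard library's `urllib.parse.urljoin`. This is not the code from this pull request: the function name `resolve_link` and the example.org URLs are made up for illustration, and dropping the fragment is an assumption about what a sitemap crawler would want.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def resolve_link(page_url, href):
    """Resolve href against the URL of the page it was found on.

    Hypothetical helper, not part of python-sitemap.
    """
    # urljoin applies the standard reference-resolution rules, so
    # "../" segments walk up from the page's directory instead of
    # being rewritten textually.
    absolute = urljoin(page_url, href)
    parts = urlsplit(absolute)
    # Assumption: fragments ("#section") point into the same document,
    # so they are dropped before the URL is recorded.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


page = "https://example.org/docs/guide/intro.html"
print(resolve_link(page, "../../index.html"))      # https://example.org/index.html
print(resolve_link(page, "./setup.html#install"))  # https://example.org/docs/guide/setup.html
print(resolve_link(page, "/about/"))               # https://example.org/about/
```

The point of the sketch is that the base has to be the crawled page's own URL rather than the site root, because "../.." only has a meaning relative to the directory of the page that contains the link.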

c4software commented 5 years ago

Hi,

Thanks for the pull request, it seems correct, but do you have a sample website to validate the behavior?

mnlipp commented 5 years ago

> Thanks for the pull request, it seems correct, but do you have a sample website to validate the behavior?

Of course. I found the problem when I tried to index my GitHub site.

(Make sure to use only one worker. When I tried it with 4, I got only half the entries in the sitemap. But that's a different issue and I didn't have time to look into that.)

c4software commented 5 years ago

Hi,

Seems good! Sorry for the merge delay.

Thanks