mnlipp closed 5 years ago
Hi,
Thanks for the pull request, it seems correct, but do you have a sample website to validate the behavior?
Of course. I found the problem when I tried to index my github site.
(Make sure to use only one worker. When I tried it with 4, I got only half the entries in the sitemap. But that's a different issue and I didn't have time to look into that.)
Hi,
Seems good! Sorry for the merge delay.
Thanks
Currently, relative URLs aren't handled correctly. This affects several locations. First, relative links have to be resolved against the URL of the crawled page (crawler.py:268). Second, clean_link is wrong: it doesn't handle "../../.." correctly (it gets collapsed to "./././"). Third, links cannot be cleaned immediately when parsed, so the call to clean_link at that point has been removed.
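
To illustrate the first two points, here is a minimal sketch of resolving a link against the URL of the page it was found on, using the standard library's `urllib.parse.urljoin`. The function name `resolve_link` and the example.org URLs are hypothetical and only for illustration; this is not the code in crawler.py.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Resolve href against the URL of the page it was found on.

    Hypothetical helper, not part of the crawler. urljoin applies the
    relative-reference rules, so "../" segments are resolved against
    the directory of page_url instead of being rewritten to "./".
    """
    return urljoin(page_url, href)


page = "https://example.org/docs/guide/index.html"

# A link that climbs two directories from /docs/guide/
print(resolve_link(page, "../../img/logo.png"))

# A sibling page in the same directory
print(resolve_link(page, "intro.html"))

# An absolute link is returned unchanged
print(resolve_link(page, "https://other.example/x"))
```

Resolving against the page URL first is also why cleaning cannot happen at parse time: until the link has been joined with its base, the "../" segments carry information that a purely textual cleanup would destroy.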