Relative URLs are parsed incorrectly

c4software / python-sitemap

Mini website crawler to make sitemap from a website.

GNU General Public License v3.0


Relative URLs are parsed incorrectly #48

Open ghost opened 6 years ago

ghost commented 6 years ago

If http://domain/dir/page1.html contains a link to page2.html, the parser resolves it to http://domain/page2.html. The correct URL is http://domain/dir/page2.html.

Furthermore, when a page contains references to parent directories (..), self.clean_link changes them to a single dot (.).

I recommend using urllib.parse.urljoin(crawling_url, link) to turn a link into an absolute URL. This will handle everything except "//" in the path.
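The suggestion can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name to_absolute is hypothetical, and crawling_url is assumed to be the full URL of the page on which the link was found.

```python
from urllib.parse import urljoin


def to_absolute(crawling_url: str, link: str) -> str:
    """Resolve `link` against the URL of the page it was found on.

    `to_absolute` is a hypothetical helper name used for illustration;
    it only wraps the standard-library urljoin call.
    """
    return urljoin(crawling_url, link)


base = "http://domain/dir/page1.html"

# A plain relative link stays inside the directory of the current page.
print(to_absolute(base, "page2.html"))     # http://domain/dir/page2.html

# A parent-directory reference is resolved instead of being rewritten.
print(to_absolute(base, "../page3.html"))  # http://domain/page3.html

# A root-relative link replaces the whole path.
print(to_absolute(base, "/top.html"))      # http://domain/top.html
```

Resolving against the page URL rather than the site root is what keeps page2.html inside /dir/ in the first case.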

ghost commented 6 years ago

I found another program that suits my needs.

Nevertheless, thanks to @c4software for this nice piece of software.

c4software commented 6 years ago

Sorry I didn't respond to your issue sooner…

I will fix the problem you found as soon as possible.