c4software / python-sitemap

Mini website crawler to make sitemap from a website.
GNU General Public License v3.0
366 stars, 110 forks

Suggestion: Not parseable resources -> parseable resources #40

Open ghost opened 7 years ago

ghost commented 7 years ago

I took a peek at your source code. One source of crawling issues is that the code currently defines not_parseable_ressources, a list of extensions to skip. If you instead define parseable resources, and limit them to resources that are supported in the sitemap and may contain plain-text HTML links, you avoid problems with unknown extensions, because anything not on the list is simply not parsed. You might also look at using MIME types instead of file extensions. I am not sure how that works in Python, though.
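As a rough illustration of the allowlist idea, here is a minimal Python sketch. It is not taken from the python-sitemap source: the names `PARSEABLE_MIME_TYPES` and `is_parseable` are hypothetical, and treating extensionless paths as HTML is an assumption. It prefers the server's `Content-Type` header when one is available and falls back to guessing from the extension with the standard-library `mimetypes` module.

```python
import mimetypes
from urllib.parse import urlparse

# Hypothetical allowlist (an assumption, not from python-sitemap):
# only document types that can contain plain-text HTML links.
PARSEABLE_MIME_TYPES = {"text/html", "application/xhtml+xml"}


def is_parseable(url, content_type=None):
    """Return True if the resource should be parsed for links.

    content_type is the raw Content-Type header value, if the
    crawler has already received a response for this URL.
    """
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = content_type.split(";", 1)[0].strip().lower()
        return mime in PARSEABLE_MIME_TYPES

    # No header available: guess from the path's extension only,
    # so that query strings do not confuse the lookup.
    path = urlparse(url).path
    guessed, _encoding = mimetypes.guess_type(path)
    if guessed is None:
        # Assumption: a last path segment without a dot ("/about/")
        # is most likely an HTML page; unknown extensions are skipped.
        last_segment = path.rsplit("/", 1)[-1]
        return "." not in last_segment
    return guessed in PARSEABLE_MIME_TYPES
```

With this approach an unknown extension is skipped by default rather than parsed by default. The extension fallback is only a heuristic (dynamic pages with extensions that `mimetypes` does not recognise would be skipped), so checking the `Content-Type` header is likely the more reliable path where a response is available.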

c4software commented 7 years ago

The not-parseable resource list is more of a safeguard to avoid some nasty cases. But it's not a bad idea; I will take a look.