c4software / python-sitemap

Mini website crawler to make sitemap from a website.
GNU General Public License v3.0

Tracker images are included #27

Closed: ghost closed this issue 7 years ago

ghost commented 7 years ago

Tracker image links get added to the sitemap, but they should be left out. Before adding an image, you could simply check that its extension is not php or js, or that it is a valid image type.

Running:

python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots

Output:

<image:image><image:loc>https://analytics.2globalnomads.info/piwik.php?idsite=1&amp;rec=1</image:loc>

It appears that the exclusion parameters (--skipext, --exclude, --drop) have no effect on images.
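The extension check suggested in this comment could look roughly like the sketch below. It is only an illustration: `IMAGE_EXTENSIONS` and `looks_like_image` are hypothetical names, not part of python-sitemap, and the allow-list of extensions is an assumption.

```python
import os
from urllib.parse import urlparse

# Hypothetical allow-list of image extensions (an assumption, not taken
# from python-sitemap); adjust it to the image types you want to keep.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".ico"}


def looks_like_image(url):
    """Return True if the URL path ends in a known image extension."""
    # Only the path is inspected, so query strings such as
    # "?idsite=1&rec=1" do not affect the result.
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lower()
    return extension in IMAGE_EXTENSIONS
```

With this check, the tracker URL ending in `piwik.php` would be rejected because `.php` is not in the allow-list, while a URL whose path ends in `.png` or `.JPG` would be kept.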

c4software commented 7 years ago

Nice catch!

The next version will skip images that match the --exclude option.

Example:

python3 main.py --domain https://www.2globalnomads.info --images --report --parserobots --exclude "piwik.php"
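
As a rough illustration of what skipping excluded images could mean, here is a minimal sketch. It assumes that each --exclude value is treated as a plain substring of the image URL; that matching rule and the helper names `is_excluded` and `filter_images` are assumptions for this example, not the project's actual code.

```python
def is_excluded(url, exclude_patterns):
    """Return True if any exclude pattern occurs as a substring of the URL."""
    # Assumption: plain substring matching, no regular expressions.
    return any(pattern in url for pattern in exclude_patterns)


def filter_images(image_urls, exclude_patterns):
    """Keep only the image URLs that match none of the exclude patterns."""
    return [url for url in image_urls if not is_excluded(url, exclude_patterns)]


# Hypothetical input: the tracker URL from this issue plus an invented image URL.
images = [
    "https://analytics.2globalnomads.info/piwik.php?idsite=1&rec=1",
    "https://www.2globalnomads.info/images/logo.png",
]
kept = filter_images(images, ["piwik.php"])
```

Under these assumptions, `kept` contains only the `logo.png` URL, since the tracker URL contains the substring `piwik.php`.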
c4software commented 7 years ago

This is now available in the latest version.