c4software / python-sitemap

Mini website crawler to make sitemap from a website.
GNU General Public License v3.0

IMG Data URI and image license #26

Closed ghost closed 7 years ago

ghost commented 7 years ago

Data URI image links get added to the sitemap, but they should be left out. They are commonly used, for example, for lazy-loading images. The real image URLs are inside NOSCRIPT tags, and those get added correctly.

Running:

python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots

Output:

<image:loc>https://www.2globalnomads.info/data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7</image:loc>

A few improvement proposals

An image sitemap is the only way to tell search engines the licenses of images. Please consider adding an option to the script for a site-wide license that applies to all images. It could work like this:

python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --license http://creativecommons.org/publicdomain/zero/1.0/

With the following output added inside <image:image> after <image:loc>:

<image:license>http://creativecommons.org/publicdomain/zero/1.0/</image:license>
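To make the proposal concrete, here is a minimal sketch of how one image entry could carry a site-wide license. The function name `image_entry` and its parameters are hypothetical and are not part of python-sitemap; `--license` is the flag proposed above, not an existing option.

```python
from xml.sax.saxutils import escape


def image_entry(loc, license_url=None):
    """Build one <image:image> element (hypothetical helper, not project code).

    license_url would come from the proposed --license option; when it is
    missing, no <image:license> element is written.
    """
    parts = ["<image:image>", "<image:loc>%s</image:loc>" % escape(loc)]
    if license_url:
        # The license is placed after <image:loc>, as suggested in the issue.
        parts.append("<image:license>%s</image:license>" % escape(license_url))
    parts.append("</image:image>")
    return "".join(parts)


print(image_entry(
    "https://www.2globalnomads.info/photo.jpg",  # made-up image URL
    "http://creativecommons.org/publicdomain/zero/1.0/",
))
```

Escaping the values keeps the output well-formed if a URL contains `&`.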

You could pretty-print sitemap.xml a bit by adding a newline after every closing tag. That would make it a bit more human-readable.
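One possible way to get that layout is to post-process the generated file with the standard library's `xml.dom.minidom`. This is only a sketch: the `prettify` helper and the sample input are assumptions, and the script itself may be better served by writing the newlines as it goes.

```python
from xml.dom import minidom


def prettify(xml_text):
    """Re-indent a compact XML string, one element per line (hypothetical helper)."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")


# Made-up compact sitemap; the namespaces must be declared or parsing fails.
compact = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
    'xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">'
    "<url><loc>https://example.com/</loc>"
    "<image:image><image:loc>https://example.com/a.jpg</image:loc></image:image>"
    "</url></urlset>"
)

pretty = prettify(compact)
print(pretty)
```

The input should have no whitespace between tags, otherwise `toprettyxml` tends to add blank lines.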

If you want, you can also take <image:title> from the TITLE and/or ALT attributes and <image:caption> from FIGCAPTION tags, when they are present.
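As a rough illustration of that idea, the sketch below collects a title and caption per image with the standard library's `html.parser`. The class name `ImageMetaParser`, the preference for TITLE over ALT, and the rule that a FIGCAPTION applies to every image in the same FIGURE are all assumptions made for this example, not how python-sitemap parses pages.

```python
from html.parser import HTMLParser


class ImageMetaParser(HTMLParser):
    """Collect src, title and caption for each <img> (hypothetical sketch)."""

    def __init__(self):
        super().__init__()
        self.images = []            # one dict per <img> found
        self._figure_start = None   # index of first image in the open <figure>
        self._in_caption = False
        self._caption = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "figure":
            self._figure_start = len(self.images)
        elif tag == "img" and attrs.get("src"):
            self.images.append({
                "src": attrs["src"],
                # Assumption: prefer TITLE, fall back to ALT.
                "title": attrs.get("title") or attrs.get("alt"),
                "caption": None,
            })
        elif tag == "figcaption" and self._figure_start is not None:
            self._in_caption = True
            self._caption = []

    def handle_data(self, data):
        if self._in_caption:
            self._caption.append(data)

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._figure_start is not None:
            # Apply the caption to every image seen inside this <figure>.
            text = " ".join("".join(self._caption).split())
            for image in self.images[self._figure_start:]:
                image["caption"] = text or None
            self._figure_start = None
            self._caption = []


parser = ImageMetaParser()
parser.feed(
    '<figure><img src="a.jpg" alt="Beach">'
    "<figcaption>Sunset at the <b>beach</b></figcaption></figure>"
    '<img src="b.jpg" title="Logo" alt="x">'
)
print(parser.images)
```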

Cheers, Santeri

c4software commented 7 years ago

Nice catch.

The data: scheme is now ignored.
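For readers who want to see the idea, a filter of this kind might look roughly like the sketch below. It is not the actual change made in the repository; the helper name `resolve_image_url` and the example URLs are assumptions.

```python
from urllib.parse import urljoin


def resolve_image_url(page_url, src):
    """Return an absolute image URL, or None for data: URIs (hypothetical helper)."""
    src = src.strip()
    if src.lower().startswith("data:"):
        # Inline placeholder (e.g. a lazy-loading GIF), not a crawlable image.
        return None
    return urljoin(page_url, src)


placeholder = "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
print(resolve_image_url("https://www.2globalnomads.info/", placeholder))
print(resolve_image_url("https://www.2globalnomads.info/blog/", "img/photo.jpg"))
```

Checking the scheme before joining with the page URL avoids entries like the one in the report, where the domain was prepended to the data URI.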

I will look into the licence and the image:title / image:caption later.