c4software / python-sitemap

Mini website crawler to make sitemap from a website.
GNU General Public License v3.0

Image Sitemap? #22

Closed wernerb90 closed 7 years ago

wernerb90 commented 7 years ago

Hi,

Would you consider adding support for images in the future?

i.e. https://support.google.com/webmasters/answer/178636?hl=en

c4software commented 7 years ago

Nice idea.

c4software commented 7 years ago

I have added the --images flag to enable image sitemap.

Feel free to test it and report back if you want any modifications.

wernerb90 commented 7 years ago

@c4software Amazing, thanks, will test it out now!

wernerb90 commented 7 years ago

Hi @c4software

So I ran it on one of my sites, and it seems there's a bit of a format issue.

Currently you're adding the <image:image> tags to the end of the sitemap; instead they should be within each <url> tag, i.e. you're associating the image URLs with the page/URL they were scraped from.

i.e. on the link I referenced in the original post you'll see an example:

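    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
        <url>
            <loc>http://example.com/sample.html</loc>
            <image:image>
                <image:loc>http://example.com/image.jpg</image:loc>
            </image:image>
            <image:image>
                <image:loc>http://example.com/photo.jpg</image:loc>
            </image:image>
        </url>
    </urlset>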

c4software commented 7 years ago

Nice catch, I missed that point!

I will look at how I can achieve this (without creating duplication in the sitemap, e.g. the same image on every page).

c4software commented 7 years ago

Well, I looked deeper, and image:loc should be repeated in every location where the image is present. So it's simpler than I expected.
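To illustrate the point above, here is a minimal sketch of nesting each image inside the `<url>` entry of the page it was found on, repeating the image wherever it appears. This is not the project's code: the `url_entry` helper, the `pages` mapping and the example.com URLs are made up for the illustration.

```python
from xml.sax.saxutils import escape


def url_entry(page_url, image_urls):
    """Return one <url> element with its images nested inside it (hypothetical helper)."""
    lines = ["<url>", f"  <loc>{escape(page_url)}</loc>"]
    for image_url in image_urls:
        lines.append("  <image:image>")
        lines.append(f"    <image:loc>{escape(image_url)}</image:loc>")
        lines.append("  </image:image>")
    lines.append("</url>")
    return "\n".join(lines)


# Assumed crawl result: the same avatar is used on both pages,
# so it is listed under both <url> entries rather than deduplicated.
pages = {
    "http://example.com/a.html": ["http://example.com/avatar.jpg"],
    "http://example.com/b.html": [
        "http://example.com/avatar.jpg",
        "http://example.com/photo.jpg",
    ],
}

sitemap_body = "\n".join(url_entry(page, images) for page, images in pages.items())
print(sitemap_body)
```

Keeping the images in a per-page list means no global deduplication step is needed, which matches the "simpler than expected" observation.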

wernerb90 commented 7 years ago

Hi @c4software

Yes - I believe the most important part is just that the images/pages are associated correctly, which shouldn't be too massive of an issue.

Would the exclude path option work for these, as then one can exclude cosmetic images by folder - i.e.

/assets/img/ - this could contain cosmetic images, icons, etc.

/media/ or /uploads/ - this would contain images we want to tag as images in the sitemap.

so by excluding /assets/img/ all other images would be included?
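As a rough sketch of what I mean (purely hypothetical, not how the tool currently works), excluding images by folder could be a simple prefix check on the URL path. The `keep_image` function and the example URLs are assumptions for illustration only.

```python
from urllib.parse import urlparse


def keep_image(image_url, excluded_paths):
    """Return True unless the image's path starts with an excluded folder (hypothetical)."""
    path = urlparse(image_url).path
    return not any(path.startswith(prefix) for prefix in excluded_paths)


excluded = ["/assets/img/"]
found = [
    "http://example.com/assets/img/icon.png",   # cosmetic, should be dropped
    "http://example.com/media/product.jpg",     # content, should be kept
    "http://example.com/uploads/banner.jpg",    # content, should be kept
]

kept = [url for url in found if keep_image(url, excluded)]
print(kept)
```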

c4software commented 7 years ago

Should be better now.

c4software commented 7 years ago

e.g.:

[…]
    <url>
        <loc>http://blog.lesite.us/el-capitan-personnaliser-la-disposition-de-clavier-du-login-screen.html</loc>
        <lastmod>2016-07-13T14:36:32+00:00</lastmod>
        <image:image>
            <image:loc>http://blog.lesite.us/theme/images/avatar.jpg</image:loc>
        </image:image>
    </url>
[…]

wernerb90 commented 7 years ago

Hi @c4software

Looks great, thanks, will check and revert.

c4software commented 7 years ago

I will rework this to read the title attribute of the image (if present) and populate the sitemap with it. It's optional, but I think it will be better…
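A minimal sketch of the idea, using the standard library `html.parser` module rather than whatever the crawler uses internally. The `ImageCollector` class and the sample HTML are hypothetical and only show how a `src` could be paired with an optional `title`.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect (src, title) pairs from <img> tags; title is None when absent."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        if "src" in attributes:
            self.images.append((attributes["src"], attributes.get("title")))


collector = ImageCollector()
collector.feed('<div><img src="/a.jpg" title="Avatar"><img src="/b.jpg"></div>')
print(collector.images)
```

Since the title is optional, an image without one would simply be written with its location only.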

wernerb90 commented 7 years ago

Hi @c4software

Looking great so far. <image:loc> is now going in to the correct place, but I noticed images are still picked up as pages as well - i.e. <url><loc> - not sure if that is accurate according to the spec?

Perhaps it's because my images have query string parameters? (Images are dynamically sized, so they have "h" and "w" parameters after the filename.)

c4software commented 7 years ago

Strange, do you have an example URL?

wernerb90 commented 7 years ago

An example:

https://www.xxxxxxxxxx/imgs/files/images/a95d_1_13e7b_original.jpg?h=450&w=685&zc=1&fltr=usm|80|0.5|3

I have managed to get the desired result by adding the --exclude "/imgs/" parameter, as these images will always be under this path. I noticed exclude doesn't get checked for the --images flag, which is perfect for my use case.
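In case it helps narrow this down: if the crawler decides what counts as an image by looking at the end of the whole URL (an assumption on my side, not something confirmed in this thread), a query string would hide the extension. A sketch of checking only the path instead, with a hypothetical `looks_like_image` function, an assumed extension list and a placeholder domain:

```python
import posixpath
from urllib.parse import urlparse

# Assumed list of extensions, for illustration only.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def looks_like_image(url):
    """Check the extension of the URL path, ignoring any query string (hypothetical)."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in IMAGE_EXTENSIONS


url = "https://www.example.com/imgs/files/images/a95d_1_13e7b_original.jpg?h=450&w=685&zc=1"

naive_match = url.lower().endswith(".jpg")   # misses because of the query string
path_match = looks_like_image(url)           # looks at the path only
print(naive_match, path_match)
```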

c4software commented 7 years ago

Strange... Are the image links referenced inside an <a> element?

wernerb90 commented 7 years ago

Hi @c4software

No - it's just div -> div -> div -> img

c4software commented 7 years ago

Hi,

Thanks for the feedback. I can't reproduce the behavior, but I have made some modifications. Can you test it again?

wernerb90 commented 7 years ago

Hi @c4software

Thanks, will test the update. Some further feedback: the image sitemap failed validation with Google Webmaster Tools because of the ampersands in my image URLs. They need to be escaped and presented as "&amp;", or the URL needs to be wrapped in CDATA tags.

A stackoverflow reference for the error is below, as it's just xml parsing in this instance. http://stackoverflow.com/questions/23422316/xml-validation-error-entityref-expecting
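A small sketch of the escaping side, using `xml.sax.saxutils.escape` from the Python standard library. The URL is a placeholder and the snippet is only meant to show the difference between the raw and the escaped value, not the tool's actual output code.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Placeholder URL with query string parameters, similar in shape to mine.
image_url = "https://www.example.com/imgs/photo.jpg?h=450&w=685&zc=1"

raw = f"<loc>{image_url}</loc>"
escaped = f"<loc>{escape(image_url)}</loc>"

# The raw ampersands make the document ill-formed XML.
try:
    ET.fromstring(raw)
    raw_is_valid = True
except ET.ParseError:
    raw_is_valid = False

# The escaped version parses, and the parser hands back the original URL.
parsed = ET.fromstring(escaped)
print(raw_is_valid)
print(escaped)
print(parsed.text == image_url)
```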

c4software commented 7 years ago

Ah yes, you're right. I made the first version too fast...

c4software commented 7 years ago

I will make the modification tomorrow morning. I just forgot about cases like yours.

c4software commented 7 years ago

Hi,

Should be better now.

c4software commented 7 years ago

For me it's fixed.

Feel free to reopen if you find any more related issues.