Sitemap feature breaks with non-ANSI characters in image URLs

randombumper commented 10 years ago

XML in sitemap tries to convert image URLS like http://example.com/limón.jpg to http://example.com/limón.jpg. Since ó and the like are not valid XML entities, the parser breaks.

I worked around it by changing the plugin code. In version 1.4.24 (current), file class-sitemaps.php, line 846:

esc_html( $img['src'] )

becomes

esc_html( html_entity_decode( $img['src'], ENT_QUOTES, get_bloginfo( 'charset' ) ) )

Everything seems to be working now. Any feedback?

Thanks,

Pablo.

barrykooij commented 10 years ago

Thanks for reporting this, I will have a look at this in the next few days.

jrfnl commented 10 years ago

@barrykooij you may want to have a look here for some useful info about these kind of issues (and possible solution direction).

jdevalk commented 10 years ago

The Google doc for this: https://support.google.com/webmasters/answer/35653?hl=en

jrfnl commented 10 years ago

As the doc says: the url should be should be encoded not entity escaped, so esc_url() would be the way to go ;-)

jdevalk commented 10 years ago

Unfortunately, esc_url() doesn't do anything when you run it on http://example.com/limón.jpg. I think we'll need to write code that encodes the special chars...

jrfnl commented 10 years ago

I'm leaning towards submitting a patch to WP core for an esc_xml() function for the entity escaping for xml - should be used for output to RSS feeds as well after all. Internally we could implement the same function with a function_exists() around it to accommodate WP versions from before the patch.

Quite apart from that we should urlencode the urls in the sitemap. In that respect: in the VideoSEO detail retrieval abstract class, there is a good example of a solid urlencoding function which I added a while back. Something along those lines would/should work.

jdevalk commented 10 years ago

Hmm, any chance you could take the lead on this one @jrfnl ? No rush as far as I'm concerned as it's relatively "niche" for now.

jrfnl commented 10 years ago

It's been on my list for months already, just haven't gotten round to it yet ;-)

tristanleboss commented 9 years ago

It would be nice if the fix provided by randombumper could be committed. Indeed, it does solve the problem and I see no reason why it would create more troubles: I tried a few use cases and found no regression.

What I mean is, without this fix, XML sitemaps are just plain invalid for Google (I discovered that in the Webmaster Tools)... even if the plugin does generate them "successfully". So, it really can't be worse!

Anyway, the HTML entities are totally banned from XML: they should not be in it at any price. That's perfect because html_entity_decode does exactly the job of converting them back to the correct character with the correct encoding they should have been encoded to (get_bloginfo( 'charset' )).

The fact that some filenames contain html entities is just due to the TinyMCE editor: I looked directly at the post_content field in the database and, even if the field is using the correct collation utf8mb4, even if the filenames are UTF-8 on my server (I double checked that), some filenames are using html entities and some are not. I can even have both filenames with and without html entities in the same post! Probably because some posts have been edited over various browsers and various WordPress versions.

string(136) "http://lightzoomlumiere.fr/wp-content/uploads/2014/05/Le-roi-se-meurt-Compagnie-Biloxi-48-Bruxelles-Belgique-Photo-Nathalie-Borlée-2.jpg"

string(137) "http://lightzoomlumiere.fr/wp-content/uploads/2014/05/Le-roi-se-meurt-Compagnie-Biloxi-48-Bruxelles-Belgique-Photo-Nathalie-BorlÃ©e-2.jpg"

<?php

// Latin-1 WITHOUT html entity
// Return: as-is, untouched
var_dump( html_entity_decode('http://lightzoomlumiere.fr/wp-content/uploads/2014/05/Le-roi-se-meurt-Compagnie-Biloxi-48-Bruxelles-Belgique-Photo-Nathalie-Borlée-2.jpg', ENT_QUOTES, 'UTF-8') );

// UTF-8 string WITHOUT html entity (Latin-1 version encoded with utf8_encode
// Return: as-is, untouched
var_dump( html_entity_decode(utf8_encode('http://lightzoomlumiere.fr/wp-content/uploads/2014/05/Le-roi-se-meurt-Compagnie-Biloxi-48-Bruxelles-Belgique-Photo-Nathalie-Borlée-2.jpg'), ENT_QUOTES, 'UTF-8') );

// Latin-1 WITH HTML entity
// Return: UTF-8 string, bynary equal to the second example so the job was correctly done
var_dump( html_entity_decode('http://lightzoomlumiere.fr/wp-content/uploads/2014/05/Le-roi-se-meurt-Compagnie-Biloxi-48-Bruxelles-Belgique-Photo-Nathalie-Borl&eacute;e-2.jpg', ENT_QUOTES, 'UTF-8') );

// UTF-8 WITH HTML entity (Latin-1 version encoded with utf8_encode)
// Return: UTF-8 string, bynary equal to the second example so the job was correctly done
var_dump( esc_url(html_entity_decode(utf8_encode('http://lightzoomlumiere.fr/wp-content/uploads/2014/05/Le-roi-se-meurt-Compagnie-Biloxi-48-Bruxelles-Belgique-Photo-Nathalie-Borl&eacute;e-2.jpg'), ENT_QUOTES, 'UTF-8')) );

Rarst commented 8 years ago

Google has a new version of the document https://support.google.com/webmasters/answer/183668?hl=en&rd=1 which confusingly refers to the old version for escaping, which redirects to a new version.

Frankly this lost me completely as to how exactly they expect UTF-8 but with only ASCII characters allowed.

The code suggested in original post seems to result in UTF-8 (given UTF-8 in WP), but not ASCII.

Rarst commented 8 years ago

So far the closest I figured out to RFC 3986 they name is PHP's rawurlencode(), but the problem is it's meant to be run on parts of URL, not the whole URL (in which case it messes up slashes in protocol and stuff).

Rarst commented 8 years ago

I asked around and @Giuseppe-Mazzapica shared formatUrl() function he is using for this issue. It indeed involves tearing URL apart completely and reprocessing each piece.

Keeping this in mind, but I am reluctant to just drop this in. There should probably be related prep work for sitemaps to be always output in UTF-8 (I think we have that noted somewhere) and a little concerned about running it on every single link in large sitemaps. Maybe we could add detection step if the URL needs encoding.

gmazzap commented 8 years ago

Yes, I have to say that I have that in place as part of a class where other method ensures urls are UTF-8. Moreover, there sitemaps are generated in a CLI context that runs over cron (Unix cron, not WP Cron) so I had very little concern about performance.

Rarst commented 8 years ago

Leaving as a note for myself FILTER_VALIDATE_URL will check for ASCII only symbols so it can be used to check if URL needs encoding.

Yoast / wordpress-seo

Sitemap feature breaks with non-ANSI characters in image URLs #595