Closed. randombumper closed this issue 8 years ago
Thanks for reporting this, I will have a look at this in the next few days.
@barrykooij you may want to have a look here for some useful info about this kind of issue (and a possible direction for a solution).
The Google doc for this: https://support.google.com/webmasters/answer/35653?hl=en
As the doc says: the URL should be encoded, not entity-escaped, so esc_url() would be the way to go ;-)
Unfortunately, esc_url() doesn't do anything when you run it on http://example.com/limón.jpg. I think we'll need to write code that encodes the special chars...
I'm leaning towards submitting a patch to WP core for an esc_xml() function that does the entity escaping for XML. It should be used for output to RSS feeds as well, after all.
Internally we could implement the same function with a function_exists() check around it to accommodate WP versions from before the patch.
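A minimal sketch of what such a guarded fallback could look like. The function body is an assumption for illustration only (plain htmlspecialchars() with the XML flag), not the patch discussed here, and it assumes the input is already UTF-8:

```php
<?php
// Hypothetical fallback: only defined when nothing else already provides esc_xml().
if ( ! function_exists( 'esc_xml' ) ) {
	/**
	 * Escape text for use inside an XML element or attribute value.
	 * Assumption: $text is UTF-8. Only the characters XML reserves are touched.
	 */
	function esc_xml( $text ) {
		return htmlspecialchars( (string) $text, ENT_QUOTES | ENT_XML1, 'UTF-8' );
	}
}

echo esc_xml( 'a & b < c' ), "\n";
```

Note that this sketch does not decode HTML named entities that are already present in the input; those would need to be decoded first or they end up double-escaped.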
Quite apart from that we should urlencode the urls in the sitemap. In that respect: in the VideoSEO detail retrieval abstract class, there is a good example of a solid urlencoding function which I added a while back. Something along those lines would/should work.
Hmm, any chance you could take the lead on this one @jrfnl ? No rush as far as I'm concerned as it's relatively "niche" for now.
It's been on my list for months already, just haven't gotten round to it yet ;-)
It would be nice if the fix provided by randombumper could be committed. It does solve the problem, and I see no reason why it would cause more trouble: I tried a few use cases and found no regressions.
What I mean is, without this fix, XML sitemaps are just plain invalid for Google (I discovered that in the Webmaster Tools)... even if the plugin does generate them "successfully". So, it really can't be worse!
Anyway, HTML named entities are simply not allowed in XML: apart from the five predefined ones (&amp;, &lt;, &gt;, &quot;, &apos;) they are undefined there, so they should never appear in a sitemap. That works out nicely, because html_entity_decode does exactly the job of converting them back to the correct character, in the encoding they should have had in the first place (get_bloginfo( 'charset' )).
The fact that some filenames contain HTML entities is just due to the TinyMCE editor: I looked directly at the post_content field in the database and, even though the field uses the correct utf8mb4 encoding and the filenames are UTF-8 on my server (I double-checked that), some filenames use HTML entities and some do not. I can even have filenames with and without HTML entities in the same post! Probably because some posts have been edited in various browsers and under various WordPress versions.
<?php
// Latin-1 string WITHOUT HTML entity
// Returns: as-is, untouched
var_dump( html_entity_decode( 'http://lightzoomlumiere.fr/wp-content/uploads/2014/05/Le-roi-se-meurt-Compagnie-Biloxi-48-Bruxelles-Belgique-Photo-Nathalie-Borlée-2.jpg', ENT_QUOTES, 'UTF-8' ) );
// UTF-8 string WITHOUT HTML entity (Latin-1 version encoded with utf8_encode)
// Returns: as-is, untouched
var_dump( html_entity_decode( utf8_encode( 'http://lightzoomlumiere.fr/wp-content/uploads/2014/05/Le-roi-se-meurt-Compagnie-Biloxi-48-Bruxelles-Belgique-Photo-Nathalie-Borlée-2.jpg' ), ENT_QUOTES, 'UTF-8' ) );
// Latin-1 string WITH HTML entity
// Returns: UTF-8 string, binary-equal to the second example, so the job was done correctly
var_dump( html_entity_decode( 'http://lightzoomlumiere.fr/wp-content/uploads/2014/05/Le-roi-se-meurt-Compagnie-Biloxi-48-Bruxelles-Belgique-Photo-Nathalie-Borl&eacute;e-2.jpg', ENT_QUOTES, 'UTF-8' ) );
// UTF-8 string WITH HTML entity (Latin-1 version encoded with utf8_encode)
// Returns: UTF-8 string, binary-equal to the second example, so the job was done correctly
var_dump( esc_url( html_entity_decode( utf8_encode( 'http://lightzoomlumiere.fr/wp-content/uploads/2014/05/Le-roi-se-meurt-Compagnie-Biloxi-48-Bruxelles-Belgique-Photo-Nathalie-Borl&eacute;e-2.jpg' ), ENT_QUOTES, 'UTF-8' ) ) );
Google has a new version of the document, https://support.google.com/webmasters/answer/183668?hl=en&rd=1, which confusingly refers to the old version for escaping, which in turn redirects to a new version.
Frankly, this lost me completely: how exactly do they expect UTF-8 with only ASCII characters allowed?
The code suggested in the original post seems to result in UTF-8 (given UTF-8 in WP), but not ASCII.
So far, the closest thing I have found to the RFC 3986 they name is PHP's rawurlencode(), but the problem is that it's meant to be run on parts of a URL, not the whole URL (in which case it messes up the slashes after the protocol and so on).
I asked around and @Giuseppe-Mazzapica shared the formatUrl() function he is using for this issue. It does indeed involve tearing the URL apart completely and reprocessing each piece.
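To make the "tear apart and reprocess" idea concrete, here is a rough sketch. It is not the formatUrl() function mentioned above; encode_url_components() is a hypothetical name, and the sketch assumes an absolute, UTF-8 encoded URL with an ASCII host (no IDN handling, and user/password parts are dropped):

```php
<?php
/**
 * Hypothetical helper: percent-encode path, query and fragment separately,
 * so the delimiters ("://", "/", "?", "#") stay intact.
 */
function encode_url_components( $url ) {
	$parts = parse_url( $url );
	if ( false === $parts || empty( $parts['scheme'] ) || empty( $parts['host'] ) ) {
		return $url; // Not something this sketch knows how to rebuild.
	}

	// Encode every byte outside printable ASCII, leave everything else alone.
	$encode_non_ascii = function ( $string ) {
		return preg_replace_callback(
			'/[^\x21-\x7E]/',
			function ( $m ) {
				return rawurlencode( $m[0] );
			},
			$string
		);
	};

	$out = $parts['scheme'] . '://' . $parts['host'];
	if ( isset( $parts['port'] ) ) {
		$out .= ':' . $parts['port'];
	}

	if ( isset( $parts['path'] ) ) {
		// Handle each segment on its own so the slashes survive.
		// rawurldecode() first, so already-encoded segments are not encoded twice.
		$segments = explode( '/', $parts['path'] );
		foreach ( $segments as $i => $segment ) {
			$segments[ $i ] = rawurlencode( rawurldecode( $segment ) );
		}
		$out .= implode( '/', $segments );
	}

	if ( isset( $parts['query'] ) ) {
		$out .= '?' . $encode_non_ascii( $parts['query'] );
	}
	if ( isset( $parts['fragment'] ) ) {
		$out .= '#' . $encode_non_ascii( $parts['fragment'] );
	}

	return $out;
}

echo encode_url_components( "http://example.com/lim\xC3\xB3n.jpg" ), "\n";
```

Decoding each path segment before encoding it is a design choice to keep the function safe to run on URLs that are already percent-encoded; the cost is that a literal "%" in a filename needs care.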
I'm keeping this in mind, but I am reluctant to just drop it in. There should probably be related prep work so that sitemaps are always output in UTF-8 (I think we have that noted somewhere), and I am a little concerned about running it on every single link in large sitemaps. Maybe we could add a detection step that checks whether a URL needs encoding.
Yes, I have to say that I have that in place as part of a class where another method ensures URLs are UTF-8. Moreover, the sitemaps there are generated in a CLI context that runs from cron (Unix cron, not WP-Cron), so I had very little concern about performance.
Leaving this as a note for myself: FILTER_VALIDATE_URL only accepts ASCII characters, so it can be used to check whether a URL needs encoding.
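A small sketch of that detection step, assuming absolute URLs. url_needs_encoding() is a hypothetical name. Keep in mind that FILTER_VALIDATE_URL also rejects URLs that are invalid for other reasons (relative URLs, a missing scheme), so a true result here only means "do not output as-is":

```php
<?php
/**
 * Hypothetical helper: true when filter_var() rejects the URL,
 * which is the case for raw non-ASCII characters.
 */
function url_needs_encoding( $url ) {
	return false === filter_var( $url, FILTER_VALIDATE_URL );
}

var_dump( url_needs_encoding( "http://example.com/lim\xC3\xB3n.jpg" ) ); // bool(true)
var_dump( url_needs_encoding( 'http://example.com/lim%C3%B3n.jpg' ) );  // bool(false)
```

This keeps the expensive per-component encoding off the common path where URLs are already plain ASCII.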
The XML sitemap tries to convert image URLs like http://example.com/limón.jpg to http://example.com/lim&oacute;n.jpg. Since &oacute; and the like are not valid XML entities, the parser breaks.
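The failure can be reproduced outside the plugin with a few lines. This is only an illustration under assumptions: it uses a bare loc element rather than a full sitemap, relies on the SimpleXML extension, and the decode-then-escape step is one possible fix, not necessarily the exact change described in this post:

```php
<?php
libxml_use_internal_errors( true ); // Collect parse errors instead of emitting warnings.

$url = 'http://example.com/lim&oacute;n.jpg';

// As emitted with the HTML entity: &oacute; is undefined in XML.
$broken = '<loc>' . $url . '</loc>';

// Decode the HTML entity to a real UTF-8 character, then escape only what XML reserves.
$decoded = html_entity_decode( $url, ENT_QUOTES, 'UTF-8' );
$fixed   = '<loc>' . htmlspecialchars( $decoded, ENT_QUOTES | ENT_XML1, 'UTF-8' ) . '</loc>';

var_dump( false === simplexml_load_string( $broken ) ); // true when the parser rejects the document
var_dump( false === simplexml_load_string( $fixed ) );  // false when the document parses
```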
I worked around it by changing the plugin code. In version 1.4.24 (current), file class-sitemaps.php, line 846:
becomes
Everything seems to be working now. Any feedback?
Thanks,
Pablo.