apache / incubator-pagespeed-ngx

Automatic PageSpeed optimization module for Nginx
http://ngxpagespeed.com/
Apache License 2.0

convert_jpeg_to_webp does not work with Cyrillic URLs #1023

Open poiuty opened 9 years ago

poiuty commented 9 years ago

/upload/123123sfsdf.jpg => works: rewritten to /upload/123123sfsdf.jpg.pagespeed.ic.hae-ZFT29R.webp

/upload/тест.jpg => does not work: served as /upload/тест.jpg, without convert_jpeg_to_webp

oschaaf commented 9 years ago

@poiuty I don't think that PageSpeed's underlying HTML parser currently (fully) supports Cyrillic, so resources referenced in HTML via URLs containing Cyrillic characters will not be optimized right now.

The good news is that in-place optimization [1] will work for those resources with Cyrillic names. [1] https://developers.google.com/speed/pagespeed/module/system#ipro
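For reference, in-place resource optimization (IPRO) is controlled by a directive along these lines in an ngx_pagespeed config (a sketch; it is enabled by default in recent releases):

```nginx
# Optimize resources in-place as their bytes pass through the server,
# even when the URL referencing them could not be rewritten in HTML.
pagespeed InPlaceResourceOptimization on;
```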

jeffkaufman commented 9 years ago

Character encodings are a bit of a mess: the encoding metadata we see may not be the metadata the browser sees. I believe what we do is work on anything where bytes below 127 are ASCII, which includes plain ASCII and UTF-8. Supporting utf8 URLs, though, might not be too bad an extension? This would be nice to add.
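Jeff's criterion, that any encoding whose bytes below 127 coincide with ASCII is safe to scan, can be checked directly: UTF-8 is ASCII-transparent, and its non-ASCII characters use only bytes >= 0x80 (a small illustrative sketch in Python; the Cyrillic sample is hypothetical):

```python
# UTF-8 is ASCII-transparent: every ASCII character encodes to
# the same single byte it has in plain ASCII.
ascii_chars = bytes(range(128)).decode("ascii")
assert ascii_chars.encode("utf-8") == ascii_chars.encode("ascii")

# Non-ASCII characters encode only to bytes >= 0x80, so an
# ASCII-oriented scanner never mistakes part of a multibyte
# sequence for a markup character like '<' or '"'.
cyrillic = "тест"
assert all(b >= 0x80 for b in cyrillic.encode("utf-8"))
```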

But yes, as @oschaaf says, this isn't a misconfiguration on your part, it's a PageSpeed limitation.

jmarantz commented 9 years ago

Clarification: the HTML parser tries to be encoding-agnostic so it can work with markup in Russian and Chinese charsets without transcoding to utf8, which would cost CPU, increase latency, and possibly increase the data size served to the browser.

The limitation on this system is that we cannot reliably manipulate non-ascii URLs and send them into an http fetch system, so right now we don't try, even for utf8.

We could, without changing the lexer, try to transcode URLs from whatever charset we think they are in into utf8, and then reencode them into their origin charset when we put the altered URLs back into html.
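The transcode-and-reencode idea could look roughly like this (a hypothetical sketch in Python; the charset, URL, and "HASH" suffix are all placeholders, not PageSpeed's actual code path):

```python
# Decode the raw attribute bytes from the document charset to Unicode,
# manipulate the URL, then re-encode into the origin charset when
# writing the altered URL back into the HTML.
def rewrite_url_bytes(raw, doc_charset):
    url = raw.decode(doc_charset)                 # origin charset -> Unicode
    rewritten = url + ".pagespeed.ic.HASH.webp"   # placeholder manipulation
    return rewritten.encode(doc_charset)          # Unicode -> origin charset

# A Cyrillic URL stored in a legacy Russian charset survives the round trip:
raw = "/upload/тест.jpg".encode("koi8-r")
out = rewrite_url_bytes(raw, "koi8-r")
assert out.decode("koi8-r") == "/upload/тест.jpg.pagespeed.ic.HASH.webp"
```

As the comment notes, the hard part is not the round trip itself but knowing `doc_charset` reliably in the first place.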

There are TODOs in the code for this. But we'd have to link in a large i18n library for the charset support, and depend on the HTTP headers and meta tags to tell us what the document encoding is. The risk is that the charset headers might be added to the HTML response headers downstream of PageSpeed, so we are not sure we can correctly determine the source charset.

Independent of the challenges, this hasn't popped to the top of the priority queue for our small team yet.

So in the meantime, PageSpeed can safely rewrite HTML in ASCII, utf8, Chinese, and Russian charsets; however, it will simply avoid touching URLs with non-ASCII characters. As Jeff said, these assets can still be optimized in-place, which is on by default in 1.9.

Maybe we could try to special case utf8 documents and allow URL rewriting in those, without transcoding charsets?

-Josh

jeffkaufman commented 9 years ago

Would it work to stick with the current system up to identifying the bytes corresponding to the URL, and then fetch that over the network as-is (well, with url-encoding)? And then, if that works, append ".pagespeed.FILTER.HASH.EXT" (as usual) to the bytestream?

The idea is, I'm not sure we actually need to know what the encoding of the page is, as long as it's one of the ones where printable ascii characters are represented with ascii bytes.
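Jeff's proposal, treating the URL as opaque bytes, percent-encoding them only for the network fetch, and appending the suffix to the original byte stream, might be sketched like this (Python; the suffix, safe-character set, and URL are illustrative assumptions):

```python
from urllib.parse import quote

def fetch_url_and_suffix(url_bytes):
    # Percent-encode the raw bytes for the network fetch, leaving
    # ordinary URL structure characters intact.
    fetchable = quote(url_bytes, safe="/.~-_")
    # If the fetch succeeds, append the rewritten-resource suffix
    # to the *original* bytes, without ever decoding them.
    rewritten = url_bytes + b".pagespeed.ic.HASH.webp"
    return fetchable, rewritten

fetchable, rewritten = fetch_url_and_suffix("/upload/тест.jpg".encode("utf-8"))
# The fetchable form is pure ASCII; the rewritten form keeps the raw bytes.
assert fetchable.isascii()
assert rewritten.startswith("/upload/тест.jpg".encode("utf-8"))
```

This works for any encoding where the bytes below 127 are ASCII, which is exactly the class of documents PageSpeed already handles.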

oschaaf commented 8 years ago

Looks like another report for this issue: https://groups.google.com/forum/#!topic/ngx-pagespeed-discuss/kiCujxpzQaE

behnoodk commented 8 years ago

Is there any way to know whether the development team has any plans to fix/implement this in near-future releases? Our site is getting ready for production, and it would be great if we knew whether we will be able to use PageSpeed or not. We have Arabic/Persian characters in almost all our URLs, which makes PageSpeed unusable for us right now.

jmarantz commented 8 years ago

@Jeff "would it work to take the bytes from the HTML attribute value and send them to the network"? Not in all cases; there can be HTML escapes in there that would need to be decoded before fetching, e.g. "foo.jpg&amp;attr=value" must become "foo.jpg&attr=value". BTW the encoding in the .pagespeed. URL is not a problem; we can easily comma-escape any non-ASCII characters. The problem is when we try to initiate the HTTP fetch.
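Josh's objection can be illustrated in one line: the bytes in the markup are not the bytes to fetch until HTML entities have been decoded (a sketch in Python; the URL is illustrative):

```python
import html

# In markup, '&' inside an attribute value is written as '&amp;',
# so the attribute bytes must be entity-decoded before fetching.
attr_value = "foo.jpg&amp;attr=value"
fetch_url = html.unescape(attr_value)
assert fetch_url == "foo.jpg&attr=value"
```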

That also means that (I believe) we can today compress/minify URLs with non-ascii characters in them using the in-place flow. This is because we are not trying to extract a URL from HTML and formulate a request, but are instead just collecting the bytes as they pass through Apache.

@Behnood: I think this is a tractable feature request for utf-8. Are your Arabic/Persian characters encoded in utf-8?

-Josh


behnoodk commented 8 years ago

@jmarantz yes, the characters that make up a URL are retrieved from the db, which is utf-8 encoded. I hope I understood your question correctly.

jeffkaufman commented 8 years ago

Added to https://github.com/pagespeed/mod_pagespeed/wiki/Work-Prioritization

jmarantz commented 8 years ago

Yeah I think it makes sense in particular to work on this for utf-8 first. I'm still not 100% sure what we have to do to utf-8 encoded URLs before passing them to the Serf fetcher.
