ezyang / htmlpurifier

Standards compliant HTML filter written in PHP
http://htmlpurifier.org
GNU Lesser General Public License v2.1
3.03k stars 323 forks source link

Fix PHP 8.2 deprecated utf8_encode & utf8_decode #324

Closed SharkMachine closed 1 year ago

SharkMachine commented 2 years ago

Removed usages of deprecated functions utf8_decode & utf8_encode

This definitely isn't the way to solve the issue in the long run, but would at least keep backwards compatibility with minimal changes apart from dropping PHP 5.2 support. Removing utf8_encode and utf8_decode completely would require quite a bit of changes to related tests and would also break backwards compatibility.

ezyang commented 2 years ago

Where can I read about the deprecation? It doesn't appear to be on the docs: https://www.php.net/manual/en/function.utf8-decode.php

I'm generally not so hot on adding a polyfill dep. I'd rather just do a function_exists test and not bother with support if it isn't installed. This should also prevent the need for test updates.

bytestream commented 2 years ago

https://wiki.php.net/rfc/remove_utf8_decode_and_utf8_encode

ezyang commented 2 years ago

OK, so this suggests using mb_convert_encoding, can we use that instead?

bytestream commented 2 years ago

We could but that requires ext-mbstring which is currently not used / required. It's also not bundled in PHP by default - https://www.php.net/manual/en/mbstring.installation.php

ezyang commented 2 years ago

It looks like UConverter is bundled by default in php 5.3 or later so that would be a good pick too

zerocrates commented 2 years ago

ext/intl is bundled since 5.3 but I believe UConverter wasn't added until later: docs indicate 5.5. Also, I think the reference to intl being "bundled" means only that its source is shipped with PHP's (it was previously a PECL extension), not that it's built by default. That's just to say that mbstring is also a "bundled" extension in the same sense that intl is: both are included in the PHP source distribution and both are not built by default. Actually, ext/intl has an external dependency on ICU, whereas PHP directly includes the dependency library for ext/mbstring, so intl is a little "less bundled" than mbstring is.

utf8_encode and _decode are very simple functions since they only support going to/from ISO 8559-1, so you could also feasibly just directly include substitute methods for them... they'd come to something in the tens of lines.

ezyang commented 2 years ago

ok if someone wants to copy paste in the polyfill that's cool too

bytestream commented 2 years ago

I think the simplest solution is to keep using utf8_*() functions on PHP 5.2 - 8.1, then try to use UConverter or some polyfill on PHP 8.2+. From https://www.php.net/manual/en/intl.installation.php it looks to me like UConverter is bundled and enabled by default in PHP 5.5+

It's also worth noting that the utf8_*() functions should very rarely be used, http://htmlpurifier.org/live/configdoc/plain.html#Core.Encoding defaults to utf-8 and notably utf8_*() only run if iso-8859-1 is set and ext-iconv isn't enabled:

This directive only accepts ISO-8859-1 if iconv is not enabled.

ezyang commented 2 years ago

yeah I'm also ok with dumping this support too without iconv

SharkMachine commented 2 years ago

I tried replacing utf8_with mbstring and iconv, it didn't work. utf8_ works differently

Edit: that change caused unit tests to fail

SharkMachine commented 1 year ago

Looks like this was solved by #326, big thanks to @zerocrates.