v2.0.0 data corruption in non-UTF-8 apps as a result of dependency voku/portable-utf8 autoloader setting default_charset and mb_internal_encoding to "UTF-8" #4022
- [* ] a bug report
- [ ] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)
What is the expected behavior?
That no packages change global settings of any kind.
What is the current behavior?
That mb_internal_encoding is being set to UTF-8 by dependency voku/portable-utf8/src/voku/helper/UTF8.php
What are the steps to reproduce?
Set mb_internal_encoding() to anything but UTF-8, let the composer autoloader do it's thing, and then check mb_internal_encoding().
It should still be set to what you set it to, but you'll see that it has been changed to UTF-8.
You don't even have to load/use any PhpOffice code first.
<?php
mb_internal_encoding('cp1252');
error_log('before autoload: ' . mb_internal_encoding());
require __DIR__ . '/vendor/autoload.php';
error_log('after autoload: ' . mb_internal_encoding());
# what you actually see:
# before autoload: Windows-1252
# after autoload: UTF-8
# what you should see:
# before autoload: Windows-1252
# after autoload: Windows-1252
If this is an issue with reading a specific spreadsheet file, then it may be appropriate to provide a sample file that demonstrates the problem; but please keep it as small as possible, and sanitize any confidential information before uploading.
What features do you think are causing the issue
The composer autoloader calls voku/portable-utf8/bootstrap.php, which in turn calls \voku\helper\UTF8::checkForSupport(); which calls mb_internal_encoding('UTF-8'); Also \voku\helper\Bootstrap.php performs an ini_set('default_charset', 'UTF-8') call.
To be clear, one doesn't even have to call any PhpOffice or voku code to experience this issue. Having either phpoffice/phpspreadsheet or voku/portable-utf8 present in composer.json is sufficient.
There is only 1 line in the whole phpoffice/phpspreadsheet project that requires the large voku dependency for cleaning xss from text when writing a worksheet comment. Surely there is a better way to filter xss via one of the many other html / xss filtering packages. Or if it were up to me, just leave it up to the responsibility of the caller to clean xss if and how they want to. I think that's the more hassle-free approach.
Does an issue affect all spreadsheet file formats? If not, which formats are affected?
Yes and it affects much more than that too: the whole application and all data passing through it.
Which versions of PhpSpreadsheet and PHP are affected?
2.0.0
Background info
Recently after upgrading some packages, some database data corruption issues started popping up in an application that had been running perfectly fine for ages.
The application initializes various global settings at start up, such as the timezone, mb_internal_encoding(), the ini setting default_charset, etc. The database connection too sets the connection encoding and collation appropriately based on the mb_internal_encoding().
After spending time debugging I discovered that the database connection charset was being set to UTF-8, instead of latin1.
Then I found the culprits: voku/portable-utf8/src/voku/helper/UTF8.php and voku/portable-utf8/src/voku/helper/Boostrap.php
They blatantly call mb_internal_encoding('UTF-8') and ini_set('default_charset', 'UTF-8') during the autoloader phase.
That is really bad practice imho, so I'd recommend removing/replacing that dependency all together. I just can't trust a package like that. Who know's what other global settings it changes.
I have no choice now but to roll back all updates of phpoffice/phpspreadsheet 2.0 to the latest 1.* in every single application that I've updated.
I could alternatively call mb_internal_encoding() again after calling the autoloader, but that could lead to unpredictable functionality in both voku/portable-utf8 and phpoffice/phpspreadsheet.
Update, TL;DR; I just found a recent unreleased commit that fixed the issue I'm having: https://github.com/PHPOffice/PhpSpreadsheet/commit/f7cf378faed2e11cf4825bf8bafea4922ae44667 which lead me to it's ticket: https://github.com/PHPOffice/PhpSpreadsheet/pull/3957 which lead me to another ticket having similar problems with the voku dependency: https://github.com/PHPOffice/PhpSpreadsheet/issues/3954 which makes me wish composer had a package blacklist feature to prevent rogue packages from creeping in as dependencies.
This is:
What is the expected behavior?
That no packages change global settings of any kind.
What is the current behavior?
That mb_internal_encoding is being set to UTF-8 by dependency voku/portable-utf8/src/voku/helper/UTF8.php
What are the steps to reproduce?
Set mb_internal_encoding() to anything but UTF-8, let the composer autoloader do it's thing, and then check mb_internal_encoding(). It should still be set to what you set it to, but you'll see that it has been changed to UTF-8. You don't even have to load/use any PhpOffice code first.
If this is an issue with reading a specific spreadsheet file, then it may be appropriate to provide a sample file that demonstrates the problem; but please keep it as small as possible, and sanitize any confidential information before uploading.
What features do you think are causing the issue
The composer autoloader calls voku/portable-utf8/bootstrap.php, which in turn calls \voku\helper\UTF8::checkForSupport(); which calls mb_internal_encoding('UTF-8'); Also \voku\helper\Bootstrap.php performs an ini_set('default_charset', 'UTF-8') call.
To be clear, one doesn't even have to call any PhpOffice or voku code to experience this issue. Having either phpoffice/phpspreadsheet or voku/portable-utf8 present in composer.json is sufficient.
There is only 1 line in the whole phpoffice/phpspreadsheet project that requires the large voku dependency for cleaning xss from text when writing a worksheet comment. Surely there is a better way to filter xss via one of the many other html / xss filtering packages. Or if it were up to me, just leave it up to the responsibility of the caller to clean xss if and how they want to. I think that's the more hassle-free approach.
Does an issue affect all spreadsheet file formats? If not, which formats are affected?
Yes and it affects much more than that too: the whole application and all data passing through it.
Which versions of PhpSpreadsheet and PHP are affected?
2.0.0
Background info
Recently after upgrading some packages, some database data corruption issues started popping up in an application that had been running perfectly fine for ages. The application initializes various global settings at start up, such as the timezone, mb_internal_encoding(), the ini setting default_charset, etc. The database connection too sets the connection encoding and collation appropriately based on the mb_internal_encoding(). After spending time debugging I discovered that the database connection charset was being set to UTF-8, instead of latin1. Then I found the culprits: voku/portable-utf8/src/voku/helper/UTF8.php and voku/portable-utf8/src/voku/helper/Boostrap.php They blatantly call mb_internal_encoding('UTF-8') and ini_set('default_charset', 'UTF-8') during the autoloader phase. That is really bad practice imho, so I'd recommend removing/replacing that dependency all together. I just can't trust a package like that. Who know's what other global settings it changes.
I have no choice now but to roll back all updates of phpoffice/phpspreadsheet 2.0 to the latest 1.* in every single application that I've updated. I could alternatively call mb_internal_encoding() again after calling the autoloader, but that could lead to unpredictable functionality in both voku/portable-utf8 and phpoffice/phpspreadsheet.