PHPOffice / PhpSpreadsheet

A pure PHP library for reading and writing spreadsheet files
https://phpspreadsheet.readthedocs.io
MIT License
13.34k stars 3.47k forks source link

Html Entities in Excel 2003 File #2157

Closed roland-d closed 3 weeks ago

roland-d commented 3 years ago

This is:

- [x] a bug report

Not even sure if this is a bug report or a feature request as I could not find if this type of XLS file is supported.

What is the expected behavior?

PhpSpreadsheet to read the file as I can open it just fine in OpenOffice.

What is the current behavior?

Trying to read the file I get the error The filename products.xls is not recognised as an OLE file trying to read it as Xls. Trying to read it as Xml because the content is just XML, generates a lot of PHP warnings:

HP Warning:  simplexml_load_string(): Entity: line 1261: parser error : Entity 'iota' not defined in /vendor/phpoffice/phpspreadsheet/src/PhpSpreadsheet/Reader/Xml.php on line 120
PHP Warning:  simplexml_load_string(): Οι Κάνφιλντ και & in /vendor/phpoffice/phpspreadsheet/src/PhpSpreadsheet/Reader/Xml.php on line 120
PHP Warning:  simplexml_load_string():                                         ^ in /vendor/phpoffice/phpspreadsheet/src/PhpSpreadsheet/Reader/Xml.php on line 120
PHP Warning:  simplexml_load_string(): Entity: line 1261: parser error : Detected an entity reference loop in /vendor/phpoffice/phpspreadsheet/src/PhpSpreadsheet/Reader/Xml.php on line 120
PHP Warning:  simplexml_load_string(): Οι Κάνφιλντ και & in /vendor/phpoffice/phpspreadsheet/src/PhpSpreadsheet/Reader/Xml.php on line 120
PHP Warning:  simplexml_load_string():                                         ^ in /vendor/phpoffice/phpspreadsheet/src/PhpSpreadsheet/Reader/Xml.php on line 120

What are the steps to reproduce?

The file used in this example: ProductsAPI-14-06-2021.xls

Please provide a Minimal, Complete, and Verifiable example of code that exhibits the issue without relying on an external Excel file or a web server:

This is trying to read it as XLS:

<?php

use PhpOffice\PhpSpreadsheet\IOFactory;

require __DIR__ . '/vendor/autoload.php';
$reader        = IOFactory::createReader('Xls');
$reader->setLoadAllSheets();
$spreadsheet     = $reader->load('products.xls');

This is trying to read it as XML:

<?php

use PhpOffice\PhpSpreadsheet\IOFactory;

require __DIR__ . '/vendor/autoload.php';
$reader        = IOFactory::createReader('Xml');
$reader->setLoadAllSheets();
$spreadsheet     = $reader->load('products.xls');

I have also tried it with this code:

<?php

use PhpOffice\PhpSpreadsheet\IOFactory;

require __DIR__ . '/vendor/autoload.php';

$excel = IOFactory::load('products.xls');

That results in the same output as trying to read it as an XML file above.

Which versions of PhpSpreadsheet and PHP are affected?

1.18.0

MarkBaker commented 3 years ago

The Xls Reader will not read the file, because it isn't an xls file (Just giving a file an extension of .xls doesn't make it an ls file), it isn't BIFF-format, but Xml as you note. This is why we provide the IOFactory's identify() method, to determine the correct Reader to use because files are commonly misnamed with the wrong extension for their actual file type. The identify() method looks at the basic content and structure of the file to determine what it really is, not simply what the extension claims it is. The load() method uses identify() to determine which Reader should be used,

The Xml Reader will not load it however, because it isn't valid Xml. Entities like &iota; as HTML Entities, but aren't Xml Entities. It would only be valid if those entities were actually defined in the file itself, or were used within CDATA blocks. And the file itself is UTF-8, so there is no need to use HTML Entities in the file at all. In practise, we disallow the definition of entities within the Xml that we load, because allowing Entity definition also allows XXE/XEE attacks. Spreadsheet programs like Excel are clearly more forgiving, and will read this and convert the invalid HTML Entities. It might be possible to add a stream filter to the Xml Reader that would convert the HTML Entities to valid XML Entities before it's passed to simplexml_load_string(), but I'd have to investigate that further

roland-d commented 3 years ago

Hello Mark, thank you for the detailed explanation. I was wondering about the CDATA tags but makes sense they should be used in this case.

It might be possible to add a stream filter to the Xml Reader that would convert the HTML Entities to valid XML Entities before it's passed to simplexml_load_string(), but I'd have to investigate that further

Let me know if there is anything I can do to help.

oleibman commented 1 month ago

The following appears to work:

    public static function unentity(string $contents): string
    {
        $contents = preg_replace('/&(amp|lt|gt|quot|apos);/', "\u{fffe}\u{feff}\$1;", trim($contents)) ?? $contents;
        $contents = html_entity_decode($contents, ENT_NOQUOTES | ENT_SUBSTITUTE | ENT_HTML401, 'UTF-8');
        $contents = str_replace("\u{fffe}\u{feff}", '&', $contents);

        return $contents;
    }
    public static function copyAsXlsx() {
        $infile = 'issue.2157.xml';
        $contents = file_get_contents($infile);
        $contents = self::unentity($contents);
        $reader = new \PhpOffice\PhpSpreadsheet\Reader\Xml();
        $spreadsheet = $reader->loadSpreadsheetFromString($contents);
        // You should be able to do whatever you want to
        // the spreadsheet object here.
        $writer = new \PhpOffice\PhpSpreadsheet\Writer\Xlsx($spreadsheet);
        $outfile = 'issue.2157.unentity.xlsx';
        $writer->save($outfile);
        echo "saved $outfile\n";
    }

A case could be made for adding the unentity code to Xml Reader, either automatically or as an opt-in switch. I don't know that I'm prepared to make that case, but I'll leave this open and see if someone else is. Note that, in addition to the html entities problem, the supplied file isn't treated as valid Xml because it has whitespace preceding the <?xml declaration; that is why unentity has a trim function.

oleibman commented 1 month ago

For some reason, Security Scanner has a callback function. This makes it just too easy to implement 'unentity' as described above. I will open a PR to do so.