Closed roland-d closed 3 weeks ago
The Xls Reader will not read the file, because it isn't an xls file (Just giving a file an extension of .xls
doesn't make it an ls file), it isn't BIFF-format, but Xml as you note. This is why we provide the IOFactory's identify()
method, to determine the correct Reader to use because files are commonly misnamed with the wrong extension for their actual file type. The identify()
method looks at the basic content and structure of the file to determine what it really is, not simply what the extension claims it is. The load()
method uses identify()
to determine which Reader should be used,
The Xml Reader will not load it however, because it isn't valid Xml. Entities like ι
as HTML Entities, but aren't Xml Entities. It would only be valid if those entities were actually defined in the file itself, or were used within CDATA blocks. And the file itself is UTF-8, so there is no need to use HTML Entities in the file at all. In practise, we disallow the definition of entities within the Xml that we load, because allowing Entity definition also allows XXE/XEE attacks.
Spreadsheet programs like Excel are clearly more forgiving, and will read this and convert the invalid HTML Entities. It might be possible to add a stream filter to the Xml Reader that would convert the HTML Entities to valid XML Entities before it's passed to simplexml_load_string()
, but I'd have to investigate that further
Hello Mark, thank you for the detailed explanation. I was wondering about the CDATA tags but makes sense they should be used in this case.
It might be possible to add a stream filter to the Xml Reader that would convert the HTML Entities to valid XML Entities before it's passed to simplexml_load_string(), but I'd have to investigate that further
Let me know if there is anything I can do to help.
The following appears to work:
public static function unentity(string $contents): string
{
$contents = preg_replace('/&(amp|lt|gt|quot|apos);/', "\u{fffe}\u{feff}\$1;", trim($contents)) ?? $contents;
$contents = html_entity_decode($contents, ENT_NOQUOTES | ENT_SUBSTITUTE | ENT_HTML401, 'UTF-8');
$contents = str_replace("\u{fffe}\u{feff}", '&', $contents);
return $contents;
}
public static function copyAsXlsx() {
$infile = 'issue.2157.xml';
$contents = file_get_contents($infile);
$contents = self::unentity($contents);
$reader = new \PhpOffice\PhpSpreadsheet\Reader\Xml();
$spreadsheet = $reader->loadSpreadsheetFromString($contents);
// You should be able to do whatever you want to
// the spreadsheet object here.
$writer = new \PhpOffice\PhpSpreadsheet\Writer\Xlsx($spreadsheet);
$outfile = 'issue.2157.unentity.xlsx';
$writer->save($outfile);
echo "saved $outfile\n";
}
A case could be made for adding the unentity
code to Xml Reader, either automatically or as an opt-in switch. I don't know that I'm prepared to make that case, but I'll leave this open and see if someone else is. Note that, in addition to the html entities problem, the supplied file isn't treated as valid Xml because it has whitespace preceding the <?xml
declaration; that is why unentity
has a trim
function.
For some reason, Security Scanner has a callback function. This makes it just too easy to implement 'unentity' as described above. I will open a PR to do so.
This is:
Not even sure if this is a bug report or a feature request as I could not find if this type of XLS file is supported.
What is the expected behavior?
PhpSpreadsheet to read the file as I can open it just fine in OpenOffice.
What is the current behavior?
Trying to read the file I get the error
The filename products.xls is not recognised as an OLE file
trying to read it asXls
. Trying to read it asXml
because the content is just XML, generates a lot of PHP warnings:What are the steps to reproduce?
The file used in this example: ProductsAPI-14-06-2021.xls
Please provide a Minimal, Complete, and Verifiable example of code that exhibits the issue without relying on an external Excel file or a web server:
This is trying to read it as XLS:
This is trying to read it as XML:
I have also tried it with this code:
That results in the same output as trying to read it as an XML file above.
Which versions of PhpSpreadsheet and PHP are affected?
1.18.0