dhatim / fastexcel

Generate and read big Excel files quickly
Other
644 stars 116 forks source link

xlsx with new Lines (CR+LF) : Parsing error #392

Closed olivbrau closed 2 months ago

olivbrau commented 4 months ago

Hi, I'm trying to open an xslx file. I get this error : Current state not START_ELEMENT, END_ELEMENT or ENTITY_REFERENCE This error lies in OPCPackage.extractFormat() : In the while loop, the instruction if ("numFmt".equals(reader.getLocalName())) { creates en error because the current token type of the reader is CHARACTERS, and then getLocalName() create an exception. It is because my xlsx file has new lines (CRL+LF) in the styles.xml and so, when we are in the loop that comes after getting the <cellXFS>, insideCellXfsis true, but we can't call getLocalName() since there are CHARACTERS (the new line). I hope I'm clear in spite of my bad english.

To get this error, take a valid xlsx file, and put CR+LF after each tag in the styles.xml Excel doesn't put CRLF in the xml it creates, but my xlsx file comes from another tools which put theses CRLF. I think that xlsx format doesn't forbid this, so fastexcel should consider this possibility.

Another side effect of adding CRLF after each tag, is when fastexcel reads sharedStrings.xml : instead of keeping the text in <t> tags, it retrieves all the text between <si> tags, so including all the CRLF. As a consequence, the strings returned are not the good one (but it doesn't crash the reading, compare to the styles.xml problem explained above).

meiMingle commented 2 months ago

I cannot reproduce this problem locally. Can you upload a copy of the xlsx file that caused the error?

olivbrau commented 2 months ago

ExcelFileWithCRLF.xlsx Sure. Can you try with this one ? (don't open-save the file with excel, because excel automatically delete all the CRLF)

olivbrau commented 2 months ago

Hello, Thanks a lot for the fix. However, I'm wondering if this fix also the 2nd bug I mentioned : in the sharedStrings.xml, if there are CRLF after <si> tag, the shared string read is false (there are no exception thrown however) : the string contains CRLF, because readUpTo() reads every characters between <si> and </si>, including the CRLF before <t> and after </t>

ex. if the string is written like this (like in the excel file I uploaded) : `CRLF

MyStringCRLF ` -->`readUpTo()` should read only what is between `` and ``
meiMingle commented 2 months ago

I may not have noticed the second issue you mentioned, I will take the time to look into it. But probably not soon because currently I have some things to do outside of the coding world.

meiMingle commented 2 months ago

This should be fully fixed since #419 was merged,don't you think? @olivbrau