doy / spreadsheet-parsexlsx

parse XLSX files
http://metacpan.org/release/Spreadsheet-ParseXLSX
Unexpected (un)escaping according to ooxml-spec #81

Open ft-lie opened 6 years ago

ft-lie commented 6 years ago

When decoding some xlsx files, I get unexpected _x000D_ sequences in the cells.

According to https://msdn.microsoft.com/en-us/library/ff534667(v=office.12).aspx these are escaped characters: _xHHHH_ stands for the character with hexadecimal code HHHH, so _x000D_ is a carriage return.

So text should be filtered with something like

$val =~ s/_x([0-9a-fA-F]{4})_/chr(hex($1))/eg if $long_type eq 'Text';

just before the Cell is created. Or should $string_text be filtered instead?
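
To make the proposal easier to try out, here is a minimal standalone sketch of that filter. The helper name unescape_ooxml is hypothetical and not part of Spreadsheet::ParseXLSX, and where it should be called inside the module is still an open question. The sample string is taken from the sharedStrings.xml excerpt in this report.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Spreadsheet::ParseXLSX.
# Replaces each _xHHHH_ escape (exactly four hex digits) with the
# character it encodes; all other text is left untouched.
sub unescape_ooxml {
    my ($text) = @_;
    return $text unless defined $text;
    $text =~ s/_x([0-9A-Fa-f]{4})_/chr(hex($1))/ge;
    return $text;
}

# Sample from the sharedStrings.xml excerpt: an escaped CR followed by a real LF.
my $raw     = "3 retters vegetar_x000D_\n\"baccarat\"";
my $decoded = unescape_ooxml($raw);

# Make the control characters visible before printing.
(my $visible = $decoded) =~ s/\r/\\r/g;
$visible =~ s/\n/\\n/g;
print "$visible\n";    # prints: 3 retters vegetar\r\n"baccarat"
```

As far as I understand the escaping scheme, a literal _x000D_ in a cell would be written as _x005F_x000D_, with the leading underscore itself escaped. A single left-to-right substitution like the one above turns that back into the literal text _x000D_ without decoding it a second time, because the match for _x005F_ consumes the underscore the next escape would need. Whether the decoded carriage return should then be kept or folded into the following newline is a separate decision this sketch does not make.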

Example from a sharedStrings.xml part:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="584" uniqueCount="99"><si><t>Person</t></si><si><t>Email</t></si><si><t>Mobil</t></si><si><t>Kommentarer</t></si><si><t>Bekrefter _x000D_
bindende _x000D_
påmelding</t></si><si><t>Revy</t></si><si><t>3 retters middag _x000D_
'imiti'</t></si><si><t>3 retters vegetar_x000D_
"baccarat"</t></si>