doy / spreadsheet-parsexlsx

parse XLSX files
http://metacpan.org/release/Spreadsheet-ParseXLSX
27 stars 35 forks source link

Incorrect cell values due to phonetic data #72

Open tanyo13 opened 7 years ago

tanyo13 commented 7 years ago

When a user inputs a Japanese string with an ordinary input method on Microsoft Excel, some phonetic information is silently added and stored in the file (and it will be used as hints for sorting etc). Spreadsheet-ParseXLSX-0.27 treats (a part of) the phonetic information as a part of the string stored in a cell. As a result, it returns incorrect values for Japanese string cells in most cases.

Example (wb0.xlsx, created by Microsoft Excel 2013). The string in cell A1 is represented (by pretty-printing sharedString.xml) as:

  <si>
    <t>日本語あいうえお</t>
    <rPh eb="3" sb="0">
      <t>ニホンゴ</t>
    </rPh>
    <phoneticPr fontId="1"/>
  </si>

For this cell, the user typed string "日本語あいうえお" (and Excel shows this string in the cell), but ParseXLSX gives "日本語あいうえおニホンゴ" as the value of the cell. (Phonetic information "ニホンゴ" was added by Excel)

By the way, the problem does not appear when a cell stores character level formatting information using tag, as in cell A2:

  <si>
    <r>
      <rPr>
        <i/>
        <sz val="11"/>
        <color theme="1"/>
        <rFont val="MS Pゴシック"/>
        <family val="3"/>
        <charset val="128"/>
        <scheme val="minor"/>
      </rPr>
      <t>日本</t>
    </r>
    <r>
      <rPr>
        <sz val="11"/>
        <color theme="1"/>
        <rFont val="MS Pゴシック"/>
        <family val="2"/>
        <scheme val="minor"/>
      </rPr>
      <t>語あいうえお</t>
    </r>
    <rPh eb="3" sb="0">
      <t>ニホンゴ</t>
    </rPh>
    <phoneticPr fontId="1"/>
  </si>

In this case, ParseXLSX correctly ignores phonetic information "ニホンゴ" and returns "日本語あいうえお" as the value of the cell. Please confirm the behaviours stated above by running an attached script.

I would propose the following fix (generated with "-w", also attached one without the option), but I am not knowledgable in XML tools, so feel free to apply your own fix. The fix here asuumes that: (1) A possible child of <si> is either <r>, <t>, <rPh> or <phoneticPr>. (2) Any contents in <rPh> and <phoneticPr> tags should be ignored. Assumption (1) comes from https://msdn.microsoft.com/en-us/library/documentformat.openxml.spreadsheet.sharedstringitem(v=office.15).aspx but I do not konw whether all xlsx files follow that. I think assumption (2) is valid for Japanese strings, but do not know for other languages.

--- ParseXLSX.pm-orig   2017-01-05 08:27:22.246634500 +0900
+++ ParseXLSX.pm    2017-01-05 10:06:10.826729900 +0900
@@ -469,9 +469,12 @@

     my $string_text = '';
     my @rich_font_by_cell;
-    my @nodes_r = $si->find_nodes('.//s:r');
-    if (@nodes_r > 0) {
-        for my $chunk (map { $_->children } @nodes_r) {
+    for my $subnode ($si->children) {
+        if ($subnode->name eq 's:t') {
+            $string_text .= $subnode->text;
+        }
+        elsif ($subnode->name eq 's:r') {
+            for my $chunk ($subnode->children) {
             my $string_length = length($string_text);
             if ($chunk->name eq 's:t') {
                 if (!@rich_font_by_cell) {
@@ -531,7 +534,9 @@
         }
     }
     else {
-        $string_text = join '', map { $_->text } $si->find_nodes('.//s:t');
+            # $subnode->name is either 's:rPh' or 's:phoneticPr'
+            # We ignore phonetic information and do nothing.
+        }
     }

     return (