brechtsanders / xlsxio

XLSX I/O - C library for reading and writing .xlsx files
MIT License
417 stars 112 forks source link

Leading and trailing blanks lost when opening in Excel. #57

Closed cscholtes closed 4 years ago

cscholtes commented 4 years ago

We used xlsxio to create a xslx with a text cell containing some leading or trailing blanks (e.g.: " a b "). Reading it back with xlsxio works as expected and delivers the exact same string (" a b ") for that text cell. But, opening it in Excel trims the blanks off of that cell (leaving only "a b"). Creating the same content using Excel, the content is not changed after saving and loading (always yielding " a b "). Apparently the result can even be read by xlsxio, also producing the correct content (" a b ").

We found out that Excel automatically adds an attribute xml:space="preserve" to text-nodes containing leading or trailing blanks. We then changed in function xlsxiowrite_add_cell_string in file xlsxio\lib\xlsxio_write.c the line

write_cell_data(handle, NULL, "<c t=\"inlineStr\"" STYLE_ATTR(STYLE_TEXT) COLNRTAG "><is><t>", "</t></is></c>", "%s", value);

to

write_cell_data(handle, NULL, "<c t=\"inlineStr\"" STYLE_ATTR(STYLE_TEXT) COLNRTAG "><is><t xml:space=\"preserve\">", "</t></is></c>", "%s", value);

That appears to have fixed the issue.

We 'd be glad if you fixed that in the repository, too. (Maybe you could

or something like that.)

Kind regards.

cscholtes commented 4 years ago

Automating the addition of the attribute only for problematic cells, we encountered some surprises. The following appears to work for most cases:

Before the definition of function xlsxiowrite_add_cell_string add:

static int has_leading_or_trailing_blanks(const char* value)
{
  if (NULL == value)
  {
    return 0;
  }
  if (isspace(*(unsigned char*)value))
  {
    return -1;
  }
  if ('\0' == *value)
  {
    return 0;
  }
  size_t Length = strlen(value);
  /*
  The cast to unsigned char* is required to avoid a crash on platforms where char is signed.
  The crash occurs when a text cell contains a multi byte character
  such as ß in utf8 (c3 9f) at its end (as in "...ß").
  */
  return isspace(((unsigned char*)value)[Length - 1]);
}

Replace the line from the original post with the following block:

{
  if (has_leading_or_trailing_blanks(value))
  {
    write_cell_data(handle, NULL, "<c t=\"inlineStr\"" STYLE_ATTR(STYLE_TEXT) COLNRTAG "><is><t xml:space=\"preserve\">", "</t></is></c>", "%s", value);
  }
  else
  {
    write_cell_data(handle, NULL, "<c t=\"inlineStr\"" STYLE_ATTR(STYLE_TEXT) COLNRTAG "><is><t>", "</t></is></c>", "%s", value);
  }
}

The checks for leading and trailing spaces are obviously not sufficient for multi byte characters that count as spaces (if something like that exists).

A performance drawback of this automatic approach is that it has to scan all strings without a leading space to discover their length...

brechtsanders commented 4 years ago

Thanks for trying all this. But shouldn't I just always add xml:space="preserve"? The XMLs are compressed anyway, so a recurring phrase like that won't bloat the xlsx file...

cscholtes commented 4 years ago

That sounds reasonable, especially considering the overhead involved with checking for the last character.

On the other hand, when I look at the strings that Excel stores (in shared strings), only those texts have the attribute that need it. I don't know where else this would matter (Perhaps, the auto trimming behaviour of cells without this attribute is desired in some circumstances?). And then, for large tables of (small) texts (that usually wouldn't need the attribute) avoiding the attribute might still make a considerable difference in size. So we tried to stick as close to the observed Excel behaviour as possible.

You might consider offering different flavours of the text adding function, simultaneously:

brechtsanders commented 4 years ago

I have now added xml:space="preserve" to header and text cells to preserve spacing. In xlsxio_write.c I defined a macro need_space_preserve_attr(value) which currently always returns non-zero, but in the future we could change this to detect if spase preservation is needed. This change will cause the sheet XML file to grow in size, but since it's a recurring pattern the ZIP compression will compress this well. For the test file generated by the example application this means 369 bytes extra (from 29435 to 29804 bytes = 1,25% increase).

brechtsanders commented 4 years ago

Fixed in release 0.2.25