brechtsanders / xlsxio

XLSX I/O - C library for reading and writing .xlsx files
MIT License
417 stars 112 forks source link

Implement xlsxioread_sheet_next_cell_struct with cell types #47

Open Paxa opened 5 years ago

Paxa commented 5 years ago

I took suggested implementation from https://github.com/brechtsanders/xlsxio/issues/12 and add new method to keep existing API same

Fixes https://github.com/brechtsanders/xlsxio/issues/12

Now I found similar fork https://github.com/matejd/xlsxio/commit/a0b6a0b95e16ffc82f158eb3f64a268c8fee8bf5

Travis status: Build Status (https://travis-ci.org/Paxa/xlsxio)

How to use xlsxioread_sheet_next_cell_struct:

  xlsxioreader xlsxioread;

  if ((xlsxioread = xlsxioread_open(filename)) == NULL) {
    fprintf(stderr, "Error opening .xlsx file\n");
    return 1;
  }

  xlsxioread_cell value;
  const char *typeNames[] = { "none", "value", "boolean", "string", "date" };

  printf("Contents of first sheet:\n");
  xlsxioreadersheet sheet = xlsxioread_sheet_open(xlsxioread, NULL, XLSXIOREAD_SKIP_EMPTY_ROWS);
  while (xlsxioread_sheet_next_row(sheet)) {
    while ((value = xlsxioread_sheet_next_cell_struct(sheet)) != NULL) {
      /*
        value is a struct:
            size_t row_num;
            size_t col_num;
            char* data;
            cell_type_enum cell_type;
            char* number_fmt; - only for numbers and dates
      */
      printf("row_num = %zu \n", value->col_num);
      printf("col_num = %zu \n", value->row_num);
      printf("type = %s\n", typeNames[value->cell_type]);
      if (value->number_fmt) {
        printf("number_format = %s value = %s\n", value->number_fmt);
      }
      XML_Char_printf(X("value = %s\n"), value->data);
      free(value);
    }
    printf("\n");
  }
  xlsxioread_sheet_close(sheet);
  xlsxioread_close(xlsxioread);

How to use xlsxioread_debug_internals: it prints internal structures with extracted number formats (just for debugging)

  xlsxioreader xlsxioread;

  if ((xlsxioread = xlsxioread_open(filename)) == NULL) {
    fprintf(stderr, "Error opening .xlsx file\n");
    return 1;
  }
  xlsxioread_debug_internals(xlsxioread);
  xlsxioread_close(xlsxioread);

Output:

XLSXIO_READ DEBUG - test_dates.xlsx
number_formats:
  number_format 0 - 0
  number_format 1 - 164
  number_format 2 - 165
date_formats:
  date_format id = 164   isDate = 0  fmt = #,##0.00
  date_format id = 165   isDate = 1  fmt = [$-409]m/d/yy h:mm AM/PM;@
Paxa commented 5 years ago

Do you know how to install specify minizip path in windows? Travis builds works only for windows and mac now

brechtsanders commented 5 years ago

You can tell XLSX I/O buid to use a specific Minizip install location with the CMake parameter -DMINIZIP_DIR:PATH=

I have started reviewing some of your code, and already have several remarks:

Paxa commented 5 years ago

I not sure about licenses, will it be better if we download utarray.h and uthash.h in a make file? utarray.h and uthash.h is only header files with macros, not sure if we can compile it as a library Do you have any suggestions for alternatives?

I don't think we gonna have many styles, so performance not so critical. And in our use case we just need append and get, may be implement own linked list for that

Paxa commented 5 years ago

What should we do with malloc? check if (res == NULL) { return NULL }?

Paxa commented 5 years ago

I deleted files inside build folder, is it ok? (I don't know what those files was for) btw I made all these changes to create ruby library https://github.com/Paxa/fast_excel_reader according to my benchmark already got 9-15 times improvement

brechtsanders commented 5 years ago

I usually do:

if ((s = (char*)malloc(wantedsize)) == NULL)
  return NULL; //or some error code

that way you don't assume malloc worked causing the application to crash or even to allow hacker to exploit this.

The build folder contains the Code::Blocks project files, which I use to develop and debug on Windows. Please don't delete it.

uthash is a package in most package systems (e.g. uthash-dev apt package on Debian Linux). I would just add a CMake parameter to locate it for systems that don't have it in their package manager (like Windows). But can you tell me why exactly we need this? I'm not keen to add dependancies...

Question: Do your changes look at the type on cell-level, or also to colomn or row settings in case it's not defined on cell level?

Paxa commented 5 years ago

Will try to make malloc as suggested and bring back build/ folder

But can you tell me why exactly we need this? I'm not keen to add dependancies...

it was in code uploaded in #12 and I keep it. It is used to find relations between cell and number format:

Example styles.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <numFmts count="2">
    <numFmt numFmtId="164" formatCode="#,##0.00"/>
    <numFmt numFmtId="165" formatCode="[$-409]m/d/yy h:mm AM/PM;@"/>
  </numFmts>
  ...
  <cellStyleXfs count="1">
    <xf numFmtId="0" fontId="0" fillId="0" borderId="0"/>
  </cellStyleXfs>
  <cellXfs count="3">
    <xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/>
    <xf numFmtId="164" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"/>
    <xf numFmtId="165" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"/>
  </cellXfs>
  <cellStyles count="1">
    <cellStyle name="Normal" xfId="0" builtinId="0"/>
  </cellStyles>
  <dxfs count="0"/>
  <tableStyles count="0" defaultTableStyle="TableStyleMedium9" defaultPivotStyle="PivotStyleLight16"/>
</styleSheet>

Example sheet.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <dimension ref="A1:G5"/>
  ...
  <cols>
    <col min="2" max="2" width="20.7109375" style="1" customWidth="1"/>
    <col min="3" max="3" width="20.7109375" style="2" customWidth="1"/>
  </cols>
  <sheetData>
    <row r="1" spans="1:7">
      <c r="A1" t="s">
        <v>0</v>
      </c>
      <c r="B1" s="1">
        <v>1234</v>
      </c>
      <c r="C1" s="2">
        <v>43537.51777806081</v>
      </c>
      <c r="D1">
        <v>1234</v>
      </c>
      <c r="E1" t="b">
        <v>1</v>
      </c>
      <c r="F1">
        <f>(B1 * 2)</f>
        <v>0</v>
      </c>
      <c r="G1" t="s">
        <v>1</v>
      </c>
    </row>
    <row r="5" s="2" customFormat="1" ht="20" customHeight="1"/>
  </sheetData>
  <hyperlinks>
    <hyperlink ref="G1" r:id="rId1"/>
  </hyperlinks>
  <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>

In example files

  1. <c r="C1" s="2"> means cell have style = 2
  2. style number 2 is <xf numFmtId="165" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"/>, it reference to number format 165
  3. which is <numFmt numFmtId="165" formatCode="[$-409]m/d/yy h:mm AM/PM;@"/>

From format code we can recognize it's a date, not just number. This code also consider builtin number formats 14 - 22.

If <numFmts ...> and <cellXfs ...> will always have count attribute then we can read it and allocate simple array of int for style-format relations and struct for number formats, or event array of style_id->number format struct (should be fast and simple).


For now only on cell-level, I tried to create file with row format and column format, but when extract xml files, and each cell will have style reference e.g s="2" If you have example of file with column format or row format - I will try to make it work

Btw where do you find format specification?

brechtsanders commented 5 years ago

Styles are numbered. That's why I was wondering why hashing was used. A (sparse) array structure would do for this. But I guess it will do for now. Worst case I witch the license from MIT to BSD.

How much of the API is modified? I'm assuming your changes are not very backwards compatible?

Paxa commented 5 years ago

I changed code to use arrays of structs, but not sure if I implemented it correctly, especially memory allocation and memory releasing. So not using utarray and uthash anymore

How much of the API is modified? I'm assuming your changes are not very backwards compatible?

Should be totally backwards compatible, changed xlsxioread_sheet_next_cell to xlsxioread_sheet_next_cell_struct and added xlsxioread_sheet_next_cell that returns just string

What is your preferences for variable names? camelCase or snake_case? e.g. handle->numberFormats or handle->number_formats?

brechtsanders commented 5 years ago

Wow, you're quick! I avoid camelCase in pure C. I stick with lower case and use underscore to separate things that are separate (e.g. get_numberformat).

Paxa commented 5 years ago

lower case and use underscore to separate things that are separate (e.g. get_numberformat).

done

Paxa commented 5 years ago

may be you know how to install minizip on windows? or is windows build is important at all?

brechtsanders commented 5 years ago

For me MinGW in Windows is my primary development platform, but I also make sure it builds on Debian Linux and macOS. MiniZip 1.2.8 (downloaded from http://www.gaia-gis.it/gaia-sins/dataseltzer-sources/) builds fine on Windows with MinGW using CMake but for me it needed this patch to avoid WinRT specifics:

patch -ulbf iowin32.c << EOF
--- iowin32.c  2013-04-29 00:57:12 +0200
+++ iowin32.c  2014-08-09 23:11:20 +0200
@@ -28,3 +28,3 @@

-#if defined(WINAPI_FAMILY_PARTITION) && (!(defined(IOWIN32_USING_WINRT_API)))
+#if defined(WINAPI_FAMILY_PARTITION) && (!(defined(IOWIN32_USING_WINRT_API))) && !defined(__MINGW32__)
 #if WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_APP)
EOF

If you need binaries I could send them to you if you like.

AlexanderBurtasov commented 5 years ago

Hi, all. (Excuse my poorly-bad English). I'm trying to use this PR in my project and facing with problem: Date-format cells is recognised as "none", not "date".

Paxa commented 5 years ago

Do you have example of a file?

AlexanderBurtasov commented 5 years ago

mixed.xlsx

Paxa commented 5 years ago

Thank you @AlexanderBurtasov for founding a bug. I fixed it

AlexanderBurtasov commented 5 years ago

17596_reduced.xlsx Hi again. PR completely falls while loading this file. Falls while calling xlsxioread_open.

Paxa commented 5 years ago

@AlexanderBurtasov fixed