brechtsanders / xlsxio

XLSX I/O - C library for reading and writing .xlsx files
MIT License
417 stars 112 forks source link

[0.2.27] behavior change in EMPTY management #75

Open remicollet opened 4 years ago

remicollet commented 4 years ago

Looks like the change in 0.2.27 creates a behavior change resulting in infinite loop.

Found trying to run https://pecl.php.net/package/xlswriter test suite against 0.2.27 (was ok against 0.2.26)

2 tests are failing with process timeout

@viest for information, if you can also have a look

brechtsanders commented 4 years ago

Can you tell me what these tests were exactly and how I can reproduce the problem? On which OS an with which zip library was this? Note: changes between 0.2.26 and 0.2.27 were only in the reader.

remicollet commented 4 years ago

Can you tell me what these tests were exactly and how I can reproduce the problem?

To reproduce, from https://github.com/viest/php-ext-xlswriter (with php 7 devel installed)

$ phpize
$ ./configure --with-libxlsxio=/path/to/libxlsio/install/dir    --enable-reader 
$ make
$ make test

Failing test are

On which OS an with which zip library was this?

On Fedora 31, using libzip 1.7.1

brechtsanders commented 4 years ago

I did not write the PHP library, so I need to reproduce this in C to get to the bottom of this. Can you send me the tutorial.xlsx file used in the tests? I couldn't find it right away.

remicollet commented 4 years ago

tutorial.xlsx

brechtsanders commented 4 years ago

Thanks. Now I can reproduce the issue when using XLSXIOREAD_SKIP_EMPTY_CELLS.

remicollet commented 4 years ago

Great...

If you have a fix, I can try it before a new release.

brechtsanders commented 4 years ago

Okay, I will let you know when I committed new code.

brechtsanders commented 4 years ago

More work than I thought. Not finished yet, but if you like you can already test master if it no longer crashes/hangs.

remicollet commented 4 years ago

Using a6ab8e1d061ac07b353631bc64598f71c19c6cf3 PHP extension test suite passes again.

brechtsanders commented 4 years ago

Final issues resolved. It took me a while to make sure all combinations of XLSXIOREAD_SKIPEMPTY* worked well. If it still passes I will release the changes.

remicollet commented 4 years ago

Looks like f49a3d1245c854364f66e658751ffb27c65d7de9 breaks something else

TEST 67/84 [tests/open_xlsx_next_row_with_data_type_date_array_index.phpt]
========DIFF========
007+ array(2) {
008+   [0]=>
009+   string(0) ""
010+   [1]=>
011+   string(0) ""
007- array(0) {
========DONE========
remicollet commented 4 years ago

Test file used by this test tutorial.xlsx

brechtsanders commented 4 years ago

Can't reproduce in C. What exactly is being tested here?

remicollet commented 4 years ago

Can't reproduce in C. What exactly is being tested here?

Not clear to me

@viest can you please help on this ?

IIUC, second row have no value, but 2 empty rows are returned (xlsxioread_sheet_next_cell)

brechtsanders commented 4 years ago

To clarify, this is the correct behavior. XLSX I/O assumes the Excel files are in a table format containing header rows. So when nothing is skipped an empty line will contain as many empty cells as there are cells in the first row.

remicollet commented 4 years ago

To clarify, this is the correct behavior.

Ok, so fixing the tests of the ext side https://github.com/viest/php-ext-xlswriter/pull/283

viest commented 4 years ago

When nothing is skipped, the empty row will contain as many empty cells as the first row.

This seems unreasonable, there is nothing in worksheet.xml, so you should not add cells.

brechtsanders commented 4 years ago

The basic idea is to read tables from Excel files. The end-user doesn't care much if something is there or not if the cell is part of the table. Maybe another option can be added to ignore the number of columns defined on the first row, but so far nobody has ever resuested this.

remicollet commented 4 years ago

but so far nobody has ever resuested this.

Perhaps because it was the old behavior with version <= 0.2.27 ? ;)

viest commented 4 years ago

In the sheet1.xml file, the second row does not declare any cells, so it should not return the same number of empty cells as the first row.

If you ignore the sheet.xml file and insert some dummy empty cells, the expenditure in terms of traversal performance will increase.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
           xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
    <dimension ref="A1:E3"/>
    <sheetViews>
        <sheetView tabSelected="1" workbookViewId="0"/>
    </sheetViews>
    <sheetFormatPr defaultRowHeight="15"/>
    <sheetData>
        <row r="1" spans="1:5">
            <c r="B1" t="s">
                <v>0</v>
            </c>
        </row>
        <row r="3" spans="1:5">
            <c r="A3" t="s">
                <v>1</v>
            </c>
            <c r="E3" s="1">
                <v>43726.6259375</v>
            </c>
        </row>
    </sheetData>
    <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>
brechtsanders commented 4 years ago

Let me clarify a bit more the library was designed to read tables, which means the have header rows and then data below them.

+------+------+
| colA | colB |
+------+------+
| data | data |
+------+------+
| data | data |
+------+------+

In this case the table would look like:

+---+---+
|   | 0 |
+---+---+
|   |   |
+---+---+
| 1 |   |  (with some extra data outside the table)
+---+---+

For the behaviour you're expecting an extra flag could be created to ignore the header line.

viest commented 4 years ago

If the original idea was to return an equal number of cells based on the table header, you might also need a configuration that allows users to customize the row where the table header is located. In the actual environment, you may encounter many problems. In many cases, the first row is not the table header.

And this modification will destroy the existing application, which needs to be evaluated.

brechtsanders commented 4 years ago

Okay, so in your opinion, would it be better to require a new flag to get the table behaviour and to default to not using the table behaviour?. Adding a function to skip the first N lines has crossed my mind too

remicollet commented 4 years ago

Okay, so in your opinion, would it be better to require a new flag to get the table behaviour and to default to not using the table behaviour?.

IMHO: yes, as this will restore old behavior, and make the "table" one new and optional (avoid BC break)

viest commented 4 years ago

Yes, keeping the API single responsibility helps to achieve more valuable functions, such as the function of php-ext-xlswriter: skip N lines.

remicollet commented 4 years ago

To be clear about the change

xlsxioread_sheet_next_cell behavior is the same

And the PHP extension was used to ignore cell with 0 index.

brechtsanders commented 4 years ago

For now the table behaviour (determining the number of columns based on header row) is only active when using the XLSXIOREAD_SKIP_EXTRA_CELLS flag. This is now in release 0.2.29.

remicollet commented 4 years ago

IMHO everything looks consistent

XLSXIOREAD_SKIP_NONE

    1
      A: 
      B: Cost
    2
    3
      A: viest
      B: 
      C: 
      D: 
      E: 43726.6259375

XLSXIOREAD_SKIP_EMPTY_ROWS

    1
      A: 
      B: Cost
    2
      A: viest
      B: 
      C: 
      D: 
      E: 43726.6259375

XLSXIOREAD_SKIP_EMPTY_CELLS

    1
      B: Cost
    2
    3
      A: viest
      E: 43726.6259375

XLSXIOREAD_SKIP_ALL_EMPTY

    1
      B: Cost
    2
      A: viest
      E: 43726.6259375

XLSXIOREAD_SKIP_EXTRA_CELLS

    1
      A: 
      B: Cost
    2
      A: 
      B: 
    3
      A: viest
      B: 

Notice: xlsxioread_sheet_last_row_index when empty row 2 is skipped, shoudn't row 3 index be preserved (return 2 for now) as xlsxioread_sheet_last_column_index do for coloumn index ?

brechtsanders commented 4 years ago

xlsxioread_sheet_last_row_index is the last row seen by the parser, but the skipping of an empty row is done when reading the first cell of a row. So before reading that first cell xlsxioread_sheet_last_row_index will still show you the index of the last row that wasn't empty.

viest commented 4 years ago

It's really bad, skip does not mean that he does not exist, so the index also needs to be updated.

brechtsanders commented 4 years ago

As the design is to support streaming data the engine doesn't read ahead, so it doesn't know what it's going to skip. The value xlsxioread_sheet_last_row_index will be correct if you check it when the first cell of the row has been read.

remicollet commented 4 years ago

The value xlsxioread_sheet_last_row_index will be correct if you check it when the first cell of the row has been read.

OK, But for now,

No skip:

      1 A: 
      1 B: Cost
      3 A: viest
      3 B: 
      3 C: 
      3 D: 
      3 E: 43726.6259375

Skipping all

      1 B: Cost
      2 A: viest
      2 E: 43726.6259375

When we expect

      1 B: Cost
      3 A: viest
      3 E: 43726.6259375

The value xlsxioread_sheet_last_row_index will be correct if you check it when the first cell of the row has been read.

Using attached test code readxls.txt

myk002 commented 4 years ago

I'm actually really happy to find this bug -- I'm hoping to use this library to read spreadsheets that don't have headers at all. The data will represent physical layouts, with codes in each cell to indicate what should be placed at that location. The first line will contain only one cell with a description string. It looks like the behavior I'm looking for will be supported with either SKIP_NONE or SKIP_EMPTY_CELLS.

The text in the README.md might need updating, though:

intended to process .xlsx files as a data table, which assumes the following:
  assumes the first row contains header names
  assumes the next rows contain values in the same columns as where the header names are supplied
remicollet commented 4 years ago

@brechtsanders any idea about to fix this ?