Open remicollet opened 4 years ago
Can you tell me what these tests were exactly and how I can reproduce the problem? On which OS an with which zip library was this? Note: changes between 0.2.26 and 0.2.27 were only in the reader.
Can you tell me what these tests were exactly and how I can reproduce the problem?
To reproduce, from https://github.com/viest/php-ext-xlswriter (with php 7 devel installed)
$ phpize
$ ./configure --with-libxlsxio=/path/to/libxlsio/install/dir --enable-reader
$ make
$ make test
Failing test are
On which OS an with which zip library was this?
On Fedora 31, using libzip 1.7.1
I did not write the PHP library, so I need to reproduce this in C to get to the bottom of this. Can you send me the tutorial.xlsx file used in the tests? I couldn't find it right away.
Thanks. Now I can reproduce the issue when using XLSXIOREAD_SKIP_EMPTY_CELLS.
Great...
If you have a fix, I can try it before a new release.
Okay, I will let you know when I committed new code.
More work than I thought. Not finished yet, but if you like you can already test master if it no longer crashes/hangs.
Using a6ab8e1d061ac07b353631bc64598f71c19c6cf3 PHP extension test suite passes again.
Final issues resolved. It took me a while to make sure all combinations of XLSXIOREAD_SKIPEMPTY* worked well. If it still passes I will release the changes.
Looks like f49a3d1245c854364f66e658751ffb27c65d7de9 breaks something else
TEST 67/84 [tests/open_xlsx_next_row_with_data_type_date_array_index.phpt]
========DIFF========
007+ array(2) {
008+ [0]=>
009+ string(0) ""
010+ [1]=>
011+ string(0) ""
007- array(0) {
========DONE========
Test file used by this test tutorial.xlsx
Can't reproduce in C. What exactly is being tested here?
Can't reproduce in C. What exactly is being tested here?
Not clear to me
@viest can you please help on this ?
IIUC, second row have no value, but 2 empty rows are returned (xlsxioread_sheet_next_cell
)
To clarify, this is the correct behavior. XLSX I/O assumes the Excel files are in a table format containing header rows. So when nothing is skipped an empty line will contain as many empty cells as there are cells in the first row.
To clarify, this is the correct behavior.
Ok, so fixing the tests of the ext side https://github.com/viest/php-ext-xlswriter/pull/283
When nothing is skipped, the empty row will contain as many empty cells as the first row.
This seems unreasonable, there is nothing in worksheet.xml, so you should not add cells.
The basic idea is to read tables from Excel files. The end-user doesn't care much if something is there or not if the cell is part of the table. Maybe another option can be added to ignore the number of columns defined on the first row, but so far nobody has ever resuested this.
but so far nobody has ever resuested this.
Perhaps because it was the old behavior with version <= 0.2.27 ? ;)
In the sheet1.xml
file, the second row does not declare any cells, so it should not return the same number of empty cells as the first row.
If you ignore the sheet.xml file and insert some dummy empty cells, the expenditure in terms of traversal performance will increase.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<dimension ref="A1:E3"/>
<sheetViews>
<sheetView tabSelected="1" workbookViewId="0"/>
</sheetViews>
<sheetFormatPr defaultRowHeight="15"/>
<sheetData>
<row r="1" spans="1:5">
<c r="B1" t="s">
<v>0</v>
</c>
</row>
<row r="3" spans="1:5">
<c r="A3" t="s">
<v>1</v>
</c>
<c r="E3" s="1">
<v>43726.6259375</v>
</c>
</row>
</sheetData>
<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>
Let me clarify a bit more the library was designed to read tables, which means the have header rows and then data below them.
+------+------+
| colA | colB |
+------+------+
| data | data |
+------+------+
| data | data |
+------+------+
In this case the table would look like:
+---+---+
| | 0 |
+---+---+
| | |
+---+---+
| 1 | | (with some extra data outside the table)
+---+---+
For the behaviour you're expecting an extra flag could be created to ignore the header line.
If the original idea was to return an equal number of cells based on the table header, you might also need a configuration that allows users to customize the row where the table header is located. In the actual environment, you may encounter many problems. In many cases, the first row is not the table header.
And this modification will destroy the existing application, which needs to be evaluated.
Okay, so in your opinion, would it be better to require a new flag to get the table behaviour and to default to not using the table behaviour?. Adding a function to skip the first N lines has crossed my mind too
Okay, so in your opinion, would it be better to require a new flag to get the table behaviour and to default to not using the table behaviour?.
IMHO: yes, as this will restore old behavior, and make the "table" one new and optional (avoid BC break)
Yes, keeping the API single responsibility helps to achieve more valuable functions, such as the function of php-ext-xlswriter: skip N lines.
To be clear about the change
xlsxioread_sheet_last_column_index
return 0 for not existing cellxlsxioread_sheet_next_cell
behavior is the same
And the PHP extension was used to ignore cell with 0 index.
For now the table behaviour (determining the number of columns based on header row) is only active when using the XLSXIOREAD_SKIP_EXTRA_CELLS flag. This is now in release 0.2.29.
IMHO everything looks consistent
XLSXIOREAD_SKIP_NONE
1
A:
B: Cost
2
3
A: viest
B:
C:
D:
E: 43726.6259375
XLSXIOREAD_SKIP_EMPTY_ROWS
1
A:
B: Cost
2
A: viest
B:
C:
D:
E: 43726.6259375
XLSXIOREAD_SKIP_EMPTY_CELLS
1
B: Cost
2
3
A: viest
E: 43726.6259375
XLSXIOREAD_SKIP_ALL_EMPTY
1
B: Cost
2
A: viest
E: 43726.6259375
XLSXIOREAD_SKIP_EXTRA_CELLS
1
A:
B: Cost
2
A:
B:
3
A: viest
B:
Notice: xlsxioread_sheet_last_row_index
when empty row 2 is skipped, shoudn't row 3 index be preserved (return 2 for now) as xlsxioread_sheet_last_column_index
do for coloumn index ?
xlsxioread_sheet_last_row_index
is the last row seen by the parser, but the skipping of an empty row is done when reading the first cell of a row. So before reading that first cell xlsxioread_sheet_last_row_index
will still show you the index of the last row that wasn't empty.
It's really bad, skip does not mean that he does not exist, so the index also needs to be updated.
As the design is to support streaming data the engine doesn't read ahead, so it doesn't know what it's going to skip.
The value xlsxioread_sheet_last_row_index
will be correct if you check it when the first cell of the row has been read.
The value xlsxioread_sheet_last_row_index will be correct if you check it when the first cell of the row has been read.
OK, But for now,
No skip:
1 A:
1 B: Cost
3 A: viest
3 B:
3 C:
3 D:
3 E: 43726.6259375
Skipping all
1 B: Cost
2 A: viest
2 E: 43726.6259375
When we expect
1 B: Cost
3 A: viest
3 E: 43726.6259375
The value xlsxioread_sheet_last_row_index will be correct if you check it when the first cell of the row has been read.
Using attached test code readxls.txt
I'm actually really happy to find this bug -- I'm hoping to use this library to read spreadsheets that don't have headers at all. The data will represent physical layouts, with codes in each cell to indicate what should be placed at that location. The first line will contain only one cell with a description string. It looks like the behavior I'm looking for will be supported with either SKIP_NONE or SKIP_EMPTY_CELLS.
The text in the README.md might need updating, though:
intended to process .xlsx files as a data table, which assumes the following:
assumes the first row contains header names
assumes the next rows contain values in the same columns as where the header names are supplied
@brechtsanders any idea about to fix this ?
Looks like the change in 0.2.27 creates a behavior change resulting in infinite loop.
Found trying to run https://pecl.php.net/package/xlswriter test suite against 0.2.27 (was ok against 0.2.26)
2 tests are failing with process timeout
@viest for information, if you can also have a look