PHPOffice / PhpSpreadsheet

A pure PHP library for reading and writing spreadsheet files
https://phpspreadsheet.readthedocs.io
MIT License
13.35k stars 3.47k forks source link

insertNewRowBefore is more and more slow for large files #2509

Closed momala454 closed 2 years ago

momala454 commented 2 years ago

This is:

- [x ] a bug report
- [ ] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

What is the expected behavior?

Adding new rows is always the same speed

What is the current behavior?

insertNewRowBefore to add new rows is more and more slow depending on the number of existing rows. With 0 rows, it takes 0 seconds. With 1000 it takes 0.082s, with 2000 it takes 0.166s

What are the steps to reproduce?

with this minimal code the time is different but the same behaviour occurs.

<?php

require __DIR__ . '/vendor/autoload.php';

$spreadsheet = new \PhpOffice\PhpSpreadsheet\Spreadsheet();
$worksheet = $spreadsheet->getActiveSheet();

$i = 3000;
while ($i-- > 0) {
    $time = microtime(true);
    $newRowNumber = $worksheet->getHighestRow() +1;
    if ($newRowNumber === 2 && !$worksheet->cellExists('A1')) {
        // file is empty
        $newRowNumber = 1;
    }
    $time2 = microtime(true);
    $worksheet->insertNewRowBefore($newRowNumber);
    $time3 = microtime(true);
    $worksheet->fromArray(['a', 'b', 'c', 'd'], null, 'A'.$newRowNumber);

    if ($i % 500 === 0) {
        echo 'line '.$newRowNumber.' '.round($time3 - $time2, 3)."\n";
    }
}

output line 500 0.007 line 1000 0.013 line 1500 0.021 line 2000 0.027 line 2500 0.034 line 3000 0.042

Which versions of PhpSpreadsheet and PHP are affected?

1.21.0 php 8.1

momala454 commented 2 years ago

if I double the number of columns written, it double the time to insertNewRowBefore

line 500 0.013 line 1000 0.027 line 1500 0.042 line 2000 0.057 line 2500 0.072 line 3000 0.087

momala454 commented 2 years ago

real world example : it takes 18hours to write 38000 lines to a xlsx file

MarkBaker commented 2 years ago

Adding new rows is always the same speed

This is never going to happen. Inserting/Removing Rows/Columns is not an O(1) action, it's doing a lot of work internally. Besides shuffling the index for the Cell Collection, It also has to check every cell and see if it needs to adjust formulae or range references for things like print areas and autofilters

Consider a worksheet with just 5 rows:

And you insert a new row before row 4. First of all, the indexes in the Cell Collection need updating so that cell A5 becomes cell A6, and the code also needs to check if cell A4 exists to re-index that as cell A5. Every cell that is in row 4+ needs to be re-indexed. That's an O(n) task, where n is the number of cells that exist on/below the row that you're inserting. It also needs to check the content of every cell to see if any contain formulae; and if so they also need adjusting so that =SUM($A$2:$A4) will become =SUM($A$2:$A5), etc. That's also an O(n) task, where n is the number of cells in the worksheet. Then consider that other worksheets might also have formulae that reference the worksheet where you're doing the insert. Then it also needs to check merge ranges, print areas, row dimensions, data validations, autofilters, conditional styles, and other related elements of the worksheet, adjusting them as well if necessary.

And it needs to do that for every column as well... that can quickly become a lot of cell checking in larger worksheets. Re-indexing is only needed for cells below the row you're inserting; but checking/adjusting formulae needs to be done for all cells in the spreadsheet. If you insert near the bottom of your worksheet, the re-indexing is easier because there are fewer rows that need re-indexing; but formula adjustment is still constant.

If you then do another insert/remove; it has to redo all that work again. That's why it's better to avoid doing multiple inserts if possible, and do a single insert for multiple rows when you can.

If you need to insert 1000 new rows, then doing:

for ($i = 0; i < 1000; ++$i) {
   $worksheet->insertNewRowBefore($newRowNumber);
}

will repeat all that work 1000 times

Doing:

$worksheet->insertNewRowBefore($newRowNumber, 1000);

Only does that work once.

And the more cells/rows/columns that you have in a worksheet, the more cells/rows/columns need checking, and the longer it will take whenever you do an insert or a removal.

momala454 commented 2 years ago

i don't know how many rows i need to add. I'm reading info and writing them in the excel file as needed. I'm always adding rows at the end, so here i don't need to reindex. My document doesn't have any formula or whatever. Is it possible that you keep an index or boolean that know that we don't need to check for all of the things you're currently checking to speed up ?

Like arrays of indexes that contains formulas, or boolean that says if we have any formula, and same for data validation, filler etc ?

Or do you have any solution to speed up things ? I don't know how many rows i need to insert in advance. I could add rows by 1000 and increase the number of rows only when i've wrote 1000 lines, but that's a hack and i may need to delete empty rows when i finished writing the file.

I've tested the delays, and here is what takes the most time (1.4s)

while ($coordinate = array_pop($allCoordinates)) {
            $cell = $worksheet->getCell($coordinate);
            $cellIndex = Coordinate::columnIndexFromString($cell->getColumn());

            if ($cellIndex - 1 + $numberOfColumns < 0) {
                continue;
            }

            // New coordinate
            $newCoordinate = Coordinate::stringFromColumnIndex($cellIndex + $numberOfColumns) . ($cell->getRow() + $numberOfRows);

            // Should the cell be updated? Move value and cellXf index from one cell to another.
            if (($cellIndex >= $beforeColumn) && ($cell->getRow() >= $beforeRow)) {
                // Update cell styles
                $worksheet->getCell($newCoordinate)->setXfIndex($cell->getXfIndex());

                // Insert this cell at its new location
                if ($cell->getDataType() == DataType::TYPE_FORMULA) {
                    // Formula should be adjusted
                    $worksheet->getCell($newCoordinate)
                        ->setValue($this->updateFormulaReferences($cell->getValue(), $beforeCellAddress, $numberOfColumns, $numberOfRows, $worksheet->getTitle()));
                } else {
                    // Formula should not be adjusted
                    $worksheet->getCell($newCoordinate)->setValueExplicit($cell->getValue(), $cell->getDataType());
                }

                // Clear the original cell
                $worksheet->getCellCollection()->delete($coordinate);
            } else {
                /*    We don't need to update styles for rows/columns before our insertion position,
                        but we do still need to adjust any formulae    in those cells                    */
                if ($cell->getDataType() == DataType::TYPE_FORMULA) {
                    // Formula should be adjusted
                    $cell->setValue($this->updateFormulaReferences($cell->getValue(), $beforeCellAddress, $numberOfColumns, $numberOfRows, $worksheet->getTitle()));
                }
            }
        }

just looping to all coordinate and doing those 2 lines takes 0.8s

$cell = $worksheet->getCell($coordinate);
            $cellIndex = Coordinate::columnIndexFromString($cell->getColumn());

and this takes 0.2s

$this->adjustRowDimensions($worksheet, $beforeCellAddress, $numberOfColumns, $beforeRow, $numberOfRows);

all the rest on the function insertNewBefore doesn't take any time .

So i think that if you keep a reference of all cells coordinates that contains a formula, you could loop only to them and also loop to rows that are after the rows we are currently inserting (in my case, there are none however), you can save 63% of the time if you maths are exacts. And for the rest of the time spent, i'm not sure why adjustRowDimensions need to spend time if i'm adding rows at the end of the document

MarkBaker commented 2 years ago

If you're only adding new rows at the end of the worksheet, then why are you calling insertNewRowBefore() anyway?

Yes, it would be possible to maintain a reference to all cells that contain formulae; this is exactly how MS Excel does this, and is able to do so quickly; but MS Excel has the advantage of not needing to worry about memory limits. Not all worksheets are simply data, some are predominantly formulae, and that reference list of cells containing formulae could become very large; and would itself need to be updated by any rows inserted/deleted

momala454 commented 2 years ago

Isn't it needed to add rows when we want to add data? I've read it somewhere. I don't think the list of rows having formula would be huge if you only save it as the coordinates and use an indexed array, accessing it would be fast. So you're telling me that in my example above I can just remove the line insertnewrow and it will work ? Sorry I can't test myself at the moment as I don't have access to my computer right now

MarkBaker commented 2 years ago

No insertNewRowBefore() isn't necessary when adding new data to the end of the worksheet; it's only needed if you want to insert a new row before the last row in the worksheet; otherwise just set the new data in the appropriate cells.

Access to an indexed array would be faster; I experimented with that technique when I first identified that MS Excel uses that approach. But most developers don't use insertNewRowBefore(), so the trade off was between faster speed for something that very few developers use; but at a memory cost that would affect a high proportion of developers. MS actually uses a nested map, that links every formula cell with cells that are referenced in that formula, so it can recalculate only those cells that are directly affected if a cell value changes. But even just an array of cells that contain formulae could become a very large array, taking a lot of memory... and keeping memory limits low is important when spreadsheets can run to millions of cells; particularly if a lot of those cells contain formulae. MS don't need to worry about that memory usage; but PHP developers do. Perhaps once we reach PHP 8.0 as our minimum version, when we can start using weak maps it will be possible to revisit the idea, because weak maps take a lot less memory, and require less maintenance overhead.

momala454 commented 2 years ago

Thank you I will then remove insertnewrow You could however detect that there are no rows after and do nothing in that case to prevent the same error I did ?

momala454 commented 2 years ago

are you sure that a weak map will use less memory than an array like this

$coordinatesUsingFormula = ['A2' => true, 'C12' => true];

?

komisia commented 11 months ago

This is a real problem. I have no formulas, just adding static rows, and it needs a century to append 2000 rows. Hardware does not matter. I tested it on different VMs with different capabilities like 456456GB of RAM, 67867899 CPU and etc...

zeeshandoit commented 6 months ago

I am also facing the same issue and disappointed that it becomes too slow and I get Gateway Timeout error while using this insertNewRowBefore function. I have to insert rows in betwteen and facing the issue. Is there any proper solution to fix the issue or should we switch to .Net if excel is used in any project?