PHPOffice / PhpSpreadsheet

A pure PHP library for reading and writing spreadsheet files
https://phpspreadsheet.readthedocs.io
MIT License

"Allowed memory size of XXXXXXXXX bytes exhausted" when reading some XLS files #2781

Open vaites opened 2 years ago

vaites commented 2 years ago

This is:

- [x] a bug report
- [ ] a feature request
- [ ] **not** a usage question

I'm working with a file generated by another application that I need to fill: read the file, fill some cells, and save it again. I use a reader to get the Spreadsheet instance (so I can set the desired cell values) and a writer to save it.
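
For context, the read/fill/save flow being described is roughly the following sketch (the paths and cell coordinates are placeholders, not taken from the attached file):

<?php

require __DIR__ . '/vendor/autoload.php';

// Placeholder path; the real file is the attached XLS
$path = '/path/to/generated-file.xls';

$reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReaderForFile($path); // picks the Xls reader for this file
$spreadsheet = $reader->load($path);

// Fill the desired cells (coordinate and value are placeholders)
$spreadsheet->getActiveSheet()->setCellValue('B2', 'filled value');

$writer = \PhpOffice\PhpSpreadsheet\IOFactory::createWriter($spreadsheet, 'Xls');
$writer->save('/path/to/filled-file.xls'); // placeholder output path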

What is the expected behavior?

Excel file loaded without issues

What is the current behavior?

PHP runs out of memory

What are the steps to reproduce?

Please provide a Minimal, Complete, and Verifiable example of code that exhibits the issue without relying on an external Excel file or a web server:

<?php

require __DIR__ . '/vendor/autoload.php';

$path = "C:\\path\\to\\file.xls"; // use the attached XLS
$reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReaderForFile($path);
$spreadsheet = $reader->load($path); // memory is exhausted during load(), before the dump below is reached

die(var_dump($spreadsheet->getActiveSheet()->toArray()));

After digging into the reader code I found that the cell range is huge (the shrinkRangeToFit() function returns E9:GU5420), so the reader thinks the file is big (hundreds or thousands of empty cells), although the error can be avoided (see my comments below).

I tried several other things, without success.

What features do you think are causing the issue?

Does an issue affect all spreadsheet file formats? If not, which formats are affected?

Tested with XLS and XLSX (attached file)

Which versions of PhpSpreadsheet and PHP are affected?

As far as I know, all v1.x versions (tried with 1.21, 1.22 and 1.23). I tried both on Windows 10 and Ubuntu 20.04 LTS using PHP 8.1.

shirne commented 2 years ago

Maybe your file was created by wpf or other office software? Try exporting it using the "MS Excel" type from that software.

vaites commented 2 years ago

The exported file can be opened without problems by any version of Excel, so I assume this library should also open it without problems.

Anyway, the attached example is an XLS file created with Excel, and it fails.

maximnl commented 2 years ago

I have a similar issue: I am able to load between 10K and 20K rows, but no more. I tried a chunked row filter; it worked with CSV files, but the Xlsx reader loads the first chunk and then stops.

vaites commented 2 years ago

Can you explain how you are using the filter? I created a filter but it didn't work for me; the whole spreadsheet is read...

maximnl commented 2 years ago

I start reading chunks from row 2, so the first chunk goes as-is. Starting from the second chunk, we splice the array to remove the rows already read in the first chunk; they will all be null. (In the ChunkReadFilter class, rows start from 1 and the first row is always read; you will find many examples.)

// create reader ... whatever you need
$reader->setLoadSheetsOnly($worksheetname);
$chunkFilter = new \PhpOffice\PhpSpreadsheet\Reader\ChunkReadFilter();
$reader->setReadFilter($chunkFilter);

/* Loop to read our worksheet in "chunk size" blocks */
for ($startRow = 2; $startRow <= $totalRows; $startRow += $chunkSize) {
    /* Tell the Read Filter which rows we want this iteration */
    $chunkFilter->setRows($startRow, $chunkSize);
    echo "Loading rows $startRow -\n";
    $spreadsheet = $reader->load($file);
    $sheetData = $spreadsheet->getActiveSheet()->toArray(null, false, false, true);
    if ($startRow > 2) {
        array_splice($sheetData, 1, $startRow - 3);
    }
}
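
Note that ChunkReadFilter is not a class shipped with PhpSpreadsheet; it is a user-defined read filter along the lines of the documentation examples. A minimal sketch of what such a filter typically looks like (class and property names are illustrative):

<?php

use PhpOffice\PhpSpreadsheet\Reader\IReadFilter;

// User-defined chunk read filter, following the pattern from the PhpSpreadsheet docs:
// row 1 (headings) is always read, plus the rows inside the configured window.
class ChunkReadFilter implements IReadFilter
{
    private $startRow = 0;
    private $endRow = 0;

    // Set the start row and the number of rows to read for the current chunk
    public function setRows($startRow, $chunkSize): void
    {
        $this->startRow = $startRow;
        $this->endRow = $startRow + $chunkSize;
    }

    public function readCell($columnAddress, $row, $worksheetName = ''): bool
    {
        // Keep the heading row and any row inside the current chunk window
        return $row == 1 || ($row >= $this->startRow && $row < $this->endRow);
    }
}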

jkdc commented 2 years ago

Hi, I have the same issue. I've built two pieces of functionality, one to export some data and another to import that data. If I remove one line from the file and try to import it, I get the same error when I call $worksheet->getActiveSheet()->toArray(), which doesn't make sense to me because I have only 5 rows, not a large file. It's a bit weird: if I edit the file adding some values, or adding new rows, and then import it, everything works fine; the error only happens when I remove rows. Any ideas?

Example files: Working --> medidas_work.xlsx / Not working --> medidas.xlsx

MarkBaker commented 2 years ago

@jkdc Ok, looking at the medidas.xlsx file that doesn't work, starting with Worksheet #2 (Medidas): it contains a dimension reference of A1:R1048576.

This tells PhpSpreadsheet that when you call toArray(), it should create an array matching the specified dimensions (A1:R1048576), so yes, that will exhaust your memory.

Rather than using toArray(), you could consider processing the data using a row iterator instead, which is a lot more memory-efficient than trying to create a giant array to iterate over. If you still prefer to extract the data from the worksheet into an array (and arrays are always memory-heavy in PHP), then work out the maximum row and column that actually contain cell data with calls to getHighestDataColumn() and getHighestDataRow() to identify the data range (the worksheet object even has a handy method called calculateWorksheetDataDimension() to do that for you), and then use the rangeToArray() method instead.

Using the following code to load the file and copy the data into a PHP array:

require __DIR__ . '/vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\IOFactory;

$inputFileType = 'Xlsx';
$inputFileName = __DIR__ . '/../medidas.xlsx';

$callStartTime = microtime(true);

$reader = IOFactory::createReader($inputFileType);

$spreadsheet = $reader->load($inputFileName);

$callEndTime = microtime(true);
$loadCallTime = $callEndTime - $callStartTime;

echo PHP_EOL;
echo 'Call time to load spreadsheet file was ' , sprintf('%.4f', $loadCallTime) , ' seconds' , PHP_EOL;

$worksheet = $spreadsheet->getActiveSheet();
echo "Worksheet Name: {$worksheet->getTitle()}", PHP_EOL;
echo "Range Dimensions specified in the xlsx file {$worksheet->calculateWorksheetDimension()} that will be used by the toArray() method", PHP_EOL;
echo "Range of cells that contain actual data {$worksheet->calculateWorksheetDataDimension()} that can be passed to the rangeToArray() method", PHP_EOL;

$data = $worksheet->rangeToArray($worksheet->calculateWorksheetDataDimension(), null, true, true, true);

// Echo memory usage
echo ' Current memory usage: ' , (memory_get_usage(true) / 1024) , ' KB' , PHP_EOL;
echo '    Peak memory usage: ' , (memory_get_peak_usage(true) / 1024) , ' KB' , PHP_EOL;

The result:

Worksheet Name: Medidas
Range Dimensions specified in the xlsx file A1:R1048576 that will be used by the toArray() method
Range of cells that contain actual data A1:R5 that can be passed to the rangeToArray() method

Call time to load spreadsheet file was 0.0855 seconds
 Current memory usage: 6144 KB
    Peak memory usage: 6144 KB
jkdc commented 2 years ago

@MarkBaker thanks a lot!! I've tested it and, as you say, it works better using rangeToArray() with calculateWorksheetDataDimension(). Thanks again for your fast answer.

maximnl commented 2 years ago

In my case I had no issues with data range detection; range detection takes an extra 0.01 seconds or so.

See the full code below for a preview of 500 rows of data:


$inputFileType = ''; // default
$rowsPreview = 500;
$chunkSize = $rowsPreview;
$startRow = 0;
$tstart = microtime(true);
$worksheetname = '';
$tab = 0; // index of the worksheet tab to preview (assumed default; $file is expected to be set already)

    if(file_exists($file)){ 
        echo "date modified: " . date ("F d Y H:i:s.", filemtime($file)). '</br>';
        $inputFileType = \PhpOffice\PhpSpreadsheet\IOFactory::identify($file);      
        echo '<p>Detected file type is <kbd>'.$inputFileType.'</kbd></p>';          
        if ($inputFileType == 'Xlsx' || $inputFileType == 'Xls') {
            $reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReader($inputFileType);
            //$reader->setReadDataOnly(true);
            $reader->setReadEmptyCells(false); // don't create cell objects for empty cells
        } elseif ($inputFileType == 'Csv') {
            $reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReader($inputFileType);
        } else {
            $reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReader($inputFileType); //$spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load($file);
        }

        $s = 0;
        //$reader->setReadDataOnly(true);
        $reader->setReadEmptyCells(false); // don't create cell objects for empty cells

        echo 'File data information<br>';
        $worksheetData = $reader->listWorksheetInfo($file);
        foreach ($worksheetData as $worksheet) {
            echo "Tab $s - <kbd>".$worksheet['worksheetName']. '</kbd>  Rows: <kbd>', $worksheet['totalRows'],
            '</kbd> Columns: <kbd>', $worksheet['totalColumns'], '</kbd> Cell Range: <kbd>A1:', $worksheet['lastColumnLetter'], $worksheet['totalRows'].'</kbd>'. ($s==$tab?'<- active':'') .'<br>';
            if($s==$tab){$worksheetname=$worksheet['worksheetName'];}
            $s++;
        }   

        $reader->setLoadSheetsOnly($worksheetname); 
        /**  Instantiate a new Spreadsheet object manually  **/
        //$spreadsheet = new \PhpOffice\PhpSpreadsheet\Spreadsheet();
        $chunkFilter = new \PhpOffice\PhpSpreadsheet\Reader\ChunkReadFilter();
        $reader->setReadFilter($chunkFilter);//->setContiguous(true);
        $chunkFilter->setRows($startRow,$chunkSize);

        //$spreadsheet = $reader->loadIntoExisting($file,$spreadsheet);
        $spreadsheet = $reader->load($file);

        if(is_object($spreadsheet)){
            $sheetData = $spreadsheet->getActiveSheet()->toArray(null, false, true, true);
            //$sheetData = $spreadsheet->getActiveSheet()->rangeToArray($spreadsheet->getActiveSheet()->calculateWorksheetDataDimension(),null, true, true, true); // not working, leads to timeout

            $count = count($sheetData) - 1;
            echo "<p> Preview <kbd>$worksheetname</kbd> sheet. A number of lines: <kbd> $rowsPreview </kbd></p>";
            echo '</div>';

            $valuesData='';
            $recordCount=0;
            echo '<table  class="table table-striped table-sm table-hover table-responsive table-bordered" >';
            echo '<thead class="thead-light">'; // a header

            $counter=0;
            foreach($sheetData as $k => $v) {
                $counter=$counter+1;
                if ($counter>$rowsPreview) {break 1;}

                if($k == 1) {// table data
                    echo '<tr>';
                    echo '<th>row</th>';
                    foreach($v as $key => $value) {
                        echo '<th>' . $key . '</th>';
                    }
                    echo '</tr>';
                }

                echo '<tr><td>'.$k.'</td>';
                foreach($v as $key => $value) {

                    if($k > 1) { 
                        echo '<td>' . $value . '</td>';
                    }
                    else {
                        echo '<th>' . $value . '</th>';
                    }
                }
                echo '</tr>';

                if($k == 1) {// table data
                    echo '</thead>';

                }   

            } // end for each row

            echo '</table>';
        } // sheet is object
        else { echo 'Failed reading'; }
    } // end if (file_exists($file))

$tend=microtime(true);  
echo 'It took '.round($tend - $tstart,2).' seconds <br>';
MarkBaker commented 2 years ago

Whether you're using toArray() or rangeToArray(), pulling the spreadsheet data into an array will still take a significant amount of memory; arrays in PHP are big memory consumers by their nature. And as a general warning, calculateWorksheetDataDimension() still assumes that data starts at cell A1: if you've been chunk-reading, and (for example) loaded the chunk of data between A500 and Z1000, then calculateWorksheetDataDimension() will report a range of A1:Z1000 rather than A500:Z1000, so rangeToArray will still generate a large array with 499 empty rows at the beginning.
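
If you do stick with rangeToArray() while chunk reading, a sketch of building the chunk's real range yourself, assuming $startRow and $chunkSize are the values that were passed to the chunk filter for this iteration:

// Bound the range by the cells that actually contain data, starting at the chunk's own first row
$worksheet     = $spreadsheet->getActiveSheet();
$highestColumn = $worksheet->getHighestDataColumn();
$highestRow    = $worksheet->getHighestDataRow();
$endRow        = min($highestRow, $startRow + $chunkSize - 1);

$chunkData = $worksheet->rangeToArray("A{$startRow}:{$highestColumn}{$endRow}", null, true, true, true);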

If you want to iterate over the rows and columns of worksheet data, and are hitting memory limits doing so, then I'd recommend using iterators rather than pulling it all into an array; and the row/column iterators allow you to specify start and end row/column.
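
A minimal sketch of that iterator approach, assuming $worksheet is the loaded sheet; the row and column bounds below are example values only:

// Iterate only the rows and columns you care about, without building one giant array
$startRow = 2;
$endRow   = 501;

foreach ($worksheet->getRowIterator($startRow, $endRow) as $row) {
    $cellIterator = $row->getCellIterator('A', 'R');      // limit the columns as well
    $cellIterator->setIterateOnlyExistingCells(true);     // skip cells that were never created
    foreach ($cellIterator as $cell) {
        // process $cell->getValue() here, one row at a time
    }
}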

vaites commented 2 years ago

The conversation has drifted to a different problem than the one originally reported. I'm reporting that memory is exhausted when calling load(), BEFORE toArray() is ever called.

MarkBaker commented 2 years ago

@vaites And I can't figure out any way of fixing that easily: when the file tells us one thing, such as shrinkRangeToFit() returning E9:GU5420, should PhpSpreadsheet believe it or not?

vaites commented 2 years ago

@MarkBaker, of course the library should trust shrinkRangeToFit(), but I think the problem I reported is not related to the spreadsheet size, because everything works perfectly if I replace strlen($this->data) with mb_strlen($this->data) in the Xls reader. Why? Is there something wrong with my file's encoding? Is this not a bug? I solved the problem temporarily by copying the Xls reader class into my project and making that change, but this is not a valid long-term solution...

Is there anything more I can do to provide more information?

MarkBaker commented 2 years ago

There's no logical reason why using mb_strlen($this->data) instead of strlen($this->data) should work, and every reason to believe that it would create errors where none existed. The BIFF format used for Xls files is a binary stream, so we need to know how many data bytes we're looking at, not how many characters are in a string.
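
To illustrate the distinction with a standalone snippet (not library code): strlen() counts bytes while mb_strlen() counts characters, so the two disagree as soon as the data contains multi-byte sequences, and offsets into a binary stream computed from character counts would be wrong.

<?php

$data = "caf\xC3\xA9";               // 5 bytes, 4 UTF-8 characters ("café")

var_dump(strlen($data));             // int(5) - byte count, what a binary BIFF stream needs
var_dump(mb_strlen($data, 'UTF-8')); // int(4) - character count, wrong for binary offsets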

vaites commented 2 years ago

The mb_strlen change doesn't make sense to me either, but the fact is that it works.

Let me evaluate the problem further to try to determine the exact point where it fails and if I can suggest any changes.

MarkBaker commented 2 years ago

I know that I could refactor how we store data validations, so that they're stored at the worksheet level rather than for each individual cell; but that would be a major BC break, and would entail changes to every Reader and Writer for formats that support data validation.

manstie commented 2 years ago

This also caused issues for me, with this simple Excel file: Bad File.xlsx. The entire file is formatted in Arial/10, and for each row you add it seems to get exponentially worse. Just trying to import 20 rows from this file took 5-10 seconds.

MarkBaker commented 2 years ago

@manstie

require __DIR__ . '/vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\IOFactory;

$inputFileType = 'Xlsx';
$inputFileName = __DIR__ . '/../Bad.File.xlsx';

$callStartTime = microtime(true);

$reader = IOFactory::createReader($inputFileType);

$spreadsheet = $reader->load($inputFileName);

$callEndTime = microtime(true);
$loadCallTime = $callEndTime - $callStartTime;

echo PHP_EOL;
echo 'Call time to load spreadsheet file was ' , sprintf('%.4f', $loadCallTime) , ' seconds' , PHP_EOL;

echo ' Current memory usage: ' , (memory_get_usage(true) / 1024) , ' KB' , PHP_EOL;

$worksheet = $spreadsheet->getActiveSheet();
echo "Worksheet Name: {$worksheet->getTitle()}", PHP_EOL;
echo "Range Dimensions specified in the xlsx file {$worksheet->calculateWorksheetDimension()} that will be used by the toArray() method", PHP_EOL;
echo "Range of cells that contain actual data {$worksheet->calculateWorksheetDataDimension()} that can be passed to the rangeToArray() method", PHP_EOL;

$data = $worksheet->rangeToArray($worksheet->calculateWorksheetDataDimension(), null, true, true, true);

// Echo memory usage
echo ' Current memory usage: ' , (memory_get_usage(true) / 1024) , ' KB' , PHP_EOL;
echo '    Peak memory usage: ' , (memory_get_peak_usage(true) / 1024) , ' KB' , PHP_EOL;

Result:

Call time to load spreadsheet file was 0.1712 seconds
 Current memory usage: 6144 KB
Worksheet Name: Upload
Range Dimensions specified in the xlsx file A1:WVI201 that will be used by the toArray() method
Range of cells that contain actual data A1:B201 that can be passed to the rangeToArray() method
 Current memory usage: 6144 KB
    Peak memory usage: 6144 KB

I don't know what you're doing to "import" rows from the file because you don't explain; but check the actual data size determined by $worksheet->calculateWorksheetDataDimension(), not the reported data size returned by $worksheet->calculateWorksheetDimension().

manstie commented 2 years ago

I am using the Laravel-Excel extension, which employs PhpSpreadsheet, so given your example above I assume they just use the toArray() method; I'll pass this on to them. I am curious as to where A1:WVI201 comes from, because if you search the XML there is only <dimension ref="A1:B201"/>. Thanks.

MarkBaker commented 2 years ago

But there are column styles defined up to column WVI:

<col customWidth="1" bestFit="1" width="15.88671875" max="16129" min="16129"/>

The code that populates the stored value retrieved by a call to calculateWorksheetDimension() looks at all column/row and cell references in the file to identify the maximum column and row, and that includes column dimensions defined in the file.

For most spreadsheet files, using this value isn't a problem: it is populated when the file is loaded, stored in the Worksheet object, and is generally accurate enough (it might be inaccurate as a measure of cells containing data by a few rows or columns, but rarely enough to create any problem). Files where the value is extremely incorrect and high are the exception. It doesn't take "chunking" into account, and it isn't changed if you add/delete rows/columns either. The alternative, calculateWorksheetDataDimension(), calculates based on actual cell data at the point when the method is called, so it's a lot slower compared with calculateWorksheetDimension() (though I've boosted the method speed in the latest code). It ignores row/column dimensions, but checks whether cell objects exist... of course, there may be a lot of empty cell objects as well, so it doesn't take that into account, and different developers also have different definitions of what empty means; but I've also added methods to identify empty rows/columns for the next release.

maximnl commented 2 years ago

That is abnormal; I am importing files with 20K+ rows in 10 seconds.

You can send me your file and I'll test it on my setup.

manstie commented 2 years ago

... The alternative calculateWorksheetDataDimension() calculates based on actual cell data at the point when the method is called, so it's a lot slower compared with calculateWorksheetDimension()

Is dimension ref not an accurate representation of the dimensions of the data?

That is abnormal; I am importing files with 20K+ rows in 10 seconds. You can send me your file and I'll test it on my setup.

Don't worry @maximnl, @MarkBaker has explained everything.

MarkBaker commented 2 years ago

Is dimension ref not an accurate representation of the dimensions of the data?

It should be, but isn't. The problem file in this thread posted by jkdc had a <dimension ref="A1:R1048576"/> for a file containing 5 rows of data; and for files generated by other scripts, the dimension ref may not even be set. There may also be additional information to load outside the defined area (print settings, style settings, conditional formatting, data validations, etc.) if the file is being used as a template to populate data before saving it, and this will also affect the cached range value.

goetas commented 1 year ago

I have a similar issue. The culprit is the auto filter. I have a spreadsheet with ~10 worksheets, each with ~1000 rows.

With the autofilter the memory usage is ~2000MB, without it ~40MB. I do not yet have enough internal knowledge of this project to find out why the memory is not released after the filter evaluation.

MarkBaker commented 1 year ago

I have a similar issue. The culprit is the auto filter. I have a spreadsheet with ~10 worksheets, each with ~1000 rows.

With the autofilter the memory usage is ~2000MB, without it ~40MB. I do not yet have enough internal knowledge of this project to find out why the memory is not released after the filter evaluation.

If you're doing an autofilter evaluation, and memory isn't being released, then that's a very specific issue. Can you provide an example spreadsheet that demonstrates the problem for testing?

goetas commented 1 year ago

I did investigate a little more.

The issue is that my range is defined as A1:A100000, which forces the filter to iterate over all the rows in the range to check whether it needs to hide them ($this->workSheet->getRowDimension((int) $row)->setVisible($result);).

That allocates a RowDimension object for each row of each sheet, and boom, a lot of memory is wasted.

goetas commented 1 year ago

\PhpOffice\PhpSpreadsheet\Worksheet\AutoFilter line 1053.

goetas commented 1 year ago

Calculating the highest row and limiting the filter range to it solves the issue.

I think it should be mentioned somewhere (docs?) that AutoFilter can take a ton of memory, and that it is recommended to keep the filter range as small as possible.
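
A sketch of that workaround, assuming $worksheet is the sheet whose autofilter covers column A far beyond the actual data (the exact range is illustrative):

// Shrink the autofilter range to the rows that actually hold data before it is evaluated,
// so a RowDimension object is not allocated for every row up to 100000.
$highestRow = $worksheet->getHighestDataRow();
$worksheet->getAutoFilter()->setRange('A1:A' . $highestRow);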

goetas commented 1 year ago

I've re-read the description of this ticket and it seems to be something different; however, I have created a fix for the issue I encountered: https://github.com/PHPOffice/PhpSpreadsheet/pull/3527