doy / spreadsheet-parsexlsx

parse XLSX files
http://metacpan.org/release/Spreadsheet-ParseXLSX

quest for speed ... #65

Open Tux opened 8 years ago

Tux commented 8 years ago

I have a 12.5 MB XLSX with 6 sheets. It opens in LibreOffice instantly, but takes 620 seconds to read with Spreadsheet::Read/Spreadsheet::ParseXLSX.

Hoping to speed that up, I used Spreadsheet::ParseXLSX directly, like:

my @wb;
Spreadsheet::ParseXLSX->new(NotSetCell => 1, CellHandler => sub {
    $wb[$_[1]][$_[2]][$_[3]] = $_[4]->value;
})->parse($xf);

but that still takes 513 seconds, and @wb was empty after the run, which only makes things worse :/

Having converted that to CSV, reading the data takes just 1 second.
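For comparison, the CSV read can be sketched with Text::CSV_XS along these lines (a minimal sketch; the filename is hypothetical, and the real benchmark may have differed):

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Slurp the whole CSV into an array of array-refs, one per row.
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', 'data.csv' or die "data.csv: $!";

my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;
}
close $fh;
```

This does no XML parsing and no format handling at all, which is a large part of why it is so much faster than reading the equivalent XLSX.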

Why does Spreadsheet::ParseXLSX take that long? LibreOffice opens the same file almost instantly.

chuchurocket27 commented 8 years ago

I have spreadsheets significantly smaller than that, and it still takes ages to parse a spreadsheet.

Running Devel::NYTProf shows that the long pole is XML::Twig.

doy commented 8 years ago

Yeah, as far as I can tell, the issue is just that XML::Twig is verrrrry slow. I'm not sure what to do about that - I don't really have the time or the energy to rewrite this module with a different backend (and I'd be pretty worried about introducing a bunch of new parsing errors). Maybe XML::Twig itself could be improved? I'm not really familiar with that codebase at all.

mirod commented 8 years ago

Author of XML::Twig here.

Yes, XML::Twig is pretty slow. If you unzip the .xlsx file and parse all the .xml files with XML::Twig, without doing anything with them, you will more or less get the lower limit of the time needed to load the file. That's quite a long time for big files.
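That lower-bound measurement could be reproduced with something like the following sketch (assuming Archive::Zip is installed; the filename is hypothetical):

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use XML::Twig;
use Time::HiRes qw(time);

# Parse every .xml member of the workbook with a bare XML::Twig
# (no handlers, no processing) to estimate the parse-only floor.
my $zip = Archive::Zip->new;
$zip->read('big.xlsx') == AZ_OK or die "cannot read big.xlsx";

my $t0 = time;
for my $member (grep { $_->fileName =~ /\.xml$/ } $zip->members) {
    my $twig = XML::Twig->new;
    $twig->parse(scalar $member->contents);
    $twig->purge;    # free the tree before the next member
}
printf "XML parse-only lower bound: %.1fs\n", time - $t0;
```

Anything Spreadsheet::ParseXLSX does on top of this (namespace mapping, cell objects, format handling) can only add to that time.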

That said, a quick analysis shows that using namespaces (the map_xmlns option, added to fix bug #32 I believe) carries close to a 50% penalty (i.e. if you take it out you get about 30% better performance). Of course, bug #32 then reappears.
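For reference, the namespace mapping in question looks roughly like this (a sketch, not the module's actual code; the handler and worksheet path are hypothetical):

```perl
use strict;
use warnings;
use XML::Twig;

# map_xmlns normalizes the spreadsheetml namespace to the "s" prefix,
# so handlers can match "s:row" regardless of the prefix used in the
# file. Per the analysis above, this mapping is what costs ~50%.
my $twig = XML::Twig->new(
    map_xmlns => {
        'http://schemas.openxmlformats.org/spreadsheetml/2006/main' => 's',
    },
    twig_handlers => {
        's:row' => sub {
            my ($t, $row) = @_;
            # ... process the row ...
            $t->purge;    # keep memory flat on big sheets
        },
    },
);
$twig->parsefile('xl/worksheets/sheet1.xml');
```

Dropping map_xmlns and matching raw element names is faster, but then files that use a non-default prefix break again, which is exactly what bug #32 was about.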

There is a bit of hope though:

Another thing: once I remove XML namespace processing, NYTProf shows Spreadsheet::ParseExcel::Utility::ExcelFmt as the top function by time spent. I suspect this may vary with the type of data, but at least it's one data point.

Other hot functions are: Spreadsheet::ParseExcel::Utility::ExcelLocaltime and Spreadsheet::ParseExcel::FmtDefault::ValFmt.
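ExcelFmt is the routine that applies an Excel number-format string to a raw cell value, so it runs once per formatted cell. A rough usage sketch (the format strings and values here are illustrative):

```perl
use strict;
use warnings;
use Spreadsheet::ParseExcel::Utility qw(ExcelFmt);

# Render an Excel serial date (days since the 1900 epoch) as a string.
my $date = ExcelFmt('yyyy-mm-dd', 39792);

# Render a raw fraction as a percentage.
my $pct = ExcelFmt('0.00%', 0.1234);
```

Since this is called for every cell that has a format attached, a sheet with hundreds of thousands of formatted cells pays this cost hundreds of thousands of times, which is consistent with it topping the profile.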

Also the handler for s:sheetData/s:row is hot. In it you may want to use $cell->first_descendant('s:t') instead of ($cell->find_nodes('.//s:t'))[0], and, if you don't mind being a bit dirty, use direct hash access to attributes ($row_elt->{att}{r} instead of $row_elt->att('r')).
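Side by side, the two suggestions look like this (a sketch assuming $cell and $row_elt are XML::Twig::Elt objects inside the row handler):

```perl
# Slower: find_nodes evaluates the XPath-like query, builds the full
# list of matching descendants, and then throws away all but the first.
my $t_slow = ($cell->find_nodes('.//s:t'))[0];

# Faster: first_descendant stops at the first matching element.
my $t_fast = $cell->first_descendant('s:t');

# Attribute access: the method call vs direct hash access.
my $r_clean = $row_elt->att('r');     # supported API
my $r_dirty = $row_elt->{att}{r};     # "a bit dirty": reaches into
                                      # XML::Twig::Elt internals, but
                                      # skips the method-call overhead
```

The hash-access trick trades encapsulation for speed; it only pays off in handlers that fire once per row or cell, which is exactly the situation here.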

Let me know if I can be of more help.

internationils commented 7 years ago

I'm reading a 12-15 MB file, and it's slow as well. The time is spent in the my $xls = $parser->parse($infile); line; reading the data out afterwards is not much different from CSV reading. Would it be possible to speed that up? One possibility might be ignoring formats (i.e. I only want the raw data: no number formats, alignments, etc.), so maybe some of the load could be pushed back from parse time to ->value() vs ->unformatted()?
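The ->value() vs ->unformatted() distinction mentioned above can be sketched like this (assuming the standard Spreadsheet::ParseExcel cell API that Spreadsheet::ParseXLSX returns; the filename is hypothetical):

```perl
use strict;
use warnings;
use Spreadsheet::ParseXLSX;

my $wb = Spreadsheet::ParseXLSX->new->parse('big.xlsx')
    or die "parse failed";

for my $sheet ($wb->worksheets) {
    my ($rmin, $rmax) = $sheet->row_range;
    my ($cmin, $cmax) = $sheet->col_range;
    for my $r ($rmin .. $rmax) {
        for my $c ($cmin .. $cmax) {
            my $cell = $sheet->get_cell($r, $c) or next;
            my $raw       = $cell->unformatted;  # raw stored value
            my $formatted = $cell->value;        # with number format applied
        }
    }
}
```

The suggestion is that if formatting work (ExcelFmt and friends) were deferred until ->value() is actually called, callers who only ever use ->unformatted would skip that cost entirely at parse time.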