doy / spreadsheet-parsexlsx

parse XLSX files
http://metacpan.org/release/Spreadsheet-ParseXLSX

quest for speed ... #65

Open Tux opened 8 years ago

Tux commented 8 years ago

I have a 12.5 MB XLSX with 6 sheets. It opens in LibreOffice instantly, but takes 620 seconds to read with Spreadsheet::Read/Spreadsheet::ParseXLSX.

Hoping to speed that up, I used Spreadsheet::ParseXLSX directly, like:

my @wb;
Spreadsheet::ParseXLSX->new(NotSetCell => 1, CellHandler => sub {
    $wb[$_[1]][$_[2]][$_[3]] = $_[4]->value;
})->parse($xf);

but that still takes 513 seconds, and @wb was empty after the run, which only makes things worse :/

Having converted that to CSV, reading the data takes just 1 second.
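For comparison, the CSV read can be sketched with Text::CSV_XS along these lines (a minimal sketch; the filename is hypothetical, and the real benchmark may have differed):

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Slurp the whole CSV into an array of array-refs, one per row.
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', 'data.csv' or die "data.csv: $!";

my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;
}
close $fh;
```

This does no XML parsing and no format handling at all, which is a large part of why it is so much faster than reading the equivalent XLSX.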

Why does Spreadsheet::ParseXLSX take that long? LibreOffice opens the same file almost instantly.

chuchurocket27 commented 8 years ago

I have spreadsheets significantly smaller than that, and it still takes ages to parse a spreadsheet.

Running Devel::NYTProf shows that the long pole is XML::Twig.

doy commented 8 years ago

Yeah, as far as I can tell, the issue is just that XML::Twig is verrrrry slow. I'm not sure what to do about that - I don't really have the time or the energy to rewrite this module with a different backend (and I'd be pretty worried about introducing a bunch of new parsing errors). Maybe XML::Twig itself could be improved? I'm not really familiar with that codebase at all.

mirod commented 8 years ago

Author of XML::Twig here.

Yes, XML::Twig is pretty slow. If you unzip the .xlsx file and parse all the .xml files with XML::Twig, without doing anything with them, you will more or less get the lower limit of the time needed to load the file. That's quite a long time for big files.
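That lower-bound measurement could be reproduced with something like the following sketch (assuming Archive::Zip is installed; the filename is hypothetical):

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use XML::Twig;
use Time::HiRes qw(time);

# Parse every .xml member of the workbook with a bare XML::Twig
# (no handlers, no processing) to estimate the parse-only floor.
my $zip = Archive::Zip->new;
$zip->read('big.xlsx') == AZ_OK or die "cannot read big.xlsx";

my $t0 = time;
for my $member (grep { $_->fileName =~ /\.xml$/ } $zip->members) {
    my $twig = XML::Twig->new;
    $twig->parse(scalar $member->contents);
    $twig->purge;    # free the tree before the next member
}
printf "XML parse-only lower bound: %.1fs\n", time - $t0;
```

Anything Spreadsheet::ParseXLSX does on top of this (namespace mapping, cell objects, format handling) can only add to that time.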

That said, a quick analysis shows that using namespaces (the map_xmlns option, added to fix bug #32 I believe) carries close to a 50% penalty (i.e. if you take it out you get about 30% better performance). Of course, bug #32 then reappears.
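For reference, the namespace mapping in question looks roughly like this (a sketch, not the module's actual code; the handler and worksheet path are hypothetical):

```perl
use strict;
use warnings;
use XML::Twig;

# map_xmlns normalizes the spreadsheetml namespace to the "s" prefix,
# so handlers can match "s:row" regardless of the prefix used in the
# file. Per the analysis above, this mapping is what costs ~50%.
my $twig = XML::Twig->new(
    map_xmlns => {
        'http://schemas.openxmlformats.org/spreadsheetml/2006/main' => 's',
    },
    twig_handlers => {
        's:row' => sub {
            my ($t, $row) = @_;
            # ... process the row ...
            $t->purge;    # keep memory flat on big sheets
        },
    },
);
$twig->parsefile('xl/worksheets/sheet1.xml');
```

Dropping map_xmlns and matching raw element names is faster, but then files that use a non-default prefix break again, which is exactly what bug #32 was about.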

There is a bit of hope though:

Another thing: once I remove XML namespace processing, NYTProf shows Spreadsheet::ParseExcel::Utility::ExcelFmt as the top function by time spent. I suspect this may vary with the type of data, but at least it's one data point.

Other hot functions are: Spreadsheet::ParseExcel::Utility::ExcelLocaltime and Spreadsheet::ParseExcel::FmtDefault::ValFmt.
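ExcelFmt is the routine that applies an Excel number-format string to a raw cell value, so it runs once per formatted cell. A rough usage sketch (the format strings and values here are illustrative):

```perl
use strict;
use warnings;
use Spreadsheet::ParseExcel::Utility qw(ExcelFmt);

# Render an Excel serial date (days since the 1900 epoch) as a string.
my $date = ExcelFmt('yyyy-mm-dd', 39792);

# Render a raw fraction as a percentage.
my $pct = ExcelFmt('0.00%', 0.1234);
```

Since this is called for every cell that has a format attached, a sheet with hundreds of thousands of formatted cells pays this cost hundreds of thousands of times, which is consistent with it topping the profile.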

Also the handler for s:sheetData/s:row is hot. In it you may want to use $cell->first_descendant('s:t') instead of ($cell->find_nodes('.//s:t'))[0], and, if you don't mind being a bit dirty, use direct hash access to attributes ($row_elt->{att}{r} instead of $row_elt->att('r')).
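Side by side, the two suggestions look like this (a sketch assuming $cell and $row_elt are XML::Twig::Elt objects inside the row handler):

```perl
# Slower: find_nodes evaluates the XPath-like query, builds the full
# list of matching descendants, and then throws away all but the first.
my $t_slow = ($cell->find_nodes('.//s:t'))[0];

# Faster: first_descendant stops at the first matching element.
my $t_fast = $cell->first_descendant('s:t');

# Attribute access: the method call vs direct hash access.
my $r_clean = $row_elt->att('r');     # supported API
my $r_dirty = $row_elt->{att}{r};     # "a bit dirty": reaches into
                                      # XML::Twig::Elt internals, but
                                      # skips the method-call overhead
```

The hash-access trick trades encapsulation for speed; it only pays off in handlers that fire once per row or cell, which is exactly the situation here.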

Let me know if I can be of more help.

internationils commented 7 years ago

I'm reading a 12-15 MB file, and it's slow as well. The time is spent in the my $xls = $parser->parse($infile); line; reading the data out afterwards is not much different from CSV reading. Would it be possible to speed that up? One possibility might be ignoring formats (i.e. I only want the raw data: no number formats, alignments, etc.), so maybe some of the load could be pushed back from parse time to ->value() vs ->unformatted()?
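The ->value() vs ->unformatted() distinction mentioned above can be sketched like this (assuming the standard Spreadsheet::ParseExcel cell API that Spreadsheet::ParseXLSX returns; the filename is hypothetical):

```perl
use strict;
use warnings;
use Spreadsheet::ParseXLSX;

my $wb = Spreadsheet::ParseXLSX->new->parse('big.xlsx')
    or die "parse failed";

for my $sheet ($wb->worksheets) {
    my ($rmin, $rmax) = $sheet->row_range;
    my ($cmin, $cmax) = $sheet->col_range;
    for my $r ($rmin .. $rmax) {
        for my $c ($cmin .. $cmax) {
            my $cell = $sheet->get_cell($r, $c) or next;
            my $raw       = $cell->unformatted;  # raw stored value
            my $formatted = $cell->value;        # with number format applied
        }
    }
}
```

The suggestion is that if formatting work (ExcelFmt and friends) were deferred until ->value() is actually called, callers who only ever use ->unformatted would skip that cost entirely at parse time.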