dhatim / fastexcel

Generate and read big Excel files quickly
Other
684 stars 122 forks source link

fully streaming reader #67

Closed rzymek closed 5 years ago

rzymek commented 5 years ago

Currently new ReadableWorksheet(InputStream) will read the whole uncompressed xml data into memory. This is how OPCPackage.open(InputStream) works.

It would be great to make fastexcel-reader be able to stream rows as it reads the input stream.

Usually the order of xml files in the xlsx archive is as follows:

-rw----     2.0 fat      571 bl defN 17-Aug-16 12:30 _rels/.rels
-rw----     2.0 fat      271 bl defN 17-Aug-16 12:30 docProps/app.xml
-rw----     2.0 fat      588 bl defN 17-Aug-16 12:30 docProps/core.xml
-rw----     2.0 fat      549 bl defN 17-Aug-16 12:30 xl/_rels/workbook.xml.rels
-rw----     2.0 fat     2725 bl defN 17-Aug-16 12:30 xl/sharedStrings.xml
-rw----     2.0 fat    43085 bl defN 17-Aug-16 12:30 xl/worksheets/sheet1.xml
-rw----     2.0 fat     5637 bl defN 17-Aug-16 12:30 xl/styles.xml
-rw----     2.0 fat      716 bl defN 17-Aug-16 12:30 xl/workbook.xml
-rw----     2.0 fat     1111 bl defN 17-Aug-16 12:30 [Content_Types].xml

This is great and would allow processing on the fly. The zip could be read using ZipInputStream. Shared string table would be created from sharedStrings.xml when it would be encountered. Then rows would be emitted to the user are they are read from sheet1.xml. In this mode accessing sheets would only be allowed in order in which they appear in the archive.

There is one problematic case though. I have already came across an xlsx (saved from MS Excel) where xl/sharedStrings.xml appeared after xl/worksheets/sheet1.xml, like this:

-rw----     2.0 fat      571 bl defN 17-Aug-16 12:30 _rels/.rels
-rw----     2.0 fat      271 bl defN 17-Aug-16 12:30 docProps/app.xml
-rw----     2.0 fat      588 bl defN 17-Aug-16 12:30 docProps/core.xml
-rw----     2.0 fat      549 bl defN 17-Aug-16 12:30 xl/_rels/workbook.xml.rels
-rw----     2.0 fat    43085 bl defN 17-Aug-16 12:30 xl/worksheets/sheet1.xml
-rw----     2.0 fat     2725 bl defN 17-Aug-16 12:30 xl/sharedStrings.xml
-rw----     2.0 fat     5637 bl defN 17-Aug-16 12:30 xl/styles.xml
-rw----     2.0 fat      716 bl defN 17-Aug-16 12:30 xl/workbook.xml
-rw----     2.0 fat     1111 bl defN 17-Aug-16 12:30 [Content_Types].xml

I do hope xl/_rels/workbook.xml.rels always appear before sheet and sharedStrings. This would at least allow for detection of this case: If sharedString.xml is specified in rels and sheet.xml is encountered before sharedString in the zip. The only one solution that comes to my mind. Put aside the raw compressed sheet1.xml part of the input stream to temporary file. Then when sharedString.xml is read from input stream, resume uncompressing sheet1.xml and processing in on the fly then.

Further possible optimizations:
Simpler alternative:

Load the whole InputStream (compressed xlsx) into memory. Then specific parts like sharedString.xml or sheet3.xml could be accessed using the zip's central directory that is located at the end of the archive. (see "Zip file structure" in https://rzymek.github.io/post/excel-zip64/). Maybe OPCPackage has a mode that works this way already. OPCPackage.open(ZipEntrySource)? If not, a contribution to OPCPackage might be a better place for improvement.

What do you think?

rzymek commented 5 years ago

"Simpler alternative" implemented in https://github.com/dhatim/fastexcel/pull/68

ochedru commented 5 years ago

68 merged: this is the simplest approach for now.

To continue the discussion, relying on the physical order of files within the archive is probably dangerous. I guess the classical use case for streaming is: we have a huge workbook and want to read and process all or some data from it. Obviously, we want to use as few memory as possible. Instead of sticking to read the InputStream once and implementing tricks around that, what if we could get a new InputStream as needed (through a Supplier<InputStream>?), i.e. read the source file multiple times? Usually, the workbook is read from a file or from some database, so we can read it as many times as we want: we could trade more IOs for less memory usage.

rzymek commented 5 years ago

I don't mean to rely on order in the zip. Merely to take advantage of the usual order. Mind that I came across only one Excel installation that save xlsx with sharedString after sheet.xml. I suspect it had to do with big sheet saved on a low memory machine (~500Kx120 on 8GB). Again, fastexcel must handle any order. But could that advantage when sharedSting is before sheet.

The cases you're mentioning are already handled with new ReadableWorkbook(File). It does have a low memory footprint (almost 0 after gc - see left side of this graph). I'm thinking about a case where the sheet is uploaded via network. Like importing data into a web application. I'd love to be able to insert the data to db from xlsx as they are being received over the net. I just wander if it's possible without writing own zip reader implementation. I'll have to browse through commons-compress.

rzymek commented 5 years ago

I did a test on skipping zip entries without uncompressing.
Test file: 2.4GB zip file of random data (level default, compression ratio 0%), 3 files inside. Extract only file 4th file.
Results:

Conclusions:

For a typical 30mb file, the time are in range: 20ms-70ms

rzymek commented 5 years ago

After thoughts, I don't think it is worth perusing this optimization. Current solution is much safer, only at the expense of higher, but reasonable memory usage. I'm going to peruse another optimization in the writer, now.