rzymek commented 5 years ago

Currently new ReadableWorksheet(InputStream) will read the whole uncompressed xml data into memory. This is how OPCPackage.open(InputStream) works.

It would be great to make fastexcel-reader be able to stream rows as it reads the input stream.

Usually the order of xml files in the xlsx archive is as follows:

-rw----     2.0 fat      571 bl defN 17-Aug-16 12:30 _rels/.rels
-rw----     2.0 fat      271 bl defN 17-Aug-16 12:30 docProps/app.xml
-rw----     2.0 fat      588 bl defN 17-Aug-16 12:30 docProps/core.xml
-rw----     2.0 fat      549 bl defN 17-Aug-16 12:30 xl/_rels/workbook.xml.rels
-rw----     2.0 fat     2725 bl defN 17-Aug-16 12:30 xl/sharedStrings.xml
-rw----     2.0 fat    43085 bl defN 17-Aug-16 12:30 xl/worksheets/sheet1.xml
-rw----     2.0 fat     5637 bl defN 17-Aug-16 12:30 xl/styles.xml
-rw----     2.0 fat      716 bl defN 17-Aug-16 12:30 xl/workbook.xml
-rw----     2.0 fat     1111 bl defN 17-Aug-16 12:30 [Content_Types].xml

This is great and would allow processing on the fly. The zip could be read using ZipInputStream. Shared string table would be created from sharedStrings.xml when it would be encountered. Then rows would be emitted to the user are they are read from sheet1.xml. In this mode accessing sheets would only be allowed in order in which they appear in the archive.

There is one problematic case though. I have already came across an xlsx (saved from MS Excel) where xl/sharedStrings.xml appeared after xl/worksheets/sheet1.xml, like this:

-rw----     2.0 fat      571 bl defN 17-Aug-16 12:30 _rels/.rels
-rw----     2.0 fat      271 bl defN 17-Aug-16 12:30 docProps/app.xml
-rw----     2.0 fat      588 bl defN 17-Aug-16 12:30 docProps/core.xml
-rw----     2.0 fat      549 bl defN 17-Aug-16 12:30 xl/_rels/workbook.xml.rels
-rw----     2.0 fat    43085 bl defN 17-Aug-16 12:30 xl/worksheets/sheet1.xml
-rw----     2.0 fat     2725 bl defN 17-Aug-16 12:30 xl/sharedStrings.xml
-rw----     2.0 fat     5637 bl defN 17-Aug-16 12:30 xl/styles.xml
-rw----     2.0 fat      716 bl defN 17-Aug-16 12:30 xl/workbook.xml
-rw----     2.0 fat     1111 bl defN 17-Aug-16 12:30 [Content_Types].xml

I do hope xl/_rels/workbook.xml.rels always appear before sheet and sharedStrings. This would at least allow for detection of this case: If sharedString.xml is specified in rels and sheet.xml is encountered before sharedString in the zip. The only one solution that comes to my mind. Put aside the raw compressed sheet1.xml part of the input stream to temporary file. Then when sharedString.xml is read from input stream, resume uncompressing sheet1.xml and processing in on the fly then.

Further possible optimizations:

Skip xlsx entries that we're not parsing anyway without uncompressing them (docProps/app.xml, xl/styles.xml)
Instead of saving compressed sheet.xml to temp file, start with putting is aside in memory. Then save to file only after some threshold is reached.
Fully skip sheets.xml that the user is not interested in. For example: Using InputStream source; The user asks for sheet3.xml (that is after name to id resolution). sheet1 and sheet2 are skipped, uncompressed when reading the InputStream. Only when sheet3.xml is encounted, it is processed and rows streamed to the users. Accessing sheet1 or sheet2 after that would be not possible.

Simpler alternative:

Load the whole InputStream (compressed xlsx) into memory. Then specific parts like sharedString.xml or sheet3.xml could be accessed using the zip's central directory that is located at the end of the archive. (see "Zip file structure" in https://rzymek.github.io/post/excel-zip64/). Maybe OPCPackage has a mode that works this way already. OPCPackage.open(ZipEntrySource)? If not, a contribution to OPCPackage might be a better place for improvement.

What do you think?

rzymek commented 5 years ago

"Simpler alternative" implemented in https://github.com/dhatim/fastexcel/pull/68

ochedru commented 5 years ago

68 merged: this is the simplest approach for now.

To continue the discussion, relying on the physical order of files within the archive is probably dangerous. I guess the classical use case for streaming is: we have a huge workbook and want to read and process all or some data from it. Obviously, we want to use as few memory as possible. Instead of sticking to read the InputStream once and implementing tricks around that, what if we could get a new InputStream as needed (through a Supplier<InputStream>?), i.e. read the source file multiple times? Usually, the workbook is read from a file or from some database, so we can read it as many times as we want: we could trade more IOs for less memory usage.

rzymek commented 5 years ago

I don't mean to rely on order in the zip. Merely to take advantage of the usual order. Mind that I came across only one Excel installation that save xlsx with sharedString after sheet.xml. I suspect it had to do with big sheet saved on a low memory machine (~500Kx120 on 8GB). Again, fastexcel must handle any order. But could that advantage when sharedSting is before sheet.

The cases you're mentioning are already handled with new ReadableWorkbook(File). It does have a low memory footprint (almost 0 after gc - see left side of this graph). I'm thinking about a case where the sheet is uploaded via network. Like importing data into a web application. I'd love to be able to insert the data to db from xlsx as they are being received over the net. I just wander if it's possible without writing own zip reader implementation. I'll have to browse through commons-compress.

rzymek commented 5 years ago

I did a test on skipping zip entries without uncompressing.
Test file: 2.4GB zip file of random data (level default, compression ratio 0%), 3 files inside. Extract only file 4th file.
Results:

FileInputStream - just read the whole file (baseline): 530ms
ZipFile: 600ms
Skipping entries on the stream without uncompression (prototype): 1s80ms
common-compress: 1s400ms
java.util.ZipInputStream: 3s200ms

Conclusions:

commons-compress seems to be doing the skips without uncompression. It looks like calling getNextEntry() one after the other does not uncompress the entry. (There's no closeEntry() method in commons-compress)
java.util.ZipInputStream does seem to uncompress entries even when closeEntry() is called right after getNextEntry()
ZipFile is the best. It probably does the best it can having random access to the data. That is: go to end, get requested entry offset, seek to that position, read and uncompress only that entry.
custom prototype is not that much better then commons-compress. It's rather not worth while.
There's not that much time to be gained. xlsx files are not in GB. A million row, one sheet xlsx is about 300mb. Results for 350mb file:
- ZipInputStream: 450ms
- commons-compress: 220ms
- ZipFile: 110ms
- custom prototype: 170ms

For a typical 30mb file, the time are in range: 20ms-70ms

rzymek commented 5 years ago

After thoughts, I don't think it is worth perusing this optimization. Current solution is much safer, only at the expense of higher, but reasonable memory usage. I'm going to peruse another optimization in the writer, now.

dhatim / fastexcel

fully streaming reader #67

Further possible optimizations:

Simpler alternative:

68 merged: this is the simplest approach for now.