Closed rzymek closed 5 years ago
"Simpler alternative" implemented in https://github.com/dhatim/fastexcel/pull/68
To continue the discussion, relying on the physical order of files within the archive is probably dangerous.
I guess the classical use case for streaming is: we have a huge workbook and want to read and process all or some data from it. Obviously, we want to use as little memory as possible. Instead of insisting on reading the InputStream once and implementing tricks around that, what if we could get a new InputStream as needed (through a Supplier<InputStream>?), i.e. read the source file multiple times? Usually, the workbook is read from a file or from some database, so we can read it as many times as we want: we could trade more I/O for less memory usage.
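A minimal sketch of the Supplier<InputStream> idea, assuming the source can be reopened on demand. The class and method names (MultiPassSource, openEntry, readEntryAsString, zipOf) are hypothetical and not part of fastexcel; only java.util.zip is used. Each call opens a fresh stream and scans forward to the requested part, so the physical order of entries no longer matters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class MultiPassSource {

    /** Opens a fresh stream from the supplier and positions it at the named entry. */
    static InputStream openEntry(Supplier<InputStream> source, String name) throws IOException {
        ZipInputStream zip = new ZipInputStream(source.get());
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.getName().equals(name)) {
                // Reading from zip now yields only this entry's data, then end of stream.
                return zip;
            }
        }
        zip.close();
        throw new FileNotFoundException(name);
    }

    /** Convenience for small parts; a real reader would parse the stream instead. */
    static String readEntryAsString(Supplier<InputStream> source, String name) {
        try (InputStream in = openEntry(source, name)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: builds an in-memory zip from (name, content) pairs, in the given order. */
    static byte[] zipOf(String... namesAndContents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < namesAndContents.length; i += 2) {
                zip.putNextEntry(new ZipEntry(namesAndContents[i]));
                zip.write(namesAndContents[i + 1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // The sheet is stored before the shared strings, i.e. the unusual order.
        byte[] xlsx = zipOf("xl/worksheets/sheet1.xml", "<worksheet/>",
                            "xl/sharedStrings.xml", "<sst/>");
        Supplier<InputStream> source = () -> new ByteArrayInputStream(xlsx);

        // First pass: shared strings. Second pass: the sheet.
        System.out.println(readEntryAsString(source, "xl/sharedStrings.xml"));
        System.out.println(readEntryAsString(source, "xl/worksheets/sheet1.xml"));
    }
}
```

The cost is one extra scan of the source per part, which is the "more I/O for less memory" trade mentioned above.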
I don't mean to rely on the order in the zip, merely to take advantage of the usual order. Mind that I came across only one Excel installation that saves xlsx with sharedStrings after sheet.xml. I suspect it had to do with a big sheet saved on a low-memory machine (~500Kx120 on 8GB). Again, fastexcel must handle any order, but it could take advantage of the case where sharedStrings comes before the sheet.
The cases you're mentioning are already handled with new ReadableWorkbook(File). It does have a low memory footprint (almost 0 after gc - see left side of this graph).
I'm thinking about a case where the sheet is uploaded via network, like importing data into a web application. I'd love to be able to insert the data into the db from the xlsx as it is being received over the net.
I just wonder if it's possible without writing our own zip reader implementation. I'll have to browse through commons-compress.
I did a test on skipping zip entries without uncompressing.
Test file: 2.4GB zip file of random data (level default, compression ratio 0%), 3 files inside. Extract only the 4th file.
Results:
- FileInputStream (just read the whole file, baseline): 530ms
- ZipFile: 600ms
- java.util.ZipInputStream: 3s200ms
Conclusions:
- Calling getNextEntry() one after the other does not uncompress the entry. (There's no closeEntry() method in commons-compress.)
- java.util.ZipInputStream does seem to uncompress entries even when closeEntry() is called right after getNextEntry().
- ZipFile is the best. It probably does the best it can, having random access to the data. That is: go to the end, get the requested entry offset, seek to that position, read and uncompress only that entry.
ZipInputStream: 450ms
ZipFile: 110ms
For a typical 30mb file, the times are in the range 20ms-70ms.
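The comparison above can be tried in miniature with the sketch below. It is not the code used for the measurements in this thread; the class and method names (ZipSkipComparison, sizeViaStream, sizeViaZipFile, writeSample) are hypothetical, the sample archive is tiny, and it covers only java.util.zip, not commons-compress. Both methods read the same entry and return its uncompressed size: one by walking the archive sequentially, the other through the central directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipSkipComparison {

    /** Sequential access: every entry before the target has to be passed over. */
    static long sizeViaStream(Path zip, String name) {
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(zip))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    // Read the entry to its end and count the bytes.
                    return in.transferTo(OutputStream.nullOutputStream());
                }
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Random access: the entry is looked up by name and only that entry is read. */
    static long sizeViaZipFile(Path zip, String name) {
        try (ZipFile file = new ZipFile(zip.toFile())) {
            ZipEntry entry = file.getEntry(name);
            if (entry == null) {
                return -1;
            }
            try (InputStream in = file.getInputStream(entry)) {
                return in.transferTo(OutputStream.nullOutputStream());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a temporary archive with entries file1..fileN of random (incompressible) data. */
    static Path writeSample(int entries, int size) {
        try {
            Path zip = Files.createTempFile("skip-test", ".zip");
            zip.toFile().deleteOnExit();
            byte[] data = new byte[size];
            new Random(42).nextBytes(data);
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
                for (int i = 1; i <= entries; i++) {
                    out.putNextEntry(new ZipEntry("file" + i));
                    out.write(data);
                    out.closeEntry();
                }
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path zip = writeSample(4, 1 << 20);
        long t0 = System.nanoTime();
        long viaStream = sizeViaStream(zip, "file4");
        long t1 = System.nanoTime();
        long viaZipFile = sizeViaZipFile(zip, "file4");
        long t2 = System.nanoTime();
        System.out.printf("ZipInputStream: %d bytes in %d ms%n", viaStream, (t1 - t0) / 1_000_000);
        System.out.printf("ZipFile:        %d bytes in %d ms%n", viaZipFile, (t2 - t1) / 1_000_000);
    }
}
```

Timings from such a small file say little on their own; a multi-gigabyte archive like the one described above would be needed to see a meaningful difference.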
On second thought, I don't think it is worth pursuing this optimization. The current solution is much safer, at the expense of higher but reasonable memory usage. I'm going to pursue another optimization in the writer now.
Currently new ReadableWorksheet(InputStream) will read the whole uncompressed xml data into memory. This is how OPCPackage.open(InputStream) works.
It would be great to make fastexcel-reader able to stream rows as it reads the input stream.
Usually the order of xml files in the xlsx archive is as follows:
This is great and would allow processing on the fly. The zip could be read using ZipInputStream. The shared string table would be created from sharedStrings.xml when it is encountered. Then rows would be emitted to the user as they are read from sheet1.xml. In this mode, accessing sheets would only be allowed in the order in which they appear in the archive.
There is one problematic case though. I have already come across an xlsx (saved from MS Excel) where xl/sharedStrings.xml appeared after xl/worksheets/sheet1.xml, like this:
I do hope xl/_rels/workbook.xml.rels always appears before the sheet and sharedStrings. This would at least allow for detection of this case: sharedStrings.xml is specified in rels and sheet.xml is encountered before sharedStrings.xml in the zip. Only one solution comes to my mind: put aside the raw compressed sheet1.xml part of the input stream in a temporary file. Then, when sharedStrings.xml is read from the input stream, resume uncompressing sheet1.xml and process it on the fly.
Further possible optimizations:
docProps/app.xml, xl/styles.xml)
InputStream source; The user asks for sheet3.xml (that is after name to id resolution). sheet1 and sheet2 are skipped, uncompressed when reading the InputStream. Only when sheet3.xml is encountered, it is processed and rows are streamed to the user. Accessing sheet1 or sheet2 after that would not be possible.
Simpler alternative:
Load the whole InputStream (compressed xlsx) into memory. Then specific parts like sharedStrings.xml or sheet3.xml could be accessed using the zip's central directory, which is located at the end of the archive (see "Zip file structure" in https://rzymek.github.io/post/excel-zip64/). Maybe OPCPackage has a mode that works this way already. OPCPackage.open(ZipEntrySource)? If not, a contribution to OPCPackage might be a better place for improvement.
What do you think?
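For reference, the single-pass streaming mode described earlier in this post (read with ZipInputStream, fall back to a temporary file when the sheet arrives before sharedStrings.xml) could look roughly like the sketch below. It is a simplified illustration, not fastexcel code: the names (OrderAwareStreaming, process, zipOf) are hypothetical, parsing is replaced by log entries, the rels check is omitted, and the sheet is spooled uncompressed because java.util.zip.ZipInputStream does not expose the raw compressed bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OrderAwareStreaming {
    static final String SHARED_STRINGS = "xl/sharedStrings.xml";

    /** Walks the archive once and returns the steps taken, in order. */
    static List<String> process(InputStream xlsx, String sheetName) {
        List<String> steps = new ArrayList<>();
        boolean sharedStringsSeen = false;
        Path spooledSheet = null;
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals(SHARED_STRINGS)) {
                    sharedStringsSeen = true;
                    steps.add("load shared strings"); // a real reader would parse here
                    if (spooledSheet != null) {
                        steps.add("stream sheet from temp file");
                        Files.delete(spooledSheet);
                        spooledSheet = null;
                    }
                } else if (name.equals(sheetName)) {
                    if (sharedStringsSeen) {
                        steps.add("stream sheet directly"); // rows emitted as they are read
                    } else {
                        // Unusual order: put the sheet aside until the strings are known.
                        spooledSheet = Files.createTempFile("sheet", ".xml");
                        Files.copy(zip, spooledSheet, StandardCopyOption.REPLACE_EXISTING);
                        steps.add("spool sheet to temp file");
                    }
                }
            }
            if (spooledSheet != null) {
                // The archive had no shared strings part at all.
                steps.add("stream sheet from temp file");
                Files.delete(spooledSheet);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return steps;
    }

    /** Demo helper: builds an in-memory zip from (name, content) pairs, in the given order. */
    static byte[] zipOf(String... namesAndContents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < namesAndContents.length; i += 2) {
                zip.putNextEntry(new ZipEntry(namesAndContents[i]));
                zip.write(namesAndContents[i + 1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] swapped = zipOf("xl/worksheets/sheet1.xml", "<worksheet/>",
                               SHARED_STRINGS, "<sst/>");
        System.out.println(process(new ByteArrayInputStream(swapped), "xl/worksheets/sheet1.xml"));
    }
}
```

With the usual order the sheet is streamed directly; with the swapped order it takes the temporary-file detour, which is the extra complexity the simpler alternative avoids.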