Closed flaushi closed 2 years ago
Most of the open issues involve the hardcoded V8 string length limit (a limit that is unaffected by max-old-space-size
and other tuning parameters) and will also hit limits in NodeJS and the CLI tool.
The CLI tool sets the option dense: true
in the XLSX.readFile
call:
var dense_wb = XLSX.readFile("test.xlsx", { dense: true });
It is an (currently) undocumented option and will change the worksheet type. As long as you stick to the API functions and avoid directly addressing cells like ws["A3"]
, then you should be able to set the same option. The actual structure of dense workbooks will change at some point, so it's best to stick to functions like sheet_add_aoa
to write data into the worksheet and sheet_to_json
to extract data from the worksheets.
.
More details below:
The current worksheet structure is a large object where each key is an address. Various V8 changes over the years degraded performance of this access pattern (performance has actually regressed). We've reported bugs to NodeJS and to Chromium, including https://bugs.chromium.org/p/v8/issues/detail?id=6696
dense
changes the structure to an array of arrays of cell objects.
.
Array of Arrays should have been the obvious choice. There were two early decisions that led to the current situation: "sparse worksheets" and "web workers".
"sparse worksheets": Imagine a worksheet that only sets A1 and XFD1048576. In the sparse structure, this creates an object with a "A1" key and a "XFD1048576" key. In the dense structure, to ensure the bottom cell could be stored, the worksheet is an array of length 1048576 and the last element is an array of length 16384. This just crashed IE in our testing :(
"web workers": Back in 2012 (It's been nearly 10 years since the first commit!) Web Workers and File API were relatively new. We had explored the idea of passing the binary data (be it a binary string or a Uint8Array) to the worker and passing back the worksheet object. We encountered issues trying to pass back the arrays in IE10 and ultimately concluded that it was safer to send simple objects in Worker messages.
.
IE is long gone, so why not flip the default? There are a few reasons:
"relevance": Photoshop and InDesign CC 2022 suffer the same problem.
"fallbacks": We looked into a migration using Proxy
(underlying dense worksheet with a proxy that lets you access cells by address) but proxies are still incredibly inefficient in 2022.
"change": Changing defaults is hard, especially given the size of the developer base. We've recently revisited all of the demos (https://docs.sheetjs.com/docs/demos/) and quite a few of them didn't work in recent testing because some project decided to abandon CJS or changed APIs in ways that required rewriting the demo from scratch.
Thanks a lot for this verbose response. I can confirm that the dense
parameter did solve my problem. I have not tested what my new limit is, but for the time being I am fine.
Actually my list of contracts is indeed dense, so this representation is ideal. The mentioned sheet with only two far distant non-empty cells is an edge case. One might think about setting dense to true by default.
Thanks!
There are numerous issues on the problem of loading large XLSX's, with partly contradicting details. My view of the problem:
XLSX.readFile(filename, options);
that just does not return then--max-old-space-size=10096
does not helpOpen questions: