SheetJS / sheetjs

📗 SheetJS Spreadsheet Data Toolkit -- New home https://git.sheetjs.com/SheetJS/sheetjs
https://sheetjs.com/
Apache License 2.0

XLSX.write() taking too long #77

Open irfanyounas opened 10 years ago

irfanyounas commented 10 years ago

Hi, I have tried writing 3 columns and N rows; below is the time taken (on my i3 machine):

N        Time (s)
10000    9
20000    35
30000    90
65000    453

We can see that as we increase the number of rows, the time taken by the write() function increases dramatically (it's not linear); it's taking too long for 65000 records. Is there a more efficient approach?
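
(For reference, a minimal timing harness along these lines, assuming wb is a workbook already built from the N rows:)

var t0 = Date.now();
var wbout = XLSX.write(wb, {bookType: 'xlsx', bookSST: true, type: 'binary'});
console.log('XLSX.write took ' + ((Date.now() - t0) / 1000) + 's for N = ' + N);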

SheetJSDev commented 10 years ago

I am not happy with the performance at all (granted, our first goal was to be correct). There are quite a few slow/inefficient operations that can be improved.

When I last investigated, the runtime was dominated by functions from the ZIP library (we are currently using https://github.com/Stuk/jszip), so I'm working on a new component right now (https://github.com/SheetJS/js-crc32 and https://github.com/SheetJS/js-adler32 are the first few pieces of the puzzle).
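
(For context, the js-crc32 API is tiny; a minimal sketch, assuming the published npm package name crc-32:)

var CRC32 = require('crc-32');
CRC32.str('SheetJS');                        // CRC-32 of a JS string (signed 32-bit integer)
CRC32.buf([83, 104, 101, 101, 116, 74, 83]); // CRC-32 of an array of bytes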

irfanyounas commented 10 years ago

Thanks, good work. I hope we will have an efficient solution soon.

SheetJSDev commented 10 years ago

@irfanyounas just so we are on the same page, can you share the script you used?

irfanyounas commented 10 years ago

I am using the script given below:

<script src="//cdnjs.cloudflare.com/ajax/libs/xlsx/0.7.6-g/xlsx.core.min.js"></script>

For creating xlsx file, I have followed the following example: http://sheetjs.com/demos/writexlsx.html

SheetJSDev commented 10 years ago

@irfanyounas how did you create a JS object with 65000 rows? Do you have numbers / text / dates? Are there missing cells (in the example, C2 is missing) or did you fill every cell?

SheetJSDev commented 10 years ago

@irfanyounas Here's an example writing numbers in a 3x100001 file: http://sheetjs.com/demos/100001.html

Timestamps are printed at three points during the run.

On Chrome 35 it takes 4 seconds to process a much larger test case than the one you showed.

irfanyounas commented 10 years ago

Thanks. I have 3 'text' (string) columns; the second column is sometimes empty. I have used the following script:

function writeToXLSX(data) {
    var ws_name = "AllMembers", wb, ws, wbout;
    wb = new Workbook();
    ws = sheet_from_array_of_json(data, ["address", "name", "status"]);
    wb.SheetNames.push(ws_name);
    wb.Sheets[ws_name] = ws;
    wbout = XLSX.write(wb, {bookType:'xlsx', bookSST:true, type:'binary'});
    postMessage({t:"data", d:wbout});
}

function Workbook() {
    if(!(this instanceof Workbook)) return new Workbook();
    this.SheetNames = [];
    this.Sheets = {};
}

function stringToArrayBuffer(s) {
    var buf = new ArrayBuffer(s.length);
    var view = new Uint8Array(buf);
    for(var i = 0; i != s.length; ++i) view[i] = s.charCodeAt(i) & 0xFF;
    return buf;
}

function sheet_from_array_of_json(data, columnNames) {
    var ws = {}, jsonObj, cell, cell_ref;
    var range = {s: {c:0, r:0}, e: {c:0, r:0}};
    for(var R = 0; R != data.length; ++R) {
        jsonObj = data[R];
        if(range.e.r < R) range.e.r = R;
        for(var C = 0; C != columnNames.length; ++C) {
            if(range.e.c < C) range.e.c = C;

            /* skip missing and null values */
            if(!jsonObj.hasOwnProperty(columnNames[C])) continue;
            cell = {v: jsonObj[columnNames[C]]};
            if(cell.v == null) continue;

            cell_ref = XLSX.utils.encode_cell({c:C, r:R});

            if(typeof cell.v === 'number') cell.t = 'n';
            else if(typeof cell.v === 'boolean') cell.t = 'b';
            else if(cell.v instanceof Date) {
                cell.t = 'n'; cell.z = XLSX.SSF._table[14];
                cell.v = datenum(cell.v);
            }
            else cell.t = 's';

            ws[cell_ref] = cell;
        }
    }
    ws['!ref'] = XLSX.utils.encode_range(range);
    return ws;
}
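
The datenum helper referenced above is not shown in the snippet; a sketch based on the version in the SheetJS write demos:

function datenum(v, date1904) {
    if(date1904) v += 1462;
    var epoch = Date.parse(v);
    /* convert a JS Date to an Excel serial date (days since 1899-12-30) */
    return (epoch - new Date(Date.UTC(1899, 11, 30))) / (24 * 60 * 60 * 1000);
}
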
SheetJSDev commented 10 years ago

@irfanyounas http://sheetjs.com/demos/write_num.html?n=1000000 is a more general demo (you can specify the number of rows via the n parameter). The runtime appears to be dominated by generating the source array-of-arrays (not by the conversion to a workbook, and not by generating the XLSX file).

Here are the times I see using the aforementioned page:

Number of Rows   Time
100K             4 sec
200K             8 sec
300K             15 sec
400K             18 sec
500K             Chrome crashes

SheetJSDev commented 10 years ago

@irfanyounas just saw your last message -- where are you measuring the times? Your code sample doesn't show it. Also, can you put up the main page? (It looks like you shared only the worker logic.)

irfanyounas commented 10 years ago

Thanks, I am looking into it; maybe the problem is on my side. Thanks for your help.

SheetJSDev commented 10 years ago

If you are using web workers, I recommend also timing the process in the main thread (avoiding web workers altogether). FWIW, I found in some cases that transferring data to and from the worker is significantly slower than just doing the work in the main thread.

irfanyounas commented 10 years ago

I have tested it using different kinds of data:

Scenario 1: I create a list of 20,000 records, where each record is a dict like: {"status": "active", "name": "", "address": "sms:+45705846525"}. In this scenario all the records are the same (copies of the dict shown above).

Scenario 2: A list of 20,000 records, where each record is a dict, but each record is different.

The runtime for Scenario 1 is 4 sec, while for Scenario 2 it is around 14 sec.

My guess is that the zip module takes longer because the records differ (not the same text) in Scenario 2. Can you please try this scenario?

SheetJSDev commented 10 years ago

Here's a good example: http://sheetjs.com/demos/write_text.html?n=8000. The length of the data in the B column increases as you go down. I do see a massive performance issue.

irfanyounas commented 10 years ago

Yes. Even if we keep the length of the data in the B column fixed, the performance issue will still be there if the data is different; e.g., in my case the data looks like 'sms:+45704055659', 'sms:+457683738392', 'sms:+457049821521', and so on.

SheetJSDev commented 10 years ago

I did some more testing by adding some timing statements (simple console.log(new Date().getTime()) all over the place) and about 90% of the runtime is in the zip.generate operation, which suggests that it should be replaced.

irfanyounas commented 10 years ago

Yes, that's right. Maybe we can try different zip modules and compare the performance. Thanks.

Mithgol commented 10 years ago

@SheetJSDev Not to divert you from the main topic, but still: it's easier to type console.time and console.timeEnd instead of explicit console.log(new Date().getTime()).
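
For example, a minimal sketch of the pattern applied to the zip step discussed above:

console.time("zip");                      // start a named timer
var out = zip.generate({type: "string"}); // the operation being measured (jszip 2.x API)
console.timeEnd("zip");                   // logs something like "zip: 1234ms"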

SheetJSDev commented 10 years ago

@Mithgol this is really neat! I think the MDN article needs to be updated: https://developer.mozilla.org/en-US/docs/Web/API/console.time claims that IE11 supports it, but https://developer.mozilla.org/en-US/docs/Web/API/console.timeEnd has a ? for IE

Mithgol commented 10 years ago

MDN is a wiki. If you think that it needs to be updated, update it.

(I would have updated it myself, but for some reason I'm on Windows XP currently and IE11 is not installable on such an old OS.)

SheetJSDev commented 10 years ago

@Mithgol updated :) IE11 actually shows the elapsed time down to the 100-nsec level (milliseconds + 4 decimal places)

sandroboehme commented 10 years ago

When I export a big number of rows in Firefox, I get a dialog saying something like "A script is broken or doesn't respond. [Debug Script] [Stop Script] [Continue Script]".

No matter how fast the export gets, it will always be possible to trigger this message by exporting even more data.

If Firefox gets some time slices for processing, by adding window.setTimeout(function(){...}, milliseconds);, that should solve the issue. I know it adds time to the export, but as a user I understand that exporting this much data is a long-running task. If I show the user a message saying that an export is in progress, they know why the UI is blocked. It would be even better if it were possible to cancel a running export.
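
A minimal sketch of that idea (buildRow and done are hypothetical placeholders for the per-row work and the completion callback):

function processInChunks(data, chunkSize, done) {
    var i = 0;
    (function next() {
        var end = Math.min(i + chunkSize, data.length);
        for(; i < end; ++i) buildRow(data[i]);   // hypothetical per-row work
        if(i < data.length) setTimeout(next, 0); // yield so the UI can repaint
        else done();
    })();
}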

SheetJSDev commented 10 years ago

@sandroboehme you can perform the write process in a Web Worker (which wouldn't lock up the UI). For example, http://oss.sheetjs.com/js-xlsx/ (hosted from the gh-pages branch) uses a web worker to read data.

The main reader demo (http://oss.sheetjs.com/, source https://github.com/SheetJS/SheetJS.github.io) shows a spinner when the read process may take a while.
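
A minimal write-in-a-worker sketch (xlsxworker.js is hypothetical; it would run writeToXLSX from earlier in this thread and post the binary string back):

var worker = new Worker("xlsxworker.js"); // hypothetical worker script
worker.onmessage = function(e) {
    if(e.data.t === "data") {
        /* convert the binary string to an ArrayBuffer and hand it to a Blob for download */
        var blob = new Blob([stringToArrayBuffer(e.data.d)], {type: "application/octet-stream"});
        var url = URL.createObjectURL(blob); // assign to a download link's href
    }
};
worker.postMessage({t: "write", data: rows}); // rows: the records to export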

irfanyounas commented 10 years ago

Regarding XLSX write: I have debugged the code and found out that inside function write_ws_xml_cell(), the following line is creating the performance issue:

v = writetag('v', ''+get_sst_id(opts.Strings, cell.v));

Specifically the following function : get_sst_id(opts.Strings, cell.v)

SheetJSDev commented 10 years ago

@irfanyounas That function performs a scan in the shared string table. That behavior can be disabled by setting the option bookSST to false.

I agree that a linear scan is horrible here, but from my testing it contributes less than 5% of the runtime at 20K rows. It does, however, explain the superlinear runtime.
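
For example (sketch):

/* inline strings: skip the shared string table scan entirely */
var wbout = XLSX.write(wb, {bookType: 'xlsx', bookSST: false, type: 'binary'});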

irfanyounas commented 10 years ago

If the text data is the same in all rows then it's true that it will be fast, but if you have unique data in each row then the performance will be poor. For example, you can test it by creating data using the following code:

var data1 = [];
while(data1.length < 20000) {
    var randomnumber = Math.floor(Math.random() * (799999999 - 700000000 + 1)) + 700000000;
    var sms = "sms:+45" + randomnumber;
    data1[data1.length] = ["active", "abc", sms];
}

irfanyounas commented 10 years ago

Thanks. I was wondering: what if we don't perform a scan in the shared string table? Does it affect the compressed file size?

SheetJSDev commented 10 years ago

@irfanyounas The original problem is that some readers (notably iOS Numbers) had issues with inline strings. The SST was the only way for strings to be displayed. Testing it now, it appears that the newest version of Numbers does support inline strings, so it might make sense to recommend inline strings.

Adding the shared string table is larger for files that don't have repeated elements (for example, with tables that have text labels and numeric bodies) but smaller for files with lots of repeated strings (for example, a table where one column describes a category).

The mystery here (which explains the difference in our understanding) is that as the size of the data increases, the zip time (as a percentage of the total write process) gets progressively worse. Basically, there are two effects: the effect of the linear scan and the effect of having to write a new file in the zip. To see this, add a console.time("zip") just before the switch statement in write_zip and a console.timeEnd("zip") in your script. I generate data this way:

var w = 2048; // <-- width of each string
var MAXLEN = 10000; // <-- number of rows

var o = new Array(MAXLEN + w);
for(var i = 0; i != MAXLEN + w; ++i) o[i] = String.fromCharCode(i); // safe for values before 0xd000 = 53248
var data = new Array(MAXLEN);
for(var dd = 0; dd != MAXLEN; ++dd) data[dd] = ["foo", o.slice(dd, dd+w).join(""), dd+2];

This ensures v8 doesn't perform some magic and ensures we have distinct strings.

For small w, the zip operations take less than half the total runtime, but as w is increased the zip operations almost completely dominate the runtime (in both cases). According to the v8 profile (you can get this in node by running with the --prof flag and then using the tick processor), the function that takes the most time is the CRC-32 function.
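
For reference, the node profiling workflow looks roughly like this (recent node versions bundle the tick processor behind the --prof-process flag; at the time it was a separate script in the v8 repo):

node --prof write-test.js                          # writes isolate-*.log
node --prof-process isolate-*.log > processed.txt  # human-readable profile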

irfanyounas commented 10 years ago

Thanks, that's right. You are doing really good work.

turbobuilt commented 9 years ago

I'm having this issue too. Is there any form of the xls spec that doesn't require zipping? That might be a workaround until the zip issue is taken care of.

turbobuilt commented 8 years ago

The jszip api offers the ability to not compress the zip file. Would that help?

https://stuk.github.io/jszip/documentation/api_jszip/generate.html
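
(For what it's worth, current releases of the library expose this directly: zip compression is opt-in via the compression option, so entries are stored uncompressed by default. Sketch:)

/* stored (uncompressed) entries are the default; compression: true trades CPU for file size */
XLSX.writeFile(wb, "out.xlsx", {compression: true});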

kumarr10 commented 2 years ago

31 columns and 300K rows from an API: json_to_sheet / writeFile breaks. Please see if this can be addressed.

SheetJSDev commented 2 years ago

@kumarr10 live demo https://jsfiddle.net/xgr4sbk1/

var NR = 300000, NC = 31;
console.time("prep");
var ws = XLSX.utils.aoa_to_sheet(Array.from({length:NR}, (_,i) => Array.from({length: NC}, (_,j) => j == 0 ? `Row${i+1}` : i+j)), {dense: true});
var wb = XLSX.utils.book_new(); XLSX.utils.book_append_sheet(wb, ws, "Sheet1");
console.timeEnd("prep");
console.time("write");
XLSX.writeFile(wb, "out.xlsx");
console.timeEnd("write");

Using version 0.18.3 in Chrome on a 2018 MacBook Pro, it takes ~13 seconds to generate a fully populated 300K x 31 worksheet. You can play with the NR / NC variables. The new limit is the string length cap when generating the worksheet XML.

anandmuthu1978 commented 2 years ago

Hi All,

I have struggled to export large amounts of data to XLSX. The data is 100 columns x 100K rows of JSON data. The browser either freezes or you get an array length error from hitting the string limitation of V8 in Chrome.

Following is how I resolved this; hopefully it sheds some light for those having similar issues.

  1. First of all, update the version used: 0.18.5 includes some string-conversion performance improvements.
  2. I used json_to_sheet initially to convert to a worksheet. This always fails for large data, most likely hitting a limitation in Chrome or V8; I was unable to solve it. No errors were raised in Chrome and the browser tab goes into an infinite loop. This was resolved by changing the JSON object mapping to direct array mapping. Example: data = [{a:111, b:333, c:444}, {a:111, b:333, c:444}] converts to data = [[111,333,444], [111,333,444]].
  3. Hence I converted from JSON to an array of arrays. Using aoa_to_sheet worked great. Use the options dense: true and raw: true. A 100K x 100 array changed to a worksheet in about 10s.
  4. The next problem was writeFile, which raises a RangeError in Chrome. Googling this tells me that we hit the string limit in Chrome: in writeFile, the whole workbook is converted to a string (via a string join) before downloading.
  5. I managed to download successfully by enabling the following options: type: 'binary', bookSST: true, compression: true.
  6. The most important thing here is bookSST. Basically, this is like indexing of repeated data in the cells. One of the reasons we hit the string error is that the XML sets blank cells to an xml:space="preserve" string, so each empty cell has an XML string of 20 to 30 characters associated with it. By using bookSST, the string is indexed and just a reference key is populated, which reduces the total string length.
  7. compression: true reduces the file size.
  8. Finally, I managed to download the file. (A consolidated sketch follows below.)

These are all based on my understanding of the code and various help from the developers when I raised issues with them. They've been a great help. Hopefully this will help others encountering the same problem.
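
A consolidated sketch of the steps above (assuming SheetJS 0.18.x; rows stands in for the parsed JSON records, all sharing the same keys):

var header = Object.keys(rows[0]);
var aoa = [header].concat(rows.map(function(r) {
    return header.map(function(k) { return r[k]; });
}));
var ws = XLSX.utils.aoa_to_sheet(aoa, {dense: true, raw: true});
var wb = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(wb, ws, "Sheet1");
XLSX.writeFile(wb, "out.xlsx", {bookSST: true, compression: true});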

Thanks. Anand Muthu

dtslvr commented 2 years ago

"bookSST": true, "compression": true

Thanks for your explanations @anandmuthu1978!

SheetJSDev commented 2 years ago

Thanks for sharing @anandmuthu1978! bookSST: true is helpful for files containing a large number of repeated strings (text labels for each row, for example) and compression: true obviously reduces the file size, but the tradeoff is increased runtime. Using type: "array" generates a Uint8Array rather than a string (which should be more efficient).
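
A sketch of that suggestion (browser-side; the anchor-element download is illustrative):

var out = XLSX.write(wb, {bookType: "xlsx", type: "array", compression: true}); // bytes, not a string
var blob = new Blob([out], {type: "application/octet-stream"});
var a = document.createElement("a");
a.href = URL.createObjectURL(blob);
a.download = "out.xlsx";
a.click();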

devupmaster commented 2 years ago

@anandmuthu1978 hello

Could you give me an example of how you managed to write a large number of lines?

With my code, I can save up to 60k lines. With 70k, I can't understand why it saves only 15 lines and doesn't show any error.

try {
  const ws = xlsx.utils.json_to_sheet(data)
  const wb = xlsx.utils.book_new()

  xlsx.utils.book_append_sheet(wb, ws)

  // note: writeFileSync is synchronous, so the await here is a no-op
  await xlsx.writeFileSync(wb, fileName)
} catch (e) {
  console.log(e)
}

Thanks.

anandmuthu1978 commented 2 years ago

@goproclube hi. In my case I converted the data from JSON to an array and used aoa_to_sheet(data). Most likely in your case 70k is already hitting the memory limit. I noticed json_to_sheet takes a lot of memory, so I converted to the array format. You will lose the key-to-value mapping, but as long as you want all the columns it doesn't really matter. Example: data = [{a:111, b:333, c:444}, {a:111, b:333, c:444}] converts to data = [[111,333,444], [111,333,444]].

farideliyev commented 2 years ago

@SheetJSDev when writing big files to .xlsx, is it possible to somehow append worksheets together? Imagine I have 700k rows and CHUNK_SIZE = 350k: I create a worksheet, append it to a workbook, pass it to XLSX.write, and get a typed array back. Then I repeat the same steps with the next 350k piece of data. At the end I have two typed arrays. Is it possible to merge the worksheets so they end up in one file? If so, we could pass chunked data to xlsx and the zipping manipulations would not take so long.

chy869 commented 2 years ago

Thanks, anandmuthu1978.

I faced the performance issue using json_to_sheet.

The problem was resolved by using aoa_to_sheet.

Sample code below for reference:

const wb = xlsx.utils.book_new();
const outputData = records.map(Object.values);
outputData.unshift(Object.keys(records[0])); // header row from the first record's keys
const ws = xlsx.utils.aoa_to_sheet(outputData, {dense: true});
xlsx.utils.book_append_sheet(wb, ws);
xlsx.writeFile(wb, destination, {compression: true});

iceshu commented 10 months ago

// ERROR: the typings do not include a 'dense' option
const ws = utils.aoa_to_sheet([], { dense: true });

// the typing file is below:
aoa_to_sheet<T>(data: T[][], opts?: AOA2SheetOpts): WorkSheet;

export interface AOA2SheetOpts extends CommonOptions, DateNFOption {
    /**
     * Create cell objects for stub cells
     * @default false
     */
    sheetStubs?: boolean;
}

How can I resolve this?