OfficeDev / office-js

A repo and NPM package for Office.js, corresponding to a copy of what gets published to the official "evergreen" Office.js CDN, at https://appsforoffice.microsoft.com/lib/1/hosted/office.js.
https://learn.microsoft.com/javascript/api/overview

Excel - Slow performance when writing data into Table #12

Closed LanceEa closed 5 years ago

LanceEa commented 7 years ago

Summary

I'm creating a table in Excel using the new Promise-based APIs and inserting data into it, and the performance isn't what I would have expected.

const t0 = performance.now();

// add new table, set name and headers
myTable = sheet.tables.add(tableRange, true);
myTable.name = tableName;
myTable.getHeaderRowRange().values = [headers];

// add data if it exists
if (data.length > 0) {
    myTable.rows.add(null, data);
}

// activate sheet
sheet.activate();

await context.sync();
const t1 = performance.now();
console.log(`Took ${(t1 - t0).toFixed(4)} milliseconds to generate: ${tableName}`);

Expected Behavior

I would expect 2,000 records to be fast, taking only a second or so, and that I would be able to insert significantly more data. We are in the early phases of developing our add-in, so we are using small datasets, but in the future we want to support hundreds of thousands of records. I recognize that to insert that many records we will probably need to chunk the data and insert it in batches, but I would think those batches could be bigger than 2,000 records.

Current Behavior

When I create a table and insert 100 records it is snappy and feels fast, less than 1 second. However, when I load 2,000 records and 20 columns it is much slower: at worst I have seen 20 seconds, at best around 5 seconds.

Steps to Reproduce, or Live Example

See the code snippet above. It is basically the same as what is provided in Script Lab, but the data is 2,000 records x 20 columns.

Context

As mentioned above, I would expect 2,000 records x 20 columns to be faster than it is, because I want to allow users to download large datasets (100,000+ records), assuming we can provide a good user experience.

Your Environment

Useful logs

I'm adding some Application Insights events right now and will update the issue with performance observations.

Questions:

  1. Is my code correct, or is there a better way to be doing this?
  2. Are my expectations in line with what I should be expecting?
  3. What is a good batch size for inserting data into a table? Measured in records, columns, or number of cells?
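(On question 3, a row-count batching sketch, purely for illustration: `chunkRows` is a hypothetical helper, and the Office.js loop follows the pattern in the snippet above, syncing once per batch. No particular batch size is implied; the right number depends on the host and column count.)

```javascript
// Split a 2D data array into batches of at most `size` rows.
// Batching by row count is a rough proxy; total cell count
// (rows x columns) is what actually drives payload size.
function chunkRows(data, size) {
  const chunks = [];
  for (let i = 0; i < data.length; i += size) {
    chunks.push(data.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical usage inside Excel.run: one sync() per batch so
// each round trip stays small (not verified against any host's limits).
async function writeInBatches(context, table, data, batchSize) {
  for (const batch of chunkRows(data, batchSize)) {
    table.rows.add(null /* append at the end */, batch);
    await context.sync();
  }
}
```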
LanceEa commented 7 years ago

Updates

Tested on Windows 10

Attempted 50k insert

Chunking inserts into Table

My next steps are:

hongbo-miao commented 7 years ago

Hi @LanceEa thanks for the info. I am looking into this. I will try to reproduce first. If you can provide me a repo, that will help me save some time.

LanceEa commented 7 years ago

@Hongbo-Miao - Thanks!

Here is a Gist I created that can be used with Script Lab. I mocked data for 17 columns and 2000 records. It takes about 12 seconds which is in line with what I was seeing in my application.

https://gist.github.com/LanceEa/da13ce3f425941e43a6472c957765db6

(Screenshots of the Script Lab timings attached.)
LanceEa commented 7 years ago

@Hongbo-Miao - Also, I was able to test an alternative method today:

const t0 = performance.now();

// add new table, set name and headers
myTable = sheet.tables.add(tableRange, true);
myTable.name = tableName;
myTable.getHeaderRowRange().values = [headers];

// Adding data through the RowCollection seems slow,
// but writing data into a Range within the table is much faster
myTable.getDataBodyRange().values = data;

// activate sheet
sheet.activate();

await context.sync();
const t1 = performance.now();
console.log(`Took ${(t1 - t0).toFixed(4)} milliseconds to generate: ${tableName}`);

Doing it this way I was able to scale it all the way to 100k records with 20 columns and it took about 1 minute 30 seconds. 50k records was about 40-50 seconds.

This is better since it doesn't crash Excel, but I still feel that is slow... Are the timings I'm finding within Microsoft's expectations for the Office.js APIs? Is this to be expected?

hongbo-miao commented 7 years ago

Hi @LanceEa the gist is not complete. It ends at line 26854. Maybe create a repo on GitHub?


LanceEa commented 7 years ago

Sorry about that...looks like ScriptLab or Gist doesn't like such a big file :(.

Here is a quick sample I threw together. https://github.com/LanceEa/Excel-Table-Perf

Steps to run it:

  1. Clone the repository
  2. Run npm install from the command line
  3. Serve the app with ng serve --ssl

Example 1 is the way Script Lab and the Office.js docs show how to create and add data to a table.

Example 2 is the approach I experimented with today, which gets "ok" results. Still not great, but it can create a large table without Excel crashing.

All the logic can be found in app.component.ts, and the mocked data is in data.ts.

Sorry if you are already familiar with the process; I didn't want to assume you know Angular.

Let me know if you have any other questions or need more information.

Zlatkovsky commented 7 years ago

@lanceea, on a sidenote: feel free to file the bug re. Script Lab on the Script Lab repo. If you can include an attachment of the text file to repro the issue, that would be great.

Thanks!

LanceEa commented 7 years ago

@Zlatkovsky - No problem. I took a second look at it today and I do not think there is a bug, but let me confirm with @Hongbo-Miao. If there is a bug, I will file it in the Script Lab repo.

@Hongbo-Miao - I took a second look at the Gist to see what the issue could be. Your screenshot appears to be of the file on the GitHub page, correct? GitHub truncates large files when displaying them; I see the same truncation at the bottom of the page that you are seeing, but the top of the page notes that the file has been truncated.

Regardless, the repo I posted has the same code, but I wanted to rule out a bug in Script Lab 😃

hongbo-miao commented 7 years ago

Thanks, I reproduced it. I will have a look at the code and see what we can improve.

LanceEa commented 6 years ago

@Hongbo-Miao - I just wanted to check in to see if there was anything else you needed from me.

This really is a show-stopper for us, and my team needs to decide whether we stick with the Office Add-in model or go back to VSTO.

Is this performance in line with what you would expect? Is there something wrong with my code and how I'm loading data into tables? Or is there potentially a bug that can be addressed here?

Any information on this would be helpful and appreciated.

Thanks!

hongbo-miao commented 6 years ago

@LanceEa Personally, I don't think your code has an issue. It is a performance issue on our side. If you can let us know what you are trying to achieve, maybe we can figure out a way. I will leave contact info here tomorrow, or you can direct-message me through my Twitter and I will send you back my email.

LanceEa commented 6 years ago

@Hongbo-Miao - Sounds good. I sent you a follow request on Twitter; once we can DM, I can share more details on what we are trying to accomplish.

sameera commented 6 years ago

@Hongbo-Miao Has there been any resolution on this performance issue? We have a use case where we'd need to write tens of thousands of rows into a table, but even at around 500 records we wait around 2 minutes just for the append to complete. We tried both the table.rows.add method and simply writing to the range. While the latter seems slightly better, the operation still takes far too long. We have formulas in the rows we are writing, which could be adding to the problem. Any help on overcoming this is appreciated.

tong-1324 commented 6 years ago

Hi @sameera, my name is Tong. I am a software engineer on the Office Extension Platform team, and I am currently investigating the performance issue you mentioned above.

Here are some quick workarounds to improve the performance:

  1. Use a Range object to write data rather than inserting through the Table object. Currently there is significant overhead in maintaining the table format when inserting/appending data into a table. We are working on improving table-insertion performance.
  2. Call application.suspendApiCalculationUntilNextSync() before large dataset reads/writes. We recommend this API to temporarily suspend automatic calculation during large read/write operations, especially when the range contains a lot of formulas.

We've already made some progress on our investigation and have addressed some of the key issues we are facing. I really appreciate your feedback. Please feel free to let me know if you have any other questions.
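The two workarounds above might be combined into a sketch like the following (untested here; `tableRange`, `headers`, and `data` are assumed inputs, and the table range is assumed to already span the header plus data rows so the body range matches the data's dimensions):

```javascript
// Sketch of the two workarounds: suspend calculation for the batch,
// then write through a Range rather than table.rows.add.
async function writeViaRange(tableRange, headers, data) {
  await Excel.run(async (context) => {
    // 1. Suspend automatic recalculation until the next sync().
    context.workbook.application.suspendApiCalculationUntilNextSync();

    const sheet = context.workbook.worksheets.getActiveWorksheet();
    const table = sheet.tables.add(tableRange, true /* hasHeaders */);
    table.getHeaderRowRange().values = [headers];

    // 2. Write into the table's body range instead of rows.add().
    table.getDataBodyRange().values = data;

    await context.sync(); // calculation resumes after this sync
  });
}
```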

tong-1324 commented 6 years ago

Hi @LanceEa, my name is Tong. I am a software engineer on the Office Extension Platform team, and I am currently investigating the performance issue you mentioned above.

According to my colleague Sudhi, you have already tried the quick workarounds I introduced to the other customer above. I know the API performance still has not reached your expectations. I want to let you know that we are actively working on it and have already made some progress. I will let you know as soon as we have an update. I really appreciate your feedback and help!

sameera commented 6 years ago

Hi @tong-1324, thank you for your response. We already use application.suspendApiCalculationUntilNextSync but have not seen much improvement with it. One reason could be that we are writing the data in smaller chunks of 25-100 rows; maybe because of that the impact of the suspension isn't noticeable?

As I also mentioned, we tried writing to the Range object. It wasn't significantly better than Table.rows.add.

We are currently trying out another alternative: using the TableBinding.addRowsAsync method. This is working out very well for us at the moment (each append takes less than 1 second). We preferred this over writing to range.formulas because we also have a use case to place different formulas on different rows of the same column. If we do this inside a table using rows.add or range.formulas, Excel auto-copies the last formula (if I observed correctly) to the entire column. addRowsAsync doesn't seem to have this (undesirable) behavior. It is just a bit weird to be using the old-style async API.

sameera commented 6 years ago

I take back what I said about addRowsAsync not being affected by the Calculated Columns behavior. It apparently behaves the same. While unrelated to the original issue, @tong-1324, is there no way to turn off Calculated Columns? Perhaps if there were a way to turn off all such table behavior until we are done, that would benefit write performance as well.

LanceEa commented 6 years ago

@tong-1324 - Thanks! Let me know if there is anything you need me to test.

tong-1324 commented 6 years ago

@sameera Thanks for your feedback! Good to know that you have a temporary workaround for the perf issue.

To follow up on the questions you mention:

  1. In our performance test results, writing tens of thousands of rows to a Range should take just one or two seconds. Writing the data in smaller chunks could be the reason in your case. There are some known performance issues when the data is broken into multiple chunks, and we already have a solution for that. Do you mind sharing a snippet of your data chunking so that I can provide more details?

  2. To my understanding, TableBinding.addRowsAsync should be very similar to writing data into Table.getDataBodyRange().values. I will investigate what causes the performance difference between these two methods.

  3. For the formula copy within a column, I think you are right, and that is a by-design feature of Excel tables. I don't know of any way to turn it off in the UI currently. Do you think keeping the workbook in manual calculation mode would solve the problem for you?
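On point 3, a manual-calculation wrapper might look like the following sketch. This is an illustration, not code from the thread: setting `calculationMode` requires a build that supports it, and `work` is a hypothetical caller-supplied callback.

```javascript
// Sketch: switch the workbook to manual calculation around a large
// write, then restore automatic mode and recalculate once.
async function withManualCalculation(work) {
  await Excel.run(async (context) => {
    const app = context.workbook.application;
    app.calculationMode = Excel.CalculationMode.manual;
    await context.sync();

    await work(context); // caller performs the large write here

    app.calculationMode = Excel.CalculationMode.automatic;
    app.calculate(Excel.CalculationType.full); // recalc once at the end
    await context.sync();
  });
}
```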

sameera commented 6 years ago

@tong-1324

  1. Yes, it's most probably related to chunking. We basically get the data from a REST API and write what we receive to the sheet. More on this below.

  2. You can do that from the AutoCorrect options in the UI. Also, if you enter a formula manually, Excel doesn't apply it across the column. While the feature is by design, I cannot help but feel there's a bug in the way it behaves through the API. Can you please take a look at https://github.com/OfficeDev/office-js-docs/issues/1212 and the screengrab linked there, and share your thoughts on that thread?

As for the code snippets and some perf data:

Range.formulas Approach:
        appendTable = <T extends TableReference>(table: T, data: any[][]): Common.Thennable<T> => {
            if (!data.length) return OfficeExtension.Promise.resolve(table);

            let perfCounter = Date.now();
            return Excel.run(ctx => {
                let xlTable = ctx.workbook.tables.getItem(table.id);
                let lastRow = xlTable.getDataBodyRange().getLastRow();
                lastRow.load("rowIndex");
                return ctx.sync()
                    .then(() => {
                        let startRow = lastRow.rowIndex > 2 ? lastRow.rowIndex + 1 : 2;
                        let targetRangeAddr = RangeRef.from(startRow, 0, startRow + data.length - 1, data[0].length - 1).getRelativePath();
                        let targetRange = xlTable.worksheet.getRange(targetRangeAddr);
                        targetRange.formulas = data;
                        return ctx.sync();
                    }).then(() => {
                        let span = (Date.now() - perfCounter) / 1000;
                        this.log.info("Time taken for append (s): " + span);
                        return table;
                    });
            }).catch(err => {
                return this.rejectWithFormattedError(err);
            });
        }

addRowsAsync Approach:

        appendTable = <T extends TableReference>(table: T, data: any[][]): Common.Thennable<T> => {
            if (!data.length) return OfficeExtension.Promise.resolve(table);

            let perfCounter = Date.now();
            let d = this.$q.defer<T>();
            Office.context.document.bindings.getByIdAsync(table.key, null, findBindingResult => {
                if (findBindingResult.status === Office.AsyncResultStatus.Succeeded && findBindingResult.value) {
                    let binding: Office.TableBinding = findBindingResult.value;
                    binding.addRowsAsync(
                        data,
                        { coercionType: Office.CoercionType.Table }, 
                        addRowsResult => {
                            if (addRowsResult.status === Office.AsyncResultStatus.Succeeded) {
                                let span = (Date.now() - perfCounter) / 1000;
                                this.log.info("Time taken for append (s): " + span);
                                d.resolve(table);
                            } else {
                                d.reject(addRowsResult.error);
                            }
                        }
                    );
                } else {
                    d.reject(findBindingResult.error);
                }
            });
            return d.promise
        };

Writing 2,627 rows in chunks. The chunks are sized differently depending on the data coming from the REST API, so comparing chunk to chunk on the same approach may not be useful. The times logged, in seconds, for selected chunks:

| Chunk #  | Range approach (s) | Async approach (s) |
|----------|--------------------|--------------------|
| batch 1  | 5.46               | 0.192              |
| batch 2  | 12.844             | 0.191              |
| batch 12 | 126.96             | 0.924              |
| batch 14 | 134.6              | 1.05               |
| batch 15 | 160.57             | 1.12               |
| batch 19 | 256.17             | 1.359              |
| batch 20 | 278.79             | 1.436              |

Hope this gives you enough information to troubleshoot

tong-1324 commented 6 years ago

@sameera

Sorry for the late reply. I would like to let you know that we have made some great progress on the performance issue over the past two weeks.

  1. We optimized some critical parts of our API pipeline, which saves more than 80% of the time on average for each range-write operation. We also optimized the way we manage temporary ranges, which should give you a better experience when chunking data. We are still in the middle of testing and verifying our changes. Most likely we will finalize these changes in the December fork, which will be available to Insider Fast in January 2018. I will let you know once we have confirmation.

  2. The way you do the data chunking looks good. However, even with all the changes mentioned above, I would still recommend gathering all the data first and writing one large range at once, if possible. This saves a lot of network/API overhead and will make sure you get the best experience.

  3. Once I finish the current work on ranges, I will start investigating the issues with the Table object, such as the slow performance of table.add() and the performance difference with addRowsAsync(). I will also see whether there is any way to turn off the calculated-columns option and whether that helps performance.

Thank you very much for your feedback and help!

tong-1324 commented 6 years ago

@LanceEa I would like to give you an update on our recent performance work. We optimized some critical parts of our API pipeline, which saves more than 80% of the time on average for each range-write operation. We also optimized the way we manage temporary ranges, which should give you a better experience when chunking data. We are still in the middle of testing and verifying our changes. Most likely we will finalize these changes in the December fork, which will be available to Insider Fast in January 2018. I will let you know once we have confirmation.

We are also working on creating a public dataset and benchmark script sample, so that we can better compare the performance experience on the customers' side with our lab environment, and have a better way to quantitatively describe our improvements.

malay2012 commented 6 years ago

Hi Tong,

I was trying to write data using Office.select('bindings#' + namedRange, function (callback) {}) and setDataAsync(). I have 5,000 rows across 30 columns, plus 10+ formula-based columns. It takes more than 1 minute to write the data into three different sheets. Is writing the formulas into Excel causing the perf issue?

Please suggest!

Thanks!

LanceEa commented 6 years ago

@tong-1324 - Thanks for the update. Let me know once it is available and I will give it a try!

Danwakeem commented 6 years ago

Has anyone else tried this in Excel online?

I have a similar situation where I am trying to load 60,000+ rows into an Excel spreadsheet. When I run it in the Excel client for Mac it works all right, but when I run it online it doesn't insert the data into the table at all.

LanceEa commented 6 years ago

@tong-1324 @Hongbo-Miao - I just wanted to check-in on this to see what the status of the fixes are? Have they been pushed into the insider-builds?

hongbo-miao commented 6 years ago

I have given the update in the email. Please reply here with more details, @tong-1324, when you are back.

Zlatkovsky commented 6 years ago

@LanceEa & @Danwakeem & @malay2012: In the meantime, I've found a workaround that seems to work quite well: removing the original table, writing in the values, and then re-creating the table on top. The only issue I can think of is that inter-column formulas within the table might cause havoc. But otherwise, please give it a try.

You can import this snippet into Script Lab, and then see both the repro behavior as well as the workaround: https://gist.github.com/Zlatkovsky/d36d977668afde97fd567746dffa5c2b

The salient bit of code is:

        await Excel.run(async (context) => {
            const sheet = context.workbook.worksheets.getActiveWorksheet();
            const table = sheet.tables.getItem(TABLE_NAME);
            const rangeToWriteTo = table.getDataBodyRange()
                .getCell(0, 0).getResizedRange(ROWS - 1, COLUMNS - 1);
            const fullTableRange = table.getRange().getCell(0, 0).getBoundingRect(rangeToWriteTo);
            table.convertToRange();
            rangeToWriteTo.values = createArray();
            sheet.tables.add(fullTableRange, true /*hasHeaders*/);
        });   

We'll keep you posted once you no longer need the workaround.

LanceEa commented 6 years ago

@Zlatkovsky - Thanks for the snippet. That is actually how I'm doing it currently. It definitely is faster than adding/resetting the values of the table. But, once you start doing larger datasets it gets really slow. I have my add-in limited to 5,000 records for my early testers until we can test with the new fixes.

sameera commented 6 years ago

@Zlatkovsky I can also confirm @LanceEa's observation. We've been doing the same in our product for the last year. We've seen that range <-> table conversion can take up to 3 minutes when the table/range spans 20,000-30,000 rows. (Our tables typically contain a few formula columns as well, which could also contribute to the latency.) It's also quite infeasible to buffer this amount of data in memory and write it at once, so we have to make multiple range <-> table conversions.

Our scenario also has to account for the fact that we get our data in batches from a web service. Not being able to use inter-column formulas, as you said, plus pivot tables/charts driven by the table breaking during downloads, and the latency itself, forced us to move away from this approach. We currently use binding.addRowsAsync with relatively better performance.

The range <-> table conversion also leads to Excel freezing (especially noticeable on the desktop) while the conversion happens, and the formatting changes during the conversion are visually "disturbing" as well :) Of course, all of these problems become noticeable once you start working with datasets in excess of several thousand rows.

Zlatkovsky commented 6 years ago

Yep, I totally understand. The workaround is not meant to say that it should be a permanent solution. We are absolutely working on finding something we can do within the product itself. It’s just that there are actually several compounding issues that contribute to this, so we don’t have an ETA for a full solution yet.

My intent is for anyone new stumbling onto the thread to see one possible workaround (for a subset of cases), until we have a permanent solution in place.

Zlatkovsky commented 6 years ago

I'm happy to say that we do have a solution to the issue. I don't have an ETA yet for when it will get to Insider Fast and to the general public, but I did just try the latest internal build on my laptop, and the result is much faster. E.g., the snippet that I sent above, which was taking ~4 seconds to execute, now executes in 0.3 seconds.

I will keep you posted once there is something that you can try out publicly. But good news is that the good news is coming :-)

sameera commented 6 years ago

Great news @Zlatkovsky! :) Just curious: 1) does the fix account for formulas on the table that recalculate on write? 2) would appending to the table in batches also be as fast?

Zlatkovsky commented 6 years ago

@sameera, so that I can try it out with the two parameters that you mentioned: do you want to take my “slow” snippet and modify it to your scenario, and send it back to me? I can then run it on my end, and check. Thanks!

sameera commented 6 years ago

@Zlatkovsky Please see https://gist.github.com/sameera/9c71b79aa354a4724da4a897d2d254c3 You can simulate the batched-writing by clicking on the Issue or Workaround buttons repeatedly.

I didn't really see a performance drop due to formulas. But you'll see the appending being extremely slow.

LanceEa commented 6 years ago

@Zlatkovsky @tong-1324 - I'm just checking back in to see how this work is going.

Also, which hosts will gain from these perf enhancements? Excel Online, Windows, Mac, or all? The reason I ask is that a majority of our initial users will be on Excel Online, because unfortunately our IT org hasn't rolled out Windows 10/Office 2016 yet.

weidezhong-zz commented 6 years ago

The improvement we are working on will be available on all platforms. You will likely get it for online sooner than other platforms as online has a faster release cadence.

deinspanjer commented 6 years ago

Has the general performance been tested on other platforms such as Mac?

I've got a use case where I'm trying to generate between 200 and 1000 rows with about 65 columns. I am frequently getting errors or crashes or on a good run, it takes 2 minutes with one core pegged at 100% CPU.

I have tried modifying the code to convert the table to a range as shown above, removed formulas, removed styling, and also tried setting calculation to manual. (I can't try application.suspendApiCalculationUntilNextSync since it isn't available for Mac yet.)

I'm at my wit's end trying to port fairly simple and straightforward code from Google Sheets / Apps Script over to Excel. :/ Any guidance would be appreciated.

deinspanjer commented 6 years ago

I just tried it on a little ThinkPad, and the performance is significantly better than the Mac, even though Excel is only 32bit on Windows. :/

Also, it appears that the workaround above doesn't work at all if the table has structured references in it. Is this something that will continue to be a problem after the improvement you are releasing?

weidezhong-zz commented 6 years ago

The general performance improvement is not targeting any specific platform, and it doesn't block any specific feature such as structured references either. Although Mac and Win32 share the same code, the builds of the two platforms are not always fully synced, which may explain why some Mac builds are slower than Win32 builds. In the end, we expect the two platforms to be fairly close in terms of performance.

sameera commented 6 years ago

@weidezhong Have the improvements been tested against the Script Lab snippet I posted above? I'm curious to know the outcome :)

weidezhong-zz commented 6 years ago

@sameera Yes, the improvement is slightly more effective than the workaround.

misaunde commented 6 years ago

+1 Any updates?

tong-1324 commented 6 years ago

Hi @misaunde, we have two major fixes to improve performance on all platforms. The first improves the range.values API and is available starting from build 9021. The second fixes an object-binding issue, and all range- and table-related APIs should benefit from it; it is available starting from build 9208. You just need to update Excel to the latest build to see these improvements.

In addition to the fixes above, we are currently working on another API to let developers temporarily suspend screen updating. Basically, it will work like suspendApiCalculationUntilNextSync(). Both of these can help improve overall performance. Also, we are going to publish a new document very soon about how to build better-performing Excel add-ins.
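The screen-updating suspension described here later shipped as Application.suspendScreenUpdatingUntilNextSync(); combined with the calculation suspension, a sketch might look like the following (requires a build that supports both calls; `address` and `data` are assumed inputs):

```javascript
// Sketch: suspend both recalculation and screen repainting for the
// duration of one batch, then let a single sync() flush everything.
async function quietWrite(address, data) {
  await Excel.run(async (context) => {
    const app = context.workbook.application;
    app.suspendApiCalculationUntilNextSync();
    app.suspendScreenUpdatingUntilNextSync();

    const sheet = context.workbook.worksheets.getActiveWorksheet();
    sheet.getRange(address).values = data;

    await context.sync(); // both suspensions lift after this sync
  });
}
```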

JuaneloJuanelo commented 6 years ago

@misaunde @sameera Hey guys, Juan here, PM for Excel.js. What's the latest on this issue? Are you happy with the perf now?

@tong-1324 can you please share our latest update and our document on perf?

I would like to get feedback on whether our perf improvements are awesome now.

sameera commented 6 years ago

@JuaneloJuanelo At the moment, we are unable to port our application to API 1.7 because many of our customers are stuck on the MSI version of Excel. We plan to start the move in our next major release around mid-Q4. Sorry, I cannot give you any feedback at the moment.

sameera commented 6 years ago

@JuaneloJuanelo I just tested my own gist sample with Script Lab. I see a significant improvement. I tested appending up to 5,000+ rows (in chunks of 100) and the elapsed time per append was under 0.5s (it started at around 0.1s and grew very slowly). This is now much faster than the workarounds! Thank you!

misaunde commented 6 years ago

@JuaneloJuanelo I have definitely noticed improvements writing plain text to Excel. Thank you. I just wrote 25,000 rows x 3 columns in under 0.3 seconds. That's fantastic.

My only concern is that as soon as we add more than plain text (formatting, formulas, cell coloring, conditional formatting, columnWidth adjustments, etc), the performance gains seem to disappear. For example, I wrote the same 25000 rows (x 1 column) with range.getCell(i, 0).format.fill.color = 'yellow' and it took 31.223 seconds, even using suspendApiCalculationUntilNextSync().

This remains our biggest challenge using Office.js as almost none of our tools write just plain text. In fact, we're in the process of switching our largest tool over to the server now due to this limitation. I would love to hear more about temporarily suspending the screen updating when writing and a timeline for that feature.

If it's any help to anyone else, one thing we discovered that was crucial for performance was setting range.numberFormat before assigning anything to range.values: https://stackoverflow.com/a/49616210
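That tip, plus range-level formatting instead of the per-cell getCell(i, 0) loop described above, can be sketched like this (hypothetical helper names; the format array must match the data's dimensions):

```javascript
// Build a 2D array of number-format strings matching `data`'s shape.
function buildFormats(data, formatString) {
  return data.map((row) => row.map(() => formatString));
}

// Sketch: set numberFormat and fill on the whole range before
// assigning values, so Excel isn't re-deriving formats per cell.
async function writeFormatted(address, data, formatString) {
  await Excel.run(async (context) => {
    const range = context.workbook.worksheets
      .getActiveWorksheet().getRange(address);

    // Formats first (the Stack Overflow tip), values last.
    range.numberFormat = buildFormats(data, formatString);

    // One fill for the whole range, not one call per cell.
    range.format.fill.color = "yellow";

    range.values = data;
    await context.sync();
  });
}
```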

Zlatkovsky commented 6 years ago

@misaunde , for the formatting case, could you please open up a separate question on https://stackoverflow.com/questions/tagged/office-js and link to it here? There is a workaround I can suggest that will likely do a lot of good for you -- but I want to make sure that other folks can also see it without it being lost as the 90th comment to an older issue. :-)

misaunde commented 6 years ago

@Zlatkovsky - Great news, here you go: https://stackoverflow.com/q/51735674/3806701