dfinke / ImportExcel

PowerShell module to import/export Excel spreadsheets, without Excel
https://www.powershellgallery.com/packages/ImportExcel/
Apache License 2.0

Export-Excel extremely slow on large spreadsheets #506

Closed worc4021 closed 5 years ago

worc4021 commented 5 years ago

Hi,

I've been trying to merge two Excel spreadsheets together. The source xlsx files are just over 3MB, but when I call Export-Excel -Verbose it seems that it flies through the headers and then stalls. These spreadsheets have about 1700 columns and 300 rows, i.e. their union has slightly more. But these numbers are not unheard-of. Is this expected to scale poorly, or am I doing something wrong? It already stalls when saving just one of them:

$sheet = Import-Excel -Path \\path\to\my\xlsx
$sheet | Export-Excel \\path\to\my\output -Verbose -Debug

I can see that it is trying to do something in the task manager, but it's not progressing after

VERBOSE: Cell '1:1737' add header [...] 

Cheers, Manuel

dfinke commented 5 years ago

@worc4021 if you just do the import-Excel, does it return all the data? Would it be possible to scrub the xlsx file and post it?

worc4021 commented 5 years ago

Hi! Thank you for your quick response. Unfortunately, the reason I left out most of the useful bits from the verbose output is that the spreadsheets I am trying to combine hold simulation output metrics, i.e. they are work related, so I won't be able to share them. However, the spreadsheets hold very simple data: each row corresponds to one simulation, and each data field is either empty, a string or a numerical value - no formulas or anything like that. I hope this still helps. Manuel

dfinke commented 5 years ago

Ok, if you only do an import-Excel, does it work? What's the response time like?

So 1700x300 empty, string and numeric data?

worc4021 commented 5 years ago

Import-Excel is fine; it takes about 1-2 seconds. 1700x300 was the one I tried; naturally, since each row corresponds to one sim and each column to one metric, the size can vary, but that's the ballpark.

dfinke commented 5 years ago

I created a sample set of objects. It took 22 minutes to export to Excel. There may be some low-hanging perf improvements that could be made; not sure how much effort that would take or how much improvement it would bring.

worc4021 commented 5 years ago

Thank you for looking into this!

  • So it is nothing I'm doing particularly wrong on my side.
  • Not sure whether this would be worthwhile. I just stumbled upon ImportExcel and thought it would be a great post-processing tool for when I compile my regressions to compare different sets of sims. The alternative is literally clicking 'Add rows' in Spotfire instead. So unless you see something that obviously makes it scale poorly AND know an alternative implementation, I would not advise changing the existing version.

Sad times.. I prefer scripting over clicking :D

Thank you, Manuel

pkarunkar commented 5 years ago

I had the same issue, but it seems to happen only when using the -Verbose switch. If you remove that, it should be faster. I work with Excel files of 50,000 rows x 25 columns.

With -Verbose it takes 20-30 minutes; without it, only 2 minutes or less.

dfinke commented 5 years ago

@worc4021 Yeah, friends don't let friends click. Exactly why I built this :)

@pkarunkar Good point, you're right. -Verbose slows things down a lot. Unfortunately, I wasn't using -Verbose, so this is probably the fastest it can be for now.

Maybe @jhoneill has some insight.

pkarunkar commented 5 years ago

:) I was just saying, Doug. I'm happy without verbose. I mean, who cares about the data scrolling up the host like the Matrix movie, unless I want to show off to some newbie that I am doing magic in PowerShell?

I just need the end result, and it works great for me. But some things are really magical..

worc4021 commented 5 years ago

Maybe just a defence of the -Verbose switch: when you run a command at the prompt and nothing happens for over 5 minutes, you might try -Verbose to see whether it produces any output... ;)

pkarunkar commented 5 years ago

Agreed, that's how I used to use verbose; then, once I am happy my script is working, I mostly convert it to a background job. So it differs from person to person. I still use verbose where it shows a few lines of information.

I just don't use it with Excel. Thanks for bringing it up though.

jhoneill commented 5 years ago

First on Write-Verbose: it slows things down quite a lot and I removed it from the tightest looping part of export-excel because (in this case) it would output over 500K messages. That's just too many to be helpful.
I just tried 1..500000 | foreach {$x = $_ }, which takes 7.7 seconds, versus 1..500000 | foreach {$x = $_; Write-Verbose -Message $_ }, which takes 45 seconds even when it is not outputting anything. @pkarunkar - are you getting these time differences with the current version, or are you getting lots of verbose output? I just did ps | export-excel -Now and ps | export-excel -Now -Verbose, and verbose was faster (but within experimental error).

Second. Don't ever, ever do Import-Excel | Export-Excel or $var = import ; $var | export - unless the data is very, very small and you want to lose any formatting you have on the spreadsheet. I wrote Copy-ExcelWorkSheet because you can copy a whole sheet within and between files VERY quickly. I also wrote Join-Worksheet to copy whole sheets to a common destination - this might not be exactly what @worc4021 wants, but it may be possible to use some of that code.

In this case - I'm not sure if there are perf implications to a PSCustomObject with 1700 properties, but that's more properties than I've ever seen before. If you know that your numbers are in numeric types, not strings, you'll get a decent speed improvement from -NoNumberConversion, but ultimately 1700x300 is 510,000 cells, and Doug's test did them at about 400 cells/sec. Each value is checked to see if it is a number, a hyperlink, a date, a timespan or a formula (and, without -NoNumberConversion, a number in the form of a string) before assuming it is a string. It is then poked into a cell, and if it is a date, timespan or hyperlink, formatting is applied (formatting cells one by one is an expensive operation).

Where there are huge numbers of rows it may be faster to build a DataTable in memory and use Send-SQLDataToExcel, and you can get Excel data with a SQL query … but it should be faster to use the copy functionality which is there.
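
A minimal sketch of the two alternatives mentioned above, with placeholder paths - an illustration rather than a tested fix for this particular workbook:

# Copy a whole worksheet between workbooks without round-tripping through PowerShell objects
Copy-ExcelWorkSheet -SourceWorkbook '\\path\to\source.xlsx' -DestinationWorkbook '\\path\to\output.xlsx'

# If the objects must be re-exported and the numeric columns are already typed as numbers,
# skip the per-value "number stored as a string" check
$sheet = Import-Excel -Path '\\path\to\source.xlsx'
$sheet | Export-Excel -Path '\\path\to\output.xlsx' -NoNumberConversion *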

pkarunkar commented 5 years ago

@jhoneill: you caught me. I experienced these delays in older versions of ImportExcel and totally stopped using verbose after that. Will try it tomorrow at the office and reply back.

pkarunkar commented 5 years ago

@jhoneill: OK, I faked the test at home by copy-pasting data into cells, making it 70k rows and 32 cols. As you said, verbose now looks faster - except when I used $var | Export-Excel, with or without verbose. (This is strange, as I do this all the time at the office with large Excel files and it has been working great without verbose; note that I haven't tested the latest version with verbose.) Attaching the Excel sheet for reuse and my Measure-Command outputs.

Using -NoHeader since a duplicate column header popped in while copy-pasting.

Measure-Command {Get-Service | Export-Excel $env:userprofile\Desktop\Newfolder\VerboseTest.xlsx}
Measure-Command {Get-Service | Export-Excel $env:userprofile\Desktop\Newfolder\VerboseTest2.xlsx -Verbose}

Measure-Command {Import-Excel $env:userprofile\Desktop\Newfolder\VerboseTestlargedata.xlsx -NoHeader}
Measure-Command {Import-Excel $env:userprofile\Desktop\Newfolder\VerboseTestlargedata.xlsx -NoHeader -Verbose}

Measure-Command {$var = Import-Excel $env:userprofile\Desktop\Newfolder\VerboseTestlargedata.xlsx -NoHeader}
Measure-Command {$var1 = Import-Excel $env:userprofile\Desktop\Newfolder\VerboseTestlargedata.xlsx -NoHeader -Verbose}

Measure-Command {$var1 | Export-Excel $env:userprofile\Desktop\Newfolder\VerboseTest.xlsx}
Measure-Command {$var1 | Export-Excel $env:userprofile\Desktop\Newfolder\VerboseTest2.xlsx -Verbose}

The last 2 commands took 8 minutes with or without verbose. That is still very fast compared to my past experiences. At the office I have different data types - string, date, numeric... I do not know if that would make any difference.

VerboseTestlargedata.xlsx

ExcelTestOutput.txt

jhoneill commented 5 years ago

I don't know if the EPPlus library gets slower to insert into the worksheet XML as the sheet gets bigger. I might try this out during the day and see whether the time goes up linearly or is more like a square law.

I made a 21,000 cell sheet (ps | Export-Excel). import-excel <file> | export-excel -NoNumberConversion * <file> took 6.4 seconds on my machine, which is a fairly quick 3270 cells per second. Copy-ExcelWorkSheet -SourceWorkbook <file1> -DestinationWorkbook <file2> takes 0.76 seconds - about 8 times faster. I think this factor increases with bigger sheets.

Hence my advice not to do import | export …. it also loses formulas, which copy preserves.

jhoneill commented 5 years ago

OK, I tried a couple of things. I made a spreadsheet with 50 columns x 200 data rows = 10,000 data cells and imported it:

$tenKCells = Import-Excel C:\temp\10KCells.xlsx

Then exported it

$tenKCells | export-excel c:\temp\10.xlsx -ReturnRange
$tenKCells + $tenKCells | export-excel c:\temp\10.xlsx -ReturnRange
$tenKCells + $tenKCells + $tenKCells + $tenKCells | export-excel c:\temp\10.xlsx -ReturnRange  

Double the rows and the time doubles, so EPPlus scales linearly: 1.3 seconds for 10K, 2.6 for 20K, 5.4 for 40K, 10.31 for 80K, 21.3 for 160K.

Next I switched to 200 columns x 50 rows. 10K cells should take ~1.3 secs; it took over 10 seconds. So I tried

$header = $tenkcells[1].PSObject.Properties.where( {$_.MemberType -like "*" }).Name  
$i = 0 ; foreach ($row in $tenKCells) { foreach ($Name in $Header) {$x = $row.$name ; $i++ }} 

7.4 seconds!

So, looking at each property of a PSCustomObject with hundreds of properties gets slow. (Although it didn't seem to get any worse when I changed to 25 rows by 400 columns. [Edit] There seems to be a step change somewhere in the region of 75 to 80 properties.)

All the more reason to use copy and not import | export when the data is hundreds of columns wide. (It's not that many versions since Excel lost the 256 column limit - this is the first time I've known of someone using > 256 columns in real-world use!)

dfinke commented 5 years ago

Off the top of my head: what if there were an -AsIs switch on the export that bypassed all the property inspection and just pushed all the values into the sheet? Don't know without testing.

jhoneill commented 5 years ago

IIRC - no. Everything goes via Add-CellValue.

One can try ($r is a collection of rows):

$t = $r | ConvertTo-Csv -NoTypeInformation
$x = $excel.Workbook.Worksheets[1].Cells["A1"].LoadFromText($t)

But this is very slow to save. There are other LoadFrom methods which may also be worth looking at. What slows it down is not so much the checking of types as getting 300 properties per object and poking them into cells one by one. It wouldn't take much to put some test code together to try skipping Add-CellValue, but my hunch is it won't change the time much.
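
As a hedged sketch of one of those other LoadFrom methods, driven from PowerShell (this assumes a populated System.Data.DataTable in $t and a throwaway path; it only illustrates the underlying EPPlus call):

$pkg = Open-ExcelPackage -Path C:\temp\bulk.xlsx -Create
$ws  = $pkg.Workbook.Worksheets.Add('Data')
$null = $ws.Cells['A1'].LoadFromDataTable($t, $true)   # $true writes the column names as a header row
Close-ExcelPackage $pkg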

ctrl78 commented 5 years ago

Hi, I had the same issue when I upgraded to the latest version; moving back to version 5.0.1 fixed the slowness issue.

MarcelSchuermann commented 5 years ago

I still have performance issues with Export-Excel in version 5.4.4. Any news on how to improve performance?

Edit: The approach of @jhoneill with Send-SQLDataToExcel was a lot faster than saving the data in a variable first. However, I am now getting the following error when using Send-SQLDataToExcel:

At C:\Program Files\WindowsPowerShell\Modules\ImportExcel\5.4.4\Set-Column.ps1:141 char:9
         $columnName = [OfficeOpenXml.ExcelCellAddress]::new(1,$column).Address - ...
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : MethodNotFound

Any ideas?

The Excel file is created, though, but the conversions are also not as I wish - int or decimal(38,5) values are converted to dates, for example.

Edit 2: In my local environment I have PS version 5.1.x and in my problem environment I have PS version 4.x. Might that be the problem?

Edit 3: I figured it out - it was caused by the old PS version 4.0. With 5.1 I have no problem.

ericleigh007 commented 5 years ago

So just to kick this dead horse for a while longer.

If I do $rows = Import-Excel 'file'    # with a 15k row file, that's about 8 seconds.
If I do $rows | Export-Excel 'file1' -AutoSize -AutoFilter -FreezeTopRow    # with those rows, that's about 2.08 minutes.
If I do $rows | Export-Excel 'file2'    # that's about 1.30 minutes.

To me, that means the save time is excessive. We must be doing a lot of chunking through the data in memory (since I don't see -AutoSize and -AutoFilter having any effect on write speed).

Am I crazy in hoping to see an improvement of 5-10 times here?

Thoughts?

dfinke commented 5 years ago

Expected. There is no bulk export. Each value in the data is inspected for a number of things as it is set in the worksheet cell.

jhoneill commented 5 years ago

I think we need to be clear that Export-Excel is not making any attempt to be optimized for speed. It is trying to be general purpose and export small to medium volumes of data. It adds each cell individually, which you would expect to be slower than (for example) adding a database table in one block. It checks every value to see if it is a formula, a hyperlink, a date, or text that needs converting to a number, and processes it accordingly.

On my machine with 500 items in a folder this takes 1.2 seconds

dir | select -Property name,Length,LastWriteTime | export-excel \temp\foo.xlsx -BoldTopRow -FreezeTopRow -AutoSize -AutoFilter

The rate is 1000-2000 cells per second. It will vary a little, and obviously with machine speed, but that's the order of magnitude. Edit: my test on the train this morning seems to have run at about half normal speed; at home I'm getting around 3000-4000 cells/sec.

By comparison this runs in 0.2 seconds

$t = New-Object -TypeName System.Data.DataTable
[void]$t.Columns.Add("Name",[string])
[void]$t.Columns.Add("Length",[int])
[void]$t.Columns.Add("Lastwrite",[datetime])
dir | foreach {$r = $t.NewRow() ; $r.name=$_.name; $r.length = $_.Length ; $r.lastWrite  = $_.LastWriteTime ; $t.rows.Add($r) }
Send-SQLDataToExcel -DataTable $t -Path \temp\foo.xlsx -BoldTopRow -FreezeTopRow -AutoSize -AutoFilter

If speed of execution is more important than how quickly you can put something together (etc.), AND you have the knowledge of the data needed to produce it, then there are optimizations you can do for your data …

However, I have found a significant optimization which I'm going to talk to @dfinke about.

ericleigh007 commented 5 years ago

The DataTable workaround works well in my case. I'll refactor my stuff to use DataTables instead, using ConvertTo-DataTable, which will speed it up by perhaps 10x.

Thanks for suggesting it.

Anything you can do with a "significant optimization" would also be welcome, of COURSE.

jhoneill commented 5 years ago

@ericleigh007 You asked for a 5-10x improvement. I give you ... ~10x improvement - there was something I looked at somewhere else which I thought was too hard to change. It makes the script harder to read/maintain, but the perf change is clear - getting up to 70K cells per second on my machine. There's some variability - the difference between runs is greater than the change with -NoNumberConversion and -AutoSize.

(benchmark screenshot attached)

@dfinke this passes all the build tests, and I should have it on my repo shortly - just a couple of final things to clear up. It is worse from a reading/maintaining point of view, but not massively so.

dfinke commented 5 years ago

Cool! Reminds me of the below, which never sat well with me. I'd prefer a different function so it can "more easily" be recognized, both in discovery and maintenance.

Haven't looked at what it would take to refactor though.

Get-Process | Get-Member

vs

Get-Member -InputObject (Get-Process)

jhoneill commented 5 years ago

That opens up a whole can of worms.... In a lot of places, if $x is an array, passing it via -InputObject causes the command (e.g. Get-Member) to look at the array itself; in others (e.g. Format-Table) the members get processed. I think command2 $(command1) and command1 | command2 are about equal - it depends whether you're thinking step-by-step or end-product-first.
command1 | command2 | command3 | command4 is easier than
command4 (command3 (command2 (command1)))
Though that's often how my Excel formulas look - I'm a recovering LISP programmer, so that's to be expected.
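
A short illustration of that difference, using throwaway values:

1, 2, 'three' | Get-Member               # piping unrolls the array: members of Int32 and String are reported
Get-Member -InputObject (1, 2, 'three')  # the whole array is treated as one object: members of Object[] are reported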

What I've done is

  1. Renamed -TargetData to -InputObject (but with an alias so existing use of it doesn't break) and then do foreach ($targetData in $inputObject). This gives a small speed improvement over piping (a minimal sketch of the alias pattern follows this list).
  2. Moved the Add-CellValue function into the main body - cutting out the function call gives a significant speed improvement.
  3. Moved the handling of "simple" types so the moved code doesn't need to be duplicated.
  4. Taken the "Autoformat only the first X rows" idea from #555 and also added special-case handling for a table (if $inputObject is a table, insert it in the begin block and remove it before the process block) - there was something about that in #555 as well.
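
A minimal sketch of the rename-with-alias pattern from item 1 (parameter block only, simplified rather than the module's actual declaration):

param(
    [Parameter(ValueFromPipeline = $true)]
    [Alias('TargetData')]   # the old name keeps working for existing callers
    $InputObject
)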

ericleigh007 commented 5 years ago

Admittedly, sometimes PowerShell trying to help you with arrays vs objects gets sort of confusing.

I use [array]$obj in my code to make sure that things that I want to be arrays are always arrays.

Thanks, guys. This is awesome.

ili101 commented 5 years ago

@jhoneill Thank you for implementing some of the stuff from my fork. The changes sound good; I hope it goes well so I can finally use the official version and stop needing to maintain my fork. Please update when it's ready so I can give it a look and do some testing.

dfinke commented 5 years ago

@jhoneill Knew you were a LISP coder. What's next? cdr car?

Probably going to need to float this as a preview so hopefully we can get this exercised and shake out as many issues as possible.

jhoneill commented 5 years ago

@dfinke ah yes, "You do anything with car, cdr, cond and setq". I haven't touched LISP or Pascal this century, or Occam, or Forth, or Assembler. But "everything's a list" …. er … hash-table, and "we don't need no stinking variables, nest everything!" is still with me.
I've given it a reasonable shake through the existing tests, but there's enough of a change to make a preview a good idea.

jhoneill commented 5 years ago

@ili101, you're welcome. Your ideas were all good; I was just nervous about getting them all in and not breaking other things. I've also committed some changes to make Send-SQLDataToExcel a lot more compact. It now brings the parameters from Export-Excel in as dynamic parameters rather than declaring everything - that was a reminder of why I hate both dynamic parameters and parameter sets.
Because the table insert now happens in Export-Excel, some of the things around tables, ranges and auto-sizing which need to happen with the insert can move back into Export-Excel too - that was something else you made the case for.

dfinke commented 5 years ago

The zip of the module and the updates can be downloaded here: http://bit.ly/2OTxoaO

uSlackr commented 5 years ago

Wow, this is so much faster. Just yesterday I timed the export of 950 custom PS objects to Excel and it took 18 seconds. Since the update it is nearly instantaneous. Thanks for this excellent tool.

dfinke commented 5 years ago

Curious, how many properties on each?

uSlackr commented 5 years ago

10-ish. Not much. Actual time went from 18 secs to sub-3 secs, on an SSD.

davidhigh commented 2 years ago

First of all -- thanks for this great package!

Here is just a quick presentation of my experiences, in case anybody comes here for advice on how to speed up their script ... as this is basically the only post around the web that discusses this topic.

I had a long running script where Export-Excel took a significant part of the total run-time. The script focuses on regression testing and has to process and inspect all exported data. The whole, relatively complex data wrangling part took only about 5% of the total runtime, whereas Export-Excel took the other 95%.

After several trials (like converting to CSV, exporting and using text-to-columns, etc.) I tried the Send-SQLDataToExcel solution by @jhoneill, which solved my issue. With this, the export to Excel takes time of the same order as the data processing part, and it brought my runtime from 20 minutes down to about 1 minute.

I just wanted to highlight this point, as the statement by jhoneill reads as "not so great", which motivated my own wicked trials ... in fact, the Send-SQLDataToExcel approach is great for performance.

uSlackr commented 2 years ago

It might help to say how large the data set is, whether any special export features are being used, and which version of ImportExcel you have. I routinely process 10,000 lines with little impact from the export. Good tip on the Send-SQLDataToExcel.

jhoneill commented 2 years ago

@davidhigh Thanks for the comments. Export-Excel can manage thousands of cells per second, but when you have millions of cells - either many rows, many columns, or the product of both - execution time balloons. It does seem to be linear - it doesn't take longer to add row 100,000 than row 100. I found that the library we call to do all the work underneath import and export has the ability to push a table straight in, so I added support for it. I don't know how the library does it, but it inserts very large amounts very quickly. This route wins if your data is already available in a table object; if you need to load data from something else into a table, that can use up all the time you're saving.

davidhigh commented 2 years ago

I ran some further tests and here are the timings:

For 5000 rows and 55 columns:

  1. Export-Excel took 5.5 seconds
  2. Send-SQLDataToExcel took 0.5 seconds

For 30,000 rows and 150 columns (about a factor of 16 larger in the number of rows times columns):

  1. Export-Excel took 57 minutes (with -NoNumberConversion * the sample took 54 minutes)
  2. Send-SQLDataToExcel took 3 seconds (!)

I took care to measure only the runtimes of the pure commands, i.e. without the time needed to generate the data. For this I called the Export-Excel cmdlet from the pipe as $myData | Export-Excel -ExcelPackage $pkg ..., but after inspecting the code this seems equivalent to -InputObject (as long as the input is not a [DataTable]).

Two conclusions:

In my case, as I generate the comparison data during program execution, I could simply switch to [DataTable] instead of [PSObject] as the results object, with no significant impact on the runtime. But even if the data is only available as standard PowerShell objects, converting it to a [DataTable] first and then doing the export to Excel seems advantageous, which is basically trading memory for runtime.

My suggestion for improvement is thus to incorporate the "convert to DataTable, then export" pattern into Export-Excel, possibly via a switch -AsDataTable. The suggested usage would be:

$obj = [PSCustomObject]@{ ... }
$obj | Export-Excel -AsDataTable -Path "test.xlsx"

The ConvertTo-DataTable part can be based on this function here.
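
For readers without such a function to hand, a stripped-down, hypothetical sketch of the pattern (it creates every column as a string for brevity; a real ConvertTo-DataTable would preserve types):

function ConvertTo-DataTable {
    param([Parameter(ValueFromPipeline = $true)][psobject]$InputObject)
    begin   { $table = New-Object System.Data.DataTable }
    process {
        # Create the columns from the first object's property names
        if ($table.Columns.Count -eq 0) {
            foreach ($name in $InputObject.PSObject.Properties.Name) { [void]$table.Columns.Add($name, [string]) }
        }
        $row = $table.NewRow()
        foreach ($name in $InputObject.PSObject.Properties.Name) { $row[$name] = [string]$InputObject.$name }
        [void]$table.Rows.Add($row)
    }
    end     { ,$table }   # emit the table as a single object
}

$table = $myData | ConvertTo-DataTable
Send-SQLDataToExcel -DataTable $table -Path .\test.xlsx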

jhoneill commented 2 years ago

Keep in mind that Export-Excel works like this.

The time to execute the PowerShell part should scale linearly. Poking cell by cell is going to get slow when you get to big numbers; I'm a little surprised that you got 50,000 cells per second with your first sample.
IIRC the XLSX format doesn't store repeating values in each cell; it stores each value once, and each cell with a value is a pointer - so each value added requires a lookup to see if that value already exists. Depending on how that is implemented in the library we call, I can see how it might put an n-squared factor into the execution time, but your 16.3x number of cells would then have taken 5.5 s × 16.3² ≈ 24 minutes, so it is even worse than that with large numbers. Quite what magic is being done to turn a 30,000 x 150 table into XML format so fast I don't know, but selecting the unique values from 4,500,000 and translating 4,500,000 values to pointers in one pass is going to be quicker than doing them one by one.
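
A quick back-of-the-envelope check of that square-law estimate, using the timings reported above:

$cellRatio = (30000 * 150) / (5000 * 55)            # ≈ 16.4 times as many cells
$squareLawSeconds = 5.5 * $cellRatio * $cellRatio   # ≈ 1470 seconds, i.e. roughly 24 minutes
# The observed 57 minutes is worse still, so the growth here is more than quadratic.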

Perhaps writing something to create tables before sending them is the way forward for people with requirements for very large amounts of data.