dfinke / ImportExcel

PowerShell module to import/export Excel spreadsheets, without Excel
https://www.powershellgallery.com/packages/ImportExcel/
Apache License 2.0
2.47k stars 399 forks source link

Import of worksheet extremely slow after upgrade to v4.0.8 #264

Closed mcgoo49 closed 6 years ago

mcgoo49 commented 6 years ago

I have a script that uses the Import-Excel module to read from an Excel file and then export the data to CSV. I have been using v3.0.1, and it has been working flawlessly. I recently upgraded the module to 4.0.8, and an import of a worksheet (~5300 rows and 36 rows) which previously took 3 seconds, now takes over two minutes. It took me a while to ferret out what was causing the slowdown, but finally narrowed it down to the worksheet import, and used Measure-Command doing a stand-alone import of the worksheet.

Once I uninstalled 4.0.8 and went back to 3.0.1, the observed time went back down to the 3 second timeframe. All tests were done on the exact same file, and similar results were observed on Windows 10 and Windows 7.

dfinke commented 6 years ago

Thanks for reporting. Did you mean 5300 rows and 36 columns? Also, could you post the xlsx? Wanted to get a good representation of the data.

mcgoo49 commented 6 years ago

Sorry, yes on the 36 columns. I stripped down my test Excel file to just the single worksheet and sanitized the data. I just re-ran the test with both v4.0.8 and v3.0.1 and these were the results

$path = "c:\temp\test.xlsx" $worksheet = "vms" measure-command {Import-Excel -path $path -WorkSheetname $worksheet}

v4.0.8 - TotalSeconds: 100.1808011 v3.0.1 - TotalSeconds: 3.2720823 Test.xlsx

Thanks for your help!

dfinke commented 6 years ago

Confirmed.

@DarkLite1 I'm thinking of taking the 3.01 looping code for the Import-Excel and using it in the latest version. I wanted to get your input on the approach.

DarkLite1 commented 6 years ago

This is inherent to the analyses done on the data. Did you try with the -DataOnly option? Anyhow, feel free to adjust but I'm quite sure it can't be made faster because of the current design.

mcgoo49 commented 6 years ago

@DarkLite1 , I tried with the -DataOnly option, and it actually took about 16 seconds longer than without it.

DarkLite1 commented 6 years ago

The problem with the old version is that i'ts not consistent. Using the Dimension property on a worksheet is not always correct, especially when there's only one cell with data in the worksheet and some other rare cases. There's a reason why we don't use that anymore in the latest version.

Version 3.0.1. was piping the data if memory serves, which is always faster than collecting it and then start filtering out things. So for speed, yes, I agree that the old version is faster. But for stability and code accuracy I have to say no.

So the challenge ahead is simple, if the Pester test file succeeds after the modifications all is good with me. I'm sorry I can't spend more time on the module for the moment as I am on vacation for 4 weeks. But I trust you guys to figure it out :wink:

dfinke commented 6 years ago

Well, the work around is to use the old version. I'll take a look. Maybe a an Import-Excel with a switch -RawSpeed is needed.

PowerShellToday commented 6 years ago

Hello, Quite new to the workflow of these kind of Projects. But if found a solution to speed things up for larger files at least. My sheet has ~52000 rows and 8 colums. Found out that these sort-object -unique are quite slow on these large arrays. So in the regions Get rows and columns and Filter out rows with data in columns... there is some time to save. Did some modifications that shaved the time to import from 357s to 79s.

1 Using .where instead of where-object 2 And the big improvments came when i replaced sort-object -unique with Calling on a new function I made.

function ConvertTo-SortedUnique
{
    <#
    .SYNOPSIS
        replacement for sort-object -Unique
        Much quicker
    #>

    Param (
        [Parameter(Mandatory)]
        [int32[]]$IntArray
    )
    $HashSet = new-object System.Collections.Generic.HashSet[int32]
    $IntArray.ForEach({[void]$HashSet.Add($_)})
    $List = new-object system.collections.generic.List[int32] $HashSet
    $List.Sort()
    Write-Output $List.ToArray()
}

See https://stackoverflow.com/questions/32385611/sort-very-large-text-file-in-powershell So for example

$Columns = $CellsWithValues.Start.Column | Sort-Object -Unique

gets replaced to

$Columns = ConvertTo-SortedUnique -IntArray $CellsWithValues.Start.Column

Added som verbose information Before and after change that timed the different sections. (I used the -dataonly switch) Before:

VERBOSE: Open file: 0,1565952
VERBOSE: Select worksheet: 1,7640999
VERBOSE: Set the top row: 0
VERBOSE: Get rows and columns: 207,3100862
VERBOSE: Create property names: 0,0080262
VERBOSE: Filter out rows with data in columns...: 96,5790907
VERBOSE: Filter out the top row when...: 0,3981653
VERBOSE: Create one object per row: 51,3791115

After:

VERBOSE: Open file: 0,1712032
VERBOSE: Select worksheet: 1,8459298
VERBOSE: Set the top row: 0
VERBOSE: Get rows and columns: 18,5084588
VERBOSE: Create property names: 0,0109703
VERBOSE: Filter out rows with data in columns...: 11,4135941
VERBOSE: Filter out the top row when...: 0,357216
VERBOSE: Create one object per row: 47,4332509

There is still the time consumed of Create one object per row but at least this would be a step in the right direction?

So is the next step to fork this and make a pull-Request?

PowerShellToday commented 6 years ago

Ok, my bad now I realised that there is a PR that does pretty much the same thing with another approach, in https://github.com/dfinke/ImportExcel/pull/337/files#diff-f50d26003c0baf98f417a44fc1cac92b

dfinke commented 6 years ago

Thanks @PowerShellToday, yeah I'm working to merge that PR.

Would you be interested in testing the module after I do?

PowerShellToday commented 6 years ago

Hello, @dfinke, I could definitely help out with Import-excel. I’m not so familiar with the other ones. There are many changes in that PR…, Maybe try to split it up in smaller chunks?

dfinke commented 6 years ago

@mcgoo49, @PowerShellToday. I created a release, https://github.com/dfinke/ImportExcel/releases/tag/v5.0.0

It has updates for performance with Import-Excel. Could you try it out, see if it is faster for you and if it breaks anything?

Thank you

PowerShellToday commented 6 years ago

Huge improvements, can't see anything breaking after some small tests. Will try to write som Pester-tests to help out with the testing. For my example above the load-time Went from 352s to 50s so great improvments !! Nice work!

DarkLite1 commented 6 years ago

That's fantastic guys! I love how the open source world cooperates. When the accompanying Pester test file throws no error, you can be sure it's production ready. Good job guys!

dfinke commented 6 years ago

Thanks for testing @PowerShellToday! Thanks @DarkLite1 and a huge thanks to @jhoneill for this PR. I started creating unit tests and wired up continuous integration with Appveyor. I did a simple timing test so when I merged @jhoneill PR it gave some indication of the improvement. On a 4K row file it went from 2000 ms to ~400ms. So, sub second response time (plus I find Appveyor to be slower).

Good stuff!

I'd like to focus on a pattern for unit tests, not sure what that is yet. Getting to fair level of tests and coverage will help in ruthless refactoring efforts and being more agile in delivering features.

Thanks everyone, much appreciated.

mcgoo49 commented 6 years ago

Yes, I’ll test it out vs the existing module I’m using. (3.0.1).

Thanks for doing this!

Mark

From: Doug Finke notifications@github.com Sent: Saturday, June 9, 2018 8:26 AM To: dfinke/ImportExcel ImportExcel@noreply.github.com Cc: Mark McGill mmcgill@vmware.com; Mention mention@noreply.github.com Subject: Re: [dfinke/ImportExcel] Import of worksheet extremely slow after upgrade to v4.0.8 (#264)

@mcgoo49https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmcgoo49&data=02%7C01%7Cmmcgill%40vmware.com%7Cd03898009e9b4cf4279908d5ce042e18%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636641439705958365&sdata=TzYjvyed4RX4muBR0qWLxsdJHEg%2BnpBeTnlyKX6%2BX7A%3D&reserved=0, @PowerShellTodayhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FPowerShellToday&data=02%7C01%7Cmmcgill%40vmware.com%7Cd03898009e9b4cf4279908d5ce042e18%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636641439705968369&sdata=lrYroBiCjWzg2rvt5Gibk%2FYchKXbqpJinB9tZ3sMnN4%3D&reserved=0. I created a release, https://github.com/dfinke/ImportExcel/releases/tag/v5.0.0https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdfinke%2FImportExcel%2Freleases%2Ftag%2Fv5.0.0&data=02%7C01%7Cmmcgill%40vmware.com%7Cd03898009e9b4cf4279908d5ce042e18%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636641439705968369&sdata=wdxc2YaUA23S%2B5AXSAPoJSlUpaWNfJ3oUdcyyw%2BEHK4%3D&reserved=0

It has updates for performance with Import-Excel. Could you try it out, see if it is faster for you and if it breaks anything?

Thank you

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdfinke%2FImportExcel%2Fissues%2F264%23issuecomment-395965456&data=02%7C01%7Cmmcgill%40vmware.com%7Cd03898009e9b4cf4279908d5ce042e18%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636641439705978381&sdata=rMhfZQfS0o4hTDVt0v9etSqsRIQFbO%2F5pBxLCM%2FcSuU%3D&reserved=0, or mute the threadhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAbFTEjbW8JEA2IXInbIovgSxLGUJ5eTYks5t677ggaJpZM4Q-OSx&data=02%7C01%7Cmmcgill%40vmware.com%7Cd03898009e9b4cf4279908d5ce042e18%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636641439705988394&sdata=%2Fb8MN00UsdBtTwzKJNdylrHHBdo7g6qJdaIMJr7qSEQ%3D&reserved=0.

jhoneill commented 6 years ago

@dfinke , @PowerShellToday The bigger the file , the more time you save : sort seems to start as a quick sort, and fall back to a heap sort so anything which involves a sort on 10s of thousands of cells can get painful. Because PowerShell's hash tables are very efficient, if you want one property of something, or to ask "Are A..Z in a this large set" then they can save massive amounts of time. I figured out I could avoid the sorts completely just do a lot of insert-into-hash and then
$rows = ( $StartRow..$EndRow ).Where({$rowHash[$_]})

I knew that the .where method was faster than | where-object, and with big enough sheets it would matter - and the other thing which saved a surprising amount of time was 1..1000000 | foreach {function $_} is a lot slower than 1..1000000 | foreach {body of the function}

I wrote that up here https://jamesone111.wordpress.com/2018/05/14/a-couple-of-easy-boosts-for-powershell-performance/ which links to something I wrote on the speed ups I got with hash tables.

PowerShellToday commented 6 years ago

@jhoneill, Thanks for sharing. a great post. The thing about Calling a function vs run the code in the foreach block was interessting.... THANKS!

DarkLite1 commented 6 years ago

@jhoneill Amazing article! Learned something seriously interesting! Thanks for sharing 👍

jhoneill commented 6 years ago

@DarkLite1 , @PowerShellToday you're both very welcome :-)