michalporeba opened this issue 6 years ago
The problem is with SMO and the Databases collection. When it is used in Get-DbaDatabase.ps1 (line 258), it fires a number of requests to the server, including the SELECT CASE WHEN has_dbaccess... query, for each database.
I can see the value of having an SMO Database object returned by Get-DbaDatabase in dbatools; I see how it makes tests easy to write and pretty to look at, but the same thing makes the tests painfully slow at scale and over the network.
It would probably be possible to get somewhat better performance by using SetDefaultInitFields, but then the requested fields would have to be passed somehow from each test to Get-DbaDatabase, and that would get ugly quickly while still not really solving the issue of the unnecessary calls.
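For context, this is roughly what using SetDefaultInitFields looks like (a minimal sketch assuming the SMO assemblies are already loaded, e.g. by importing dbatools; the server name and field list are placeholders):

```powershell
# Minimal SetDefaultInitFields sketch: limit which Database properties SMO
# fetches when the Databases collection is enumerated. Any property not
# listed here still triggers a lazy round trip the first time it is accessed.
Import-Module dbatools   # loads the SMO assemblies

$server = New-Object Microsoft.SqlServer.Management.Smo.Server 'localhost'
$server.SetDefaultInitFields(
    [Microsoft.SqlServer.Management.Smo.Database],
    'Name', 'Status', 'Collation'
)

$server.Databases | Select-Object Name, Status, Collation
```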
I would prefer to avoid using SMO altogether and instead have something like Get-DbcDatabase or Get-DbcDatabaseForTesting that would use T-SQL queries to get information about all of the databases on the instance (or only those requested) in as few queries as possible. For further efficiency the databases could be cached and shared between the tests.
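Something along these lines is what I have in mind (Get-DbcDatabase is hypothetical and the column list is only an illustration; the plumbing assumes dbatools' Invoke-DbaQuery is available):

```powershell
# Hypothetical Get-DbcDatabase: one T-SQL round trip per instance instead of
# per-database SMO enumeration. The column list is illustrative only.
function Get-DbcDatabase {
    param (
        [Parameter(Mandatory)][string]$SqlInstance
    )

    $query = @"
SELECT  d.name                   AS Name,
        d.collation_name         AS Collation,
        SUSER_SNAME(d.owner_sid) AS Owner,
        d.is_auto_close_on       AS AutoClose,
        d.is_auto_shrink_on      AS AutoShrink,
        HAS_DBACCESS(d.name)     AS IsAccessible
FROM    sys.databases AS d;
"@

    Invoke-DbaQuery -SqlInstance $SqlInstance -Query $query
}

# Example usage:
# Get-DbcDatabase -SqlInstance 'sql01' | Where-Object AutoShrink -eq 1
```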
Let me know what you think. In the meantime I'll follow this path and see where it takes me.
i feared that was the case :disappointed:
we do use SetDefaultInitFields
- i checked every one that didn't cause enumeration and added it to our internal connect-sqlinstance.
we should probably switch to connect-dbainstance then execute a t-sql command for each case where you're severely slowed.
With hundreds of databases it looks like I'm 'severely slowed' in every test I tried. So here is what I have done:
I created a function in Database.Tests.ps1 that gets the database list with (so far only some) properties obtained through a single T-SQL request. The results are cached, so the next time in the same run we don't have to go back to the server.
I have updated only three tests so far: DatabaseCollation, SuspectPage, ValidDatabaseOwner. Previously the longest I waited for a single test to finish was ~90 minutes before I killed it. With the changes the three tests were done on all 700 databases in 43 seconds.
Please have a look and let me know what you think: https://github.com/michalporeba/dbachecks/tree/faster_db_checks https://github.com/michalporeba/dbachecks/commit/968198123ba6853e05734f0d62deafccfb3f998d
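The caching itself is nothing fancy; this is roughly the shape of it (the names and the query here are illustrative rather than exactly what is in the branch):

```powershell
# Per-run cache: query the instance once, reuse the result for every check.
$script:DatabaseInfoCache = @{}

function Get-CachedDatabaseInfo {
    param ([Parameter(Mandatory)][string]$SqlInstance)

    if (-not $script:DatabaseInfoCache.ContainsKey($SqlInstance)) {
        # Single round trip; subsequent checks in the same run read from memory.
        $script:DatabaseInfoCache[$SqlInstance] = Invoke-DbaQuery -SqlInstance $SqlInstance -Query @"
SELECT d.name, d.collation_name, SUSER_SNAME(d.owner_sid) AS owner
FROM sys.databases AS d;
"@
    }

    $script:DatabaseInfoCache[$SqlInstance]
}
```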
Thank you @michalporeba
I would not want the function in the test; we would move that to the internal functions folder, but this is a great start
I would be interested to know your thoughts as to the better performing (at a larger scale) option
1. Replace calls to Get-DbaDatabase with an internal helper function returning an object with verifiable properties that we can test. Pros - reduces crazy SMO calls. Cons - needs updating for additional tests that use alternative properties.
2. Call Get-DbaDatabase at the top of the script, providing an object that can be tested throughout the tests. Pros - one call to Get-DbaDatabase. Cons - still using SMO, so performance suffers, and whilst individual test configuration is possible, should we be getting all of the information even for databases we have configured later not to test for this thing?
This (number 2 above) is probably my least favourite.
3. Internal helper function (using Get-SqlInstance) to get the names of the databases on the server, with each test using that to gather the correct information for that test.
Pros - enables lightweight calls to each instance and quicker responses. Cons - test writing may get a little more complex.
3a. If we were to use 3, create helper functions for gathering the information for each test; these could use SMO, T-SQL or magic to get the required information - the only requirement is that it is accurate and performant (see the sketch after this list).
Pros - enables each test to be improved in its own right for performance; should enable lightweight testing. Cons - more coding, bigger learning curve for new people to add their tests.
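To make 3a a bit more concrete, the shape could be something like this (the gatherer name, query and assertion are all hypothetical, using Invoke-DbaQuery from dbatools and Pester v4 syntax; $instance stands in for the target):

```powershell
# Option 3a sketch: one lightweight gatherer per check, and the test only
# asserts on the object it returns.
function Get-DbcAutoCloseSetting {
    param ([Parameter(Mandatory)][string]$SqlInstance)
    Invoke-DbaQuery -SqlInstance $SqlInstance -Query 'SELECT name, is_auto_close_on FROM sys.databases;'
}

Describe 'Auto Close' {
    foreach ($db in (Get-DbcAutoCloseSetting -SqlInstance $instance)) {
        It "$($db.name) on $instance should have Auto Close disabled" {
            $db.is_auto_close_on | Should -Be $false
        }
    }
}
```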
Discuss :-)
I suffered this problem in Get-DbaDatabaseState and already tried a workaround: it worked well.
I was thinking of adding the same to Get-DbaDatabase as well, because more and more commands are impacted: if the targeted instance has, let's say, an offline db, and you target a single database, you still see SMO trying to enumerate something on the non-targeted database.
The workaround would filter based on the query and build SMO info just for the required databases.
In the end, Get-DbaDatabase -SqlInstance foo probably won't see a difference, but a lot of use-case scenarios use some filters... 90% of which are Name and/or IsAccessible: that case shows a 20x speedup in preliminary testing when you target a single db on an instance that holds 100.
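For illustration, the idea is roughly this (a sketch only, not the actual patch; the instance and database names are placeholders):

```powershell
# Resolve the databases that actually match the filters with one cheap query,
# then build SMO objects only for those instead of enumerating them all.
$instance = 'sql01'        # placeholder
$dbName   = 'MyDatabase'   # the single database being targeted

$server     = Connect-DbaInstance -SqlInstance $instance
$accessible = Invoke-DbaQuery -SqlInstance $server -Query 'SELECT name FROM sys.databases WHERE HAS_DBACCESS(name) = 1;'

$accessible |
    Where-Object name -eq $dbName |
    ForEach-Object { $server.Databases[$_.name] }   # SMO touched only for the match
```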
Nice to see you here @niphlod 🤗
If we go with #1 or #3: I believe we'd need to use Get-DbaInstance and not the internal command, Get-SqlInstance. This is because of scoping - the Pester tests will be executing commands that are available to the users.
the approach would benefit ALL commands running Get-DbaDatabase, for starters, both for tests and every other use case. IMHO it'll solve a big part of the issues, and then we can find out the best way to rewrite the tests accordingly
That's really cool @niphlod. I think it's likely that a lot of the usage for dbachecks will be using all of the databases, or perhaps only filtering out system dbs, so dbachecks will still see the perf hit
Sounds fab
let's start small with an example back in dbatools, then go from there? fortunately what I thought about is "incremental" and not disruptive.
That being said, Get-DbaDatabase is going to have a speedup when "preselecting" databases, but if further down the line you need the full SMO for each and every database... it's not going to be that much faster. The same applies to having Get-DbaDatabase report info about last backups, which has its own impact.
all of which makes me wonder if we should move over to better, lightweight tests built for each check
So for me option #2 doesn't solve much, as even a single execution of Get-DbaDatabase takes hours, so caching the results will cut the execution from days to hours, but it is still not enough.
Personally I would strongly prefer option #1 with caching. Get the information once and run all the necessary tests locally. (Perhaps a couple of times for different tests if it makes sense to have multiple functions.)
As to the placement of the helper function, I started with internal/functions but then I realized it is really part of the test, and it might need to be modified with every new test. That is why I moved it to the test file; I wanted to avoid changes to the 'internal' stuff with every new test we write. I see it as data collection for the test, nothing else. But we can move it back if you feel strongly about it.
I know the general advice is one change per branch, per PR, but here the change is in how the tests collect the information, so for the existing tests I think it makes sense to make all the changes in one branch and one bigger PR.
I'll keep pushing today the way I started to see if I find any other issues.
Not to push this down, but maybe wait a day or two for that PR to be in. Get-DbaDatabase showed "exponential" resource usage on instances with a lot of databases (meaning the current version loses a bit of time on instances with 10 databases, 100x for 50 databases, and so on).
I'm testing your change now. I have cloned it from your repository. I'm running simply Get-DbaDatabase -SqlInstance $server. It's been running for 10 minutes already and there is no sign of any results.
This may be one of the properties that causes SMO enumeration. In that case, we'd need to execute some slim T-SQL, but I'm happy that niph found a performance issue I introduced to avoid pipeline issues with Remove-DbaDatabase.
Adding this in here as it was discussed in slack
Mostly we are looping through the databases and calling a dbatools command, so my idea of a quick fix is to write an internal function which gets the names of the dbs only and, where Get-DbaDatabase is used in a foreach, replace it.
Which would be a start in reducing the time
BUT
as @wsmelton says
The only problem with that is commands being called that iterate over that Database class in SMO... which Find-DbaDuplicateIndex does. Anything that touches $server.Databases takes a hit on systems with a high number of databases.
So while that will save some time in dbachecks...it will still hit performance problems in places where the database command in dbatools has to iterate over that Database class.
So I think we will find that we will also want to look at the places where dbatools commands use that class
So, having read all of your wise comments again and again, I'm getting a bit confused, so... I'll explain my favourite solution without necessarily referencing the individual ideas we have discussed (and numbered) previously.
But here is a disclaimer first: I'm not looking for a 'quick fix'. I don't mind putting in the hours if that means the checks are accurate and fast. Also, any use of SMO objects in this context adds, in my opinion, unnecessary delays and memory requirements.
Problem 1: How to get the data for the checks
My preferred approach is to get as much information as possible, as fast as possible, with as few round trips as possible. Ideally that means a T-SQL query that is executed once for all of the Database tests. Realistically I expect to have a few queries, some of which will perhaps still have to run per database, but a lot of tests can be covered with a single query, as I have demonstrated in https://github.com/michalporeba/dbachecks/blob/faster_db_checks/checks/Database.Tests.ps1.
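To give an idea of what "a single query feeding several checks" means, something like this covers collation, owner and suspect pages in one round trip (illustrative only, not the exact query in the branch; $instance is a placeholder):

```powershell
# One round trip that feeds the collation, owner and suspect-pages checks.
$databaseInfo = Invoke-DbaQuery -SqlInstance $instance -Query @"
SELECT  d.name,
        d.collation_name,
        SUSER_SNAME(d.owner_sid) AS owner,
        (SELECT COUNT(*)
         FROM   msdb.dbo.suspect_pages AS sp
         WHERE  sp.database_id = d.database_id) AS suspect_pages
FROM    sys.databases AS d;
"@
```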
I hear you @SQLDBAWithABeard when you say that the tests should be independent. Why would we pull all that information in if we want to run only a single test? But let's think about it. What are the use cases for running a single test, say AutoShrink or AutoClose? The way I see it, I would probably never use it. For me dbachecks is not a tool for spot checks. You could do that quite easily with dbatools. I want to know what the problems are, whatever they are, so I would probably run Invoke-DbcCheck -Check Database -Show Fails across all of my instances rather than a specific test.
In my testing against a large number of databases I find the performance benefit of such a caching approach significant, even when compared to T-SQL scripts for each test.
Problem 2: Support for a wide range of SQL Server versions
Using SMO helps with this goal. That's the only reason I'm not too happy about having to give it up, but performance must be the priority. Slow tests never get run; definitely not frequently enough. That is one of the reasons why I switched my immediate focus onto the tests for the checks. We need to be able to validate that the checks are not only faster, but that they still cover at least the 2008 - 2017 versions. Ideally 2005, perhaps 2000 (although personally I don't have access to any pre-2008 R2 instances). With good test coverage it should be fairly straightforward to make the switch to the more direct T-SQL approach with confidence.
That said, we will have to accept that there will probably be tests which we cannot perform efficiently with T-SQL on a specific version (DMVs are added with every version), and in those cases we should be smart enough (it is easy to check the version) to revert to SMO.
For me it would be very important to tag the tests which might be slow (a SlowDatabase tag), or those that we know are very fast (a FastDatabase tag), so when running against big instances I can choose whether I want to prioritise coverage or have the results in finite time.
Problem 3: Checks for the latest features
This is similar to Problem 2. New features and new recommendations are constantly being added. One might think of a test to check if Query Store is enabled or disabled depending on a preference, or if it is configured the way that is desired. How do you test for it on a 2008 R2 instance? Or even 2012?
I propose that in those cases we skip the test (using It -Skip of course) if the version of the instance is not high enough to support the feature. I think that's a fair approach. I'd rather have a test that can be applied only to a subset of my instances than not have that test at all.
If we accept that, it is not difficult to imagine that we can further make a dynamic distinction as to whether the test for a specific version is implemented in a 'fast' or 'slow' way. We could have a configuration option, or a parameter (which would be useful only for big instances), where we could say -FastTestsOnly, which would run the tests only if the target instance version allows for the quick version of the test.
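As a sketch of what the version gating could look like inside a check (the threshold, variable names and the -FastTestsOnly idea are illustrative, not existing dbachecks features):

```powershell
# Skip Query Store checks on instances that predate the feature
# (SQL Server 2016, major version 13).
Describe 'Query Store' {
    $server = Connect-DbaInstance -SqlInstance $instance
    $notSupported = $server.VersionMajor -lt 13

    It "Query Store should be configured as desired on $instance" -Skip:$notSupported {
        # the actual assertion for supported versions would go here
    }
}
```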
Problem 4: Beauty and simplicity
There is value in simplicity. There is value in beauty. But I don't think that should be the overall goal. I hear your concerns about ease of access. That is important, but you can always have a set of simple sample tests to help people get into it, and perhaps a well documented example of a sophisticated (I really don't want to call it complex) test. That sophisticated test would be aware of versions and could have faster or slower options. Statistically I'd imagine more people will be interested in using well-performing tests than in writing their own.
And of course there are always ways to hide complexity from people who are just curious what the tests do, and would just like to read them without necessarily seeing all the details.
Summary
Optimize for multi-check runs. Limit the number of calls and round trips. Support a wide range of versions without excessive fear of conditional logic. Accept that some tests will be faster than others. Allow a choice between time and coverage. It is more important to support new features than to have consistency between all of the versions. Complexity should be avoided if possible, but avoiding it shouldn't be the objective of the project. Complexity can and should be hidden as much as practical. Internal complexity can be OK as long as it can be explained and as long as there is still a simple entry point if somebody just wants to add a simple custom test.
Next Steps (as I see it)
I know it all sounds like a lot of work. But I'm very happy to do it. It obviously includes writing the necessary documentation, perhaps a blog post about writing simple and sophisticated tests. I've got big instances to test it on, so I'll be able to see first hand what is slow and what is not.
That is how I see it. At least the big picture and a proposed way to move, quickly but with the necessary consideration and scrutiny, to a more flexible and better performing solution.
TL;DR: Don't worry. Just let me do it, and we will figure out the details as and when we get there. PR at a time. (I should have put that at the top, shouldn't I?)
I like your approach and yes I agree totally that using SMO is MUCH better for different versions. Performance is the key issue here right now.
If you are happy to run with this I am very happy for you to do so. I agree with the direction that you are taking and it is a much better solution overall than my suggestion.
Please carry on doing this awesome work, we are delighted that you are here (note - you may now become our test performance guru :-) )
I'm delighted to be here and I'm happy to run with it. You can officially assign me to this issue if you want. I'd appreciate some input on the #329 as moving ahead with the testing of checks will help me start on this issue too.
For me dbachecks is not a tool for spot checks
I don't see that as the norm. If I've run the whole of dbachecks against my enterprise and I've gone through a maintenance window for many of the servers, the sole purpose would be to do a spot check to see if you've fixed the major issues. It would be based on how the data from running all checks is handled... because I would not run all the checks just to verify a few items (that is what spot checks would be for).
dbatools is a tool for people to use in which ever method that they wish.
dbachecks is a tool for spot checks, it is a tool for estate validation every night or every maintenance window, it is a tool for validating automated builds, it is a tool for consultants to quickly check an instance or an estate.
These are just the ones that we know about right now a couple of weeks in and I am sure there will be other use cases as people think about how they can do this :-)
But let's get back to the issue at hand. We need to start to move forward with an identified issue, which is the poor performance of dbachecks against instances with many databases.
Following the main points from @michal above, which I in general agree with, I would like us to begin moving this issue forward iteratively and in an agile way.
I propose that this can be done in this manner.
We can leave all of the database tests as they are, pick the one that is taking the longest, alter it to use the new Get-DbcDatabase command, make sure it does fix the issue, commit, test, release, and move on to the next one.
I would much rather do this and get feedback quickly than refactor in a massive way, requiring a lot of regression testing for a lot of tests.
Annoyingly I can access GitHub from the plane but not Slack :-(
We also have this user voice opened, please feel free to upvote - https://feedback.azure.com/forums/908035-sql-server/suggestions/33535612-smo-enumerations-slow-with-hundreds-of-databases
Branch created: https://github.com/sqlcollaborative/dbachecks/tree/just-get-database-name
It replaces Get-DbaDatabase with Connect-DbaInstance, where we just pass the database name to the dbatools command.
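In essence the replacement looks like this (a simplified sketch of the pattern, not the exact code in the branch):

```powershell
# Before: full SMO-backed database objects for every database.
$databases = Get-DbaDatabase -SqlInstance $instance

# After: only the names, which is all many checks need to pass on to the next
# dbatools command; enumerating .Databases.Name is a much lighter call.
$instanceSmo   = Connect-DbaInstance -SqlInstance $instance
$databaseNames = $instanceSmo.Databases.Name
```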
When I just get the database names the difference is significant over 10 runs against a local instance
0 Using Connect-DbaInstance - Total Milliseconds = 914.1239
0 Using Get-DBADatabase - Total Milliseconds = 23172.7354
1 Using Connect-DbaInstance - Total Milliseconds = 751.9027
1 Using Get-DBADatabase - Total Milliseconds = 23037.8592
2 Using Connect-DbaInstance - Total Milliseconds = 1116.4859
2 Using Get-DBADatabase - Total Milliseconds = 21625.6702
3 Using Connect-DbaInstance - Total Milliseconds = 716.4511
3 Using Get-DBADatabase - Total Milliseconds = 19886.4249
4 Using Connect-DbaInstance - Total Milliseconds = 724.1833
4 Using Get-DBADatabase - Total Milliseconds = 21875.134
5 Using Connect-DbaInstance - Total Milliseconds = 799.7293
5 Using Get-DBADatabase - Total Milliseconds = 21733.2845
6 Using Connect-DbaInstance - Total Milliseconds = 730.3497
6 Using Get-DBADatabase - Total Milliseconds = 22176.2373
7 Using Connect-DbaInstance - Total Milliseconds = 753.8491
7 Using Get-DBADatabase - Total Milliseconds = 21921.8575
8 Using Connect-DbaInstance - Total Milliseconds = 746.2548
8 Using Get-DBADatabase - Total Milliseconds = 20684.8815
9 Using Connect-DbaInstance - Total Milliseconds = 751.7676
9 Using Get-DBADatabase - Total Milliseconds = 20018.8875
10 Using Connect-DbaInstance - Total Milliseconds = 1054.8356
10 Using Get-DBADatabase - Total Milliseconds = 21756.195
When I test this branch by running Invoke-DbcCheck -Check Database -ExcludeCheck TestLastBackupVerifyOnly -Show Fails 10 times against 4 local instances, the results are not as significant but still worthwhile.
With Get-DbaDatabase and Show Fails - Total Milliseconds 75709.562
With Get-DbaDatabase and Show Fails - Total Milliseconds 74605.0582
With Get-DbaDatabase and Show Fails - Total Milliseconds 74134.0152
With Get-DbaDatabase and Show Fails - Total Milliseconds 75777.9501
With Get-DbaDatabase and Show Fails - Total Milliseconds 74232.0118
With Get-DbaDatabase and Show Fails - Total Milliseconds 75499.3978
With Get-DbaDatabase and Show Fails - Total Milliseconds 74991.8319
With Get-DbaDatabase and Show Fails - Total Milliseconds 74827.0276
With Get-DbaDatabase and Show Fails - Total Milliseconds 74993.3597
With Get-DbaDatabase and Show Fails - Total Milliseconds 75324.0725
With Get-DbaDatabase and Show Fails - Total Milliseconds 74496.0108

With Connect-DbaInstance and Show Fails - Total Milliseconds 53653.0855
With Connect-DbaInstance and Show Fails - Total Milliseconds 42332.6917
With Connect-DbaInstance and Show Fails - Total Milliseconds 45434.012
With Connect-DbaInstance and Show Fails - Total Milliseconds 42767.692
With Connect-DbaInstance and Show Fails - Total Milliseconds 45302.1833
With Connect-DbaInstance and Show Fails - Total Milliseconds 42945.9753
With Connect-DbaInstance and Show Fails - Total Milliseconds 44791.3775
With Connect-DbaInstance and Show Fails - Total Milliseconds 42330.9826
With Connect-DbaInstance and Show Fails - Total Milliseconds 45311.8129
With Connect-DbaInstance and Show Fails - Total Milliseconds 43390.6929
With Connect-DbaInstance and Show Fails - Total Milliseconds 43985.1166
So, measured with Measure-Command, the run time is reduced from roughly 75 seconds to roughly 45 seconds.
The reason for this is to improve performance quickly, and also because I have been asked, both by clients and other users, how to run just one or a couple of checks. Some people are building configurations for just a couple of checks, so that if an incident like Issue 54321 occurs they can Import-DbcConfig Thethingthatcaused54321test.json, which only has 3 checks in it, for quick resolution.
I have just noticed how much more testing you have done. I am trying the Connect-DbaInstance approach, but so far it has been running for close to 10 minutes with still no results, while my approach delivers test results for all the updated checks in under 100 seconds.
Also, the last comment might be on a wrong issue but it gave me an idea. I'll create a new issue for it.
EDIT: it has now finished. 12:55 (so 775 seconds)
To reproduce try:
Run a single database test against a remote instance, something with some network latency. Wait for what seems like forever.
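For example (the instance name and the check are placeholders):

```powershell
# Time a single database check against a remote instance to see the delay.
Measure-Command {
    Invoke-DbcCheck -SqlInstance 'remote-sql01' -Check DatabaseCollation -Show Fails
}
```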
Comments:
I think there might be two issues causing it. First of all, Get-DbaDatabase from dbatools appears to be inefficient: one call against an instance with only one user database results in 121 batch requests to the server. The second problem might be with how the test is structured: it appears the same Get-DbaDatabase code might be running multiple times for a single test. It might be that Get-DbaDatabase is not as efficient as it could be, or there is a reason for this complexity, but then perhaps it is too heavy to use in simple tests in dbachecks.