Closed: andreasohlund closed this issue 8 years ago
@dvdstelt is working on something that is related to this as well
I'd like to participate in this, and I suspect @ramonsmits might like to as well.
Count me in
Given the potential overlap with #2 how should we play this?
@andreasohlund so I assume you prefer to have 2 separate TF ideally without people overlap?
Or we start with either #2 or #3 to avoid overlap since both will build on the outbox tests?
Sounds good to me. Pick whichever is more important and start with it.
I'd vote for performance first?
If we agree can you sync with the TF of #2 to decide if they want to join here instead @WojcikMike ?
@WojcikMike
@gbiellem @johnsimons if you are interested and have time, please comment on #3
Sure count me in.
Would be willing to participate in this as I have created numerous stress tests for SQL, Rabbit and MSMQ.
So we have @WojcikMike @gbiellem @ramonsmits @WilliamBZA and potentially @dvdstelt interested in this one. Let's get it going. Who will take the lead? When can we have a preliminary POA ready?
@dannycohen can you chime in with the data you want to get out of this from a "this is why you should use v6" perspective?
A random ITestCase from the test suite, where each ITestCase defines the transport and persistence to use, as well as cleans up after itself and reports measurement data to the harness.

@dannycohen @andreasohlund can you please give us your thoughts on the first point in the meantime so we can understand the context of this task and make sure we move in the right direction?
Additional metrics:
Does it make sense to check CPU, Memory usage, thread count and IO utilization as well?
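To make the idea concrete, here is a minimal sketch of sampling CPU, memory, thread count, and IO during a run. It assumes a Unix-like host; the `Sampler` class and metric names are illustrative, not part of any existing harness.

```python
# Periodic environment sampling during a benchmark run (illustrative sketch).
import os
import resource
import threading
import time

def snapshot():
    """Capture a point-in-time view of process resource usage."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "timestamp": time.time(),
        "user_cpu_s": usage.ru_utime,
        "system_cpu_s": usage.ru_stime,
        "max_rss_kb": usage.ru_maxrss,      # peak resident set size
        "io_in_blocks": usage.ru_inblock,   # block input operations
        "io_out_blocks": usage.ru_oublock,  # block output operations
        "thread_count": threading.active_count(),
        "cpu_cores": os.cpu_count(),
    }

class Sampler:
    """Collect snapshots at a fixed interval on a background thread."""
    def __init__(self, interval_s=1.0):
        self.interval_s = interval_s
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append(snapshot())
            self._stop.wait(self.interval_s)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Attaching the recorded samples to each test result would let us correlate throughput dips with CPU or IO saturation after the fact.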
Each metric is always measured in the context of a profile. Auditing, for example, is part of that profile, and the same goes for things like encryption, but also environmental factors like the number of cores and CPU specs.
If we don't re-use the same machine between runs in Azure / AWS then this has an impact on results, as there can be CPU differences even when you choose the same instance type.
@andreasohlund I think definitely, yes. That was something @ramonsmits brought up while we were playing around with our tests. Knowing what the environment was, and how it was responding during the tests, would have allowed us to gain insights and make educated guesses about bottlenecks.
I was also considering 'environment' in the profiles. By that I mean Azure, AWS, AWS + reserved IOPS, and/or on-premise, as these behave incredibly differently in regards to IO.
The first step would be to have an environment to run the tests in a replicable manner. We should test if we get similar results when the tests are run on a new VM.
Also, how do we aggregate results so that we can see historical results? Seems to me we are probably not the first to solve this: run a benchmark as part of a build, store its results by pull request/branch/commit/version number, and be able to view these results, see trends, compare by profile, etc.
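The per-commit storage idea above can be sketched as follows. The schema (commit, branch, profile, metric, value) and the in-memory store are assumptions for illustration; a real implementation would persist to a database or a build artifact.

```python
# Minimal sketch of storing benchmark results per commit so trends can be
# queried later. Schema and store are hypothetical.
import json

class ResultStore:
    def __init__(self):
        self.rows = []  # in-memory; a real store would be a DB or a file

    def record(self, commit, branch, profile, metric, value):
        self.rows.append({
            "commit": commit, "branch": branch,
            "profile": profile, "metric": metric, "value": value,
        })

    def trend(self, profile, metric):
        """Return (commit, value) pairs for one metric of one profile."""
        return [(r["commit"], r["value"])
                for r in self.rows
                if r["profile"] == profile and r["metric"] == metric]

store = ResultStore()
store.record("abc123", "develop", "msmq-outbox", "msgs_per_s", 180.0)
store.record("def456", "develop", "msmq-outbox", "msgs_per_s", 205.0)
print(json.dumps(store.trend("msmq-outbox", "msgs_per_s")))
# prints [["abc123", 180.0], ["def456", 205.0]]
```

With something like this behind the build, "compare by profile" and "see trends" become simple queries over the stored rows.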
In the past we've used TeamCity custom stats https://confluence.jetbrains.com/display/TCD8/Custom+Chart but that was quite clunky to set up and work with so I suggest we look for other options if this is a must have
We can potentially use https://github.com/nunit/docs/wiki/TestCaseData to run all benchmarks as regular unit tests?
If we use NUnit then we must make sure that the tests are not run in parallel, as that would have a negative impact on results.
Would also be nice to maybe add a couple of 'native' tests to get some baselines on the performance of native Rabbit, MSMQ, ASB, and ASQ so we can compare how much is lost due to the middleware.
For example, when not using any of the reliability features, a basic Bus.Send(destination) would give us a nice indication of this.
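The comparison itself is just arithmetic. A back-of-the-envelope sketch, with made-up numbers purely for illustration:

```python
# How much throughput is lost to the middleware, given a native baseline.
def middleware_overhead(native_msgs_per_s, framework_msgs_per_s):
    """Fraction of native throughput lost when going through the framework."""
    return 1.0 - framework_msgs_per_s / native_msgs_per_s

# Hypothetical example: native RabbitMQ publish vs Bus.Send over RabbitMQ.
overhead = middleware_overhead(native_msgs_per_s=5000, framework_msgs_per_s=3500)
print(f"{overhead:.0%}")  # prints 30%
```

Tracking this single percentage per transport over time would show whether the middleware cost is growing or shrinking between releases.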
> We can potentially use https://github.com/nunit/docs/wiki/TestCaseData to run all benchmarks as regular unit tests?
Would this be suitable for long running tests like we plan in #2 or is that out of scope?
Also, should we measure how much concurrency and other settings affect the results? Or should each TestCase have its pre-measured and pre-configured expected best-case settings?
@ramonsmits Measuring perf without reliability features would be nice so we get better insight ourselves and can optimize from there. We should be careful with communicating this outside. We will make the tests public, but how about perf tests & results? Because people who lack performance in their own apps might think they need to drop transactions and the outbox and whatnot, for a little performance gain. This will hurt them in the long run.
So the question, are we doing perf tests for internal use, or are we going public with the numbers as well?
@WilliamBZA - Regarding "Identify requirement outcomes" (https://github.com/Particular/EndToEnd/issues/3#issuecomment-199233858) - here are my thoughts - potentially as an "output discussion starter" -
Looking at this issue from the perspective of attempting to define a V6 "Customer Value proposition" (https://github.com/Particular/PlatformDevelopment/issues/660) -
Thoughts ?
In order to test whether V6 has lower resource utilization, we should process at the same rate in both versions. This requires rate limiting on the sender and/or receiver.
As V6 dropped limiting it seems our only option is to rate limit sends that trigger the whole benchmark flow.
Now we are able to compare CPU, RAM, etc. utilization for the same sustained throughput.
For example: at 200 messages per second the RAM, CPU, and IO utilization is X for V5 and Y for V6.
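A sender-side limiter of the kind described could look like the simple token bucket below. This is a sketch, not the actual harness; the `send` callback stands in for the real Bus.Send, and the class name is invented.

```python
# Token-bucket rate limiter so both V5 and V6 process at the same sustained
# rate, making CPU/RAM/IO utilization comparable (illustrative sketch).
import time

class RateLimiter:
    def __init__(self, rate_per_s, burst=1):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

def send_at_rate(send, count, rate_per_s):
    """Drive `count` sends at a sustained `rate_per_s`."""
    limiter = RateLimiter(rate_per_s)
    for i in range(count):
        limiter.acquire()
        send(i)
```

For the 200 msgs/s example, `send_at_rate(bus_send, 12000, 200)` would sustain the rate for roughly a minute while the environment metrics are sampled.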
Out of scope?
- Timeouts
- Containers
- Distributor
- Gateway
Do we include the client side distribution? ( @SzymonPobiega @janovesk there seems to be an outstanding task to perf test that in https://github.com/Particular/PlatformDevelopment/issues/133 )
https://github.com/Particular/ProductionTests/tree/master/src/PerformanceTests
Things from this project that we potentially can use:
How are we going to visualize captured metrics? There is the native NUnit option, but that will not automatically collect metrics. Using Metrics.NET helps in capturing metrics but not in comparing them.
Maybe any of these can be used:
@andreasohlund We could benchmark the client-side distribution, but that would then compare against the V5 distributor. Isn't that comparing apples to oranges? I think we currently should focus on validating Assumption 1: dramatic throughput enhancements.
In order to remove randomization due to concurrency we should probably rely on generating deterministic message IDs; this value can then be used as the pseudo-randomization seed. This could even be combined with the FLR count.
@andreasohlund @ramonsmits yes, please include the client-side distribution (only for MSMQ) in the test matrix. I don't see a reason to run these tests separately from this effort.
As mentioned yesterday, we should test "publish" and "send" and such in isolation, but also test a single handler for multiple actions, so that we verify a realistic scenario.
And other variations we can think of. We had this with ServiceControl before with a large number of incoming messages, where retry of messages failed because it was busy with incoming messages.
@ramonsmits You think we should run tests using NUnit? I don't think it automatically runs tests in parallel, which would have a negative effect on performance. And we can't say anything about ordering of tests, especially if we add tests later.
The TransportCompat has tests set up for all transports and we might be able to use code from it as well.
Attendees: Igal, Dennis, Ramon, Michał, Danny, Andreas
We should agree on version of Raven, Rabbit, MS SQL Server, Oracle etc.
If we want multiple versions of SQL and Oracle I'll need to build some new agent images, or we host the DB servers remotely from the test machine, which is more real-world. Running several copies of SQL and Oracle in the same VM image isn't a good option.
Once the list is compiled let me know and I can provision agent images or standalone DB servers
RestBus.Benchmarks
Related issues: Particular/PlatformDevelopment/issues/690
@ramonsmits
What tools can be used to automate the creation of environments?
I looked into this a while ago when @andreasohlund and I were discussing setting up long-running tests. From that investigation I got pretty excited about AWS CloudFormation scripts. They allow you to build template designs of multi-machine environments, which you can then kick off one or more instances of via script or through the Amazon Web UI.
Basically you build a library of VM images, and the templates detail which machines to build from which images and also how they are networked together, etc. You can also nest templates. One of the templates I looked at built a complete Windows domain with multiple app servers and a failover SQL cluster.
Each permutation can be tested in isolation thus on a separate set of machines.
One other thing to consider is the limitations put in place by the cloud providers. For instance, by default Azure limits the number of running virtual CPUs per subscription to 20. Amazon has a similar limit based on IOPS. Depending on how many variations we want to run concurrently this may be a constraint.
We can ask for increases in the limits but we have to propose a figure, they won't just remove them.
We have made a sheet where we defined the categories and the number of variable values to see how many permutations we have.
(screenshot misses distributor + client side distribution categories)
Google sheet Benchmark permutations:
Combining all possible values for the alpha / beta release would result in 24,000 (!!) permutations. The defined categories result in just 211 benchmark permutations.
If each permutation took 1 minute then running all permutations would take about 3.5 hours excluding environment setup / teardown. This should not be an issue to run on a daily, weekly, or on-demand basis.
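The arithmetic behind the sheet is just a product over category cardinalities. The cardinalities below are hypothetical placeholders (the real counts live in the Google sheet); only the method mirrors the discussion.

```python
# Rough permutation arithmetic: total runs = product of per-category value
# counts; total runtime = permutations x minutes per permutation.
from math import prod

# Hypothetical cardinalities for illustration only:
categories = {
    "transport": 5,         # e.g. MSMQ, RabbitMQ, SQL, ASB, ASQ (assumed)
    "persistence": 4,
    "transaction_mode": 4,
    "environment": 3,
}

permutations = prod(categories.values())
minutes = permutations * 1  # assuming ~1 minute per permutation
print(permutations, f"~{minutes / 60:.1f} hours")  # prints 240 ~4.0 hours
```

Keeping this calculation in code next to the sheet makes it cheap to see how adding one more value to any category multiplies the total runtime.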
Based on the current variables it makes the most sense to run all benchmarks in Azure due to us supporting Azure Service Bus, Azure Storage Queues and indirectly also Azure SQL.
Running in Amazon EC2 makes sense later, especially when comparing to other benchmarks, as Amazon EC2 seems to be the de facto standard for benchmarking IaaS solutions.
> by default Azure limits the number of running Virtual CPU per subscription to 20
This can be increased through a request with support.
@SeanFeldman yep - that's what I meant when I said:
> We can ask for increases in the limits but we have to propose a figure, they won't just remove them.
> based on the current variables it makes the most sense to run all benchmarks in Azure due to us supporting Azure Service Bus, Azure Storage Queues and indirectly also Azure SQL.
Not sure I agree - there is nothing preventing us from using ASQ and ASB from Amazon, and the AWS tooling is better for multi-machine setups and teardowns.
If we are doing apples-to-apples comparisons then we are only interested in perf between V5 and V6, not the raw numbers. It would be interesting to see how much of an effect running the same test in Azure vs AWS has, given the huge pipes and low latency between these guys in the US.
@gbiellem Using ASB and ASQ from AWS would increase latency (assumption here..) compared to using them in the same data center, which would have a bad impact on throughput and/or average processing time.
Doesn't this highlight that we have to separate the "test administration" infrastructure from the actual running of the tests? (ie we should be able to run things in both AWS and Azure)
I.e. not run the tests on the build agents.
Taskforce: William (Lead) / Igal / Greg / Ramon / Dennis / Hadi @WilliamBZA @hmemcpy @gbiellem @ramonsmits @dvdstelt @HEskandari
Discussion: #tf-v6-perf-tests
We need to run end to end performance tests for v6 transports and persistence for all the supported transport transaction modes.
I have a hunch that we can reuse the code and infra from the outbox tests?
Plan of Attack
Action plan can be found here