Closed: andreasohlund closed this issue 8 years ago
@dvdstelt is working on something that is related to this as well
I'd like to participate in this, and I suspect @ramonsmits might like to as well.
Count me in
Given the potential overlap with #2 how should we play this?
@andreasohlund so I assume you prefer to have 2 separate TF ideally without people overlap?
Or we start with either #2 or #3 to avoid overlap since both will build on the outbox tests?
Sounds good to me. Pick whichever is more important and start with it.
I'd vote for performance first?
If we agree can you sync with the TF of #2 to decide if they want to join here instead @WojcikMike ?
@WojcikMike
@gbiellem @johnsimons if you are interested and have time, please comment on #3
Sure count me in.
Would be willing to participate in this as I have created numerous stress tests for SQL, Rabbit and MSMQ.
So we have @WojcikMike @gbiellem @ramonsmits @WilliamBZA and potentially @dvdstelt interested in this one. Let's get it going. Who will take the lead? When can we have a preliminary POA ready?
@dannycohen can you chime in with the data you want to get out of this from a "this is why you should use v6" perspective?
A random ITestCase from the test suite, where each ITestCase defines the transport and persistence to use, as well as cleans up after itself and reports measurement data to the harness.

@dannycohen @andreasohlund can you please give us your thoughts on the first point in the meantime so we can understand the context of this task and make sure we move in the right direction?
Additional metrics:
Does it make sense to check CPU, Memory usage, thread count and IO utilization as well?
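To make the idea concrete, here is a minimal sketch of sampling CPU, memory, thread count, and IO during a run. It assumes a Unix-like host; the `Sampler` class and metric names are illustrative, not part of any existing harness.

```python
# Periodic environment sampling during a benchmark run (illustrative sketch).
import os
import resource
import threading
import time

def snapshot():
    """Capture a point-in-time view of process resource usage."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "timestamp": time.time(),
        "user_cpu_s": usage.ru_utime,
        "system_cpu_s": usage.ru_stime,
        "max_rss_kb": usage.ru_maxrss,      # peak resident set size
        "io_in_blocks": usage.ru_inblock,   # block input operations
        "io_out_blocks": usage.ru_oublock,  # block output operations
        "thread_count": threading.active_count(),
        "cpu_cores": os.cpu_count(),
    }

class Sampler:
    """Collect snapshots at a fixed interval on a background thread."""
    def __init__(self, interval_s=1.0):
        self.interval_s = interval_s
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append(snapshot())
            self._stop.wait(self.interval_s)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Attaching the recorded samples to each test result would let us correlate throughput dips with CPU or IO saturation after the fact.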
Each metric is always measured in the context of a profile. Auditing, for example, is part of that profile, and the same goes for things like encryption, but also environmental factors like the number of cores and CPU specs.
If we don't re-use the same machine between runs in Azure / AWS then this has an impact on results, as there can be CPU differences even when you choose the same instance type.
@andreasohlund I think definitely, yes. That was something @ramonsmits brought up while we were playing around with our tests. Knowing what the environment was, and how it was responding during the tests, would have allowed us to gain insights and make educated guesses about bottlenecks.
I was also considering 'environment' in the profiles. By that I mean Azure, AWS, AWS + reserved IOPS, and/or on-premise, as these behave incredibly differently in regards to IO.
The first step would be to have an environment to run the tests in a replicable manner. We should test if we get similar results when the tests are run on a new VM.
Also, how do we aggregate results so that we can see historical results? Seems to me we are probably not the first to solve this: run a benchmark as part of a build, store its results by pull request/branch/commit/version number, and be able to view these results, see trends, compare by profile, etc.
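The per-commit storage idea above can be sketched as follows. The schema (commit, branch, profile, metric, value) and the in-memory store are assumptions for illustration; a real implementation would persist to a database or a build artifact.

```python
# Minimal sketch of storing benchmark results per commit so trends can be
# queried later. Schema and store are hypothetical.
import json

class ResultStore:
    def __init__(self):
        self.rows = []  # in-memory; a real store would be a DB or a file

    def record(self, commit, branch, profile, metric, value):
        self.rows.append({
            "commit": commit, "branch": branch,
            "profile": profile, "metric": metric, "value": value,
        })

    def trend(self, profile, metric):
        """Return (commit, value) pairs for one metric of one profile."""
        return [(r["commit"], r["value"])
                for r in self.rows
                if r["profile"] == profile and r["metric"] == metric]

store = ResultStore()
store.record("abc123", "develop", "msmq-outbox", "msgs_per_s", 180.0)
store.record("def456", "develop", "msmq-outbox", "msgs_per_s", 205.0)
print(json.dumps(store.trend("msmq-outbox", "msgs_per_s")))
# prints [["abc123", 180.0], ["def456", 205.0]]
```

With something like this behind the build, "compare by profile" and "see trends" become simple queries over the stored rows.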
In the past we've used TeamCity custom stats https://confluence.jetbrains.com/display/TCD8/Custom+Chart but that was quite clunky to set up and work with so I suggest we look for other options if this is a must have
We can potentially use https://github.com/nunit/docs/wiki/TestCaseData to run all benchmarks as regular unit tests?
If we use NUnit then we must make sure that the tests are not run in parallel, as that would have a negative impact on results.
Would also be nice to maybe add a couple of 'native' tests to get some baselines on the performance of native Rabbit, MSMQ, ASB, and ASQ so we can compare how much is lost due to the middleware.
For example, when not using any of the reliability features, a basic Bus.Send(destination) would give us a nice indication of this.
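The comparison itself is just arithmetic. A back-of-the-envelope sketch, with made-up numbers purely for illustration:

```python
# How much throughput is lost to the middleware, given a native baseline.
def middleware_overhead(native_msgs_per_s, framework_msgs_per_s):
    """Fraction of native throughput lost when going through the framework."""
    return 1.0 - framework_msgs_per_s / native_msgs_per_s

# Hypothetical example: native RabbitMQ publish vs Bus.Send over RabbitMQ.
overhead = middleware_overhead(native_msgs_per_s=5000, framework_msgs_per_s=3500)
print(f"{overhead:.0%}")  # prints 30%
```

Tracking this single percentage per transport over time would show whether the middleware cost is growing or shrinking between releases.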
> We can potentially use https://github.com/nunit/docs/wiki/TestCaseData to run all benchmarks as regular unit tests?
Would this be suitable for long running tests like we plan in #2 or is that out of scope?
Also, should we measure how much concurrency and other settings affect the results? Or should each TestCase have its pre-measured and pre-configured expected best-case settings?
@ramonsmits Measuring perf without reliability features would be nice so we get better insight ourselves and can optimize from there. We should be careful with communicating this outside. We will make the tests public, but how about perf tests & results? Because people who lack performance in their own apps might think they need to drop transactions and the outbox and whatnot, for a little performance gain. This will hurt them in the long run.
So the question, are we doing perf tests for internal use, or are we going public with the numbers as well?
@WilliamBZA - Regarding "Identify requirement outcomes" (https://github.com/Particular/EndToEnd/issues/3#issuecomment-199233858) - here are my thoughts - potentially as an "output discussion starter" -
Looking at this issue from the perspective of attempting to define a V6 "Customer Value proposition" (https://github.com/Particular/PlatformDevelopment/issues/660) -
Thoughts ?
In order to test whether V6 has lower resource utilization, we should process at the same rate in both versions. This requires rate limiting on the sender and/or receiver.
As V6 dropped limiting it seems our only option is to rate limit sends that trigger the whole benchmark flow.
Now we are able to compare CPU, RAM, etc. utilization for the same sustained throughput.
For example: at 200 messages per second the RAM, CPU, and IO utilization is X for V5 and Y for V6.
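A sender-side limiter of the kind described could look like the simple token bucket below. This is a sketch, not the actual harness; the `send` callback stands in for the real Bus.Send, and the class name is invented.

```python
# Token-bucket rate limiter so both V5 and V6 process at the same sustained
# rate, making CPU/RAM/IO utilization comparable (illustrative sketch).
import time

class RateLimiter:
    def __init__(self, rate_per_s, burst=1):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

def send_at_rate(send, count, rate_per_s):
    """Drive `count` sends at a sustained `rate_per_s`."""
    limiter = RateLimiter(rate_per_s)
    for i in range(count):
        limiter.acquire()
        send(i)
```

For the 200 msgs/s example, `send_at_rate(bus_send, 12000, 200)` would sustain the rate for roughly a minute while the environment metrics are sampled.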
Out of scope?
- Timeouts
- Containers
- Distributor
- Gateway
Do we include the client side distribution? ( @SzymonPobiega @janovesk there seems to be an outstanding task to perf test that in https://github.com/Particular/PlatformDevelopment/issues/133 )
https://github.com/Particular/ProductionTests/tree/master/src/PerformanceTests
Things from this project that we potentially can use:
How are we going to visualize captured metrics? There is the native NUnit option, but that will not automatically collect metrics. Using Metrics.NET helps in capturing metrics but not in comparing them.
Maybe any of these can be used:
@andreasohlund We could benchmark the client-side distribution, but that would then compare against the V5 distributor. Isn't that comparing apples to oranges? I think we currently should focus on validating Assumption 1: dramatic throughput enhancements.
In order to remove randomization due to concurrency we should probably rely on generating deterministic message IDs; this value can then be used as the pseudo-randomization seed. This could even be combined with the FLR count.
@andreasohlund @ramonsmits yes, please include the client-side distribution (only for MSMQ) in the test matrix. I don't see a reason to run these tests separately from this effort.
As mentioned yesterday, we should test "publish" and "send" and such in isolation, but also test a single handler for multiple actions, so that we verify a realistic scenario.
And other variations we can think of. We had this with ServiceControl before with a large number of incoming messages, where retry of messages failed because it was busy with incoming messages.
@ramonsmits You think we should run tests using NUnit? I don't think it automatically runs tests in parallel, which would have a negative effect on performance. And we can't say anything about ordering of tests, especially if we add tests later.
The TransportCompat has tests set up for all transports and we might be able to use code from it as well.
Attendees: Igal, Dennis, Ramon, Michał, Danny, Andreas
We should agree on version of Raven, Rabbit, MS SQL Server, Oracle etc.
If we want multiple versions of SQL and Oracle I'll need to build some new agent images, or we host the DB servers remotely from the test machine, which is more real-world. Running several copies of SQL and Oracle in the same VM image isn't a good option.
Once the list is compiled let me know and I can provision agent images or standalone DB servers
RestBus.Benchmarks
Related issues: Particular/PlatformDevelopment/issues/690
@ramonsmits
What tools can be used to automate the creation of environments?
I looked into this a while ago when @andreasohlund and I were discussing setting up long-running tests. From that investigation I got pretty excited about AWS CloudFormation scripts. They allow you to build template designs of multi-machine environments, which you can then kick off one or more instances of via script or through the Amazon Web UI.
Basically you build a library of VM images, and the templates detail which machines to build from which images and also how they are networked together, etc. You can also nest templates. One of the templates I looked at built a complete Windows domain with multiple app servers and a failover SQL cluster.
Each permutation can be tested in isolation thus on a separate set of machines.
One other thing to consider is the limitations put in place by the cloud providers. For instance, by default Azure limits the number of running virtual CPUs per subscription to 20. Amazon has a similar limit based on IOPS. Depending on how many variations we want to run concurrently this may be a constraint.
We can ask for increases in the limits but we have to propose a figure, they won't just remove them.
We have made a sheet where we defined the categories and the number of variable values to see how many permutations we have.
(screenshot misses distributor + client side distribution categories)
Google sheet Benchmark permutations:
Combining all possible values for the alpha / beta release would result in 24,000 (!!) permutations. The defined categories result in just 211 benchmark permutations.
If each permutation took 1 minute then running all permutations would take about 3.5 hours excluding environment setup / teardown. This should not be an issue to run on a daily, weekly, or on-demand basis.
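The arithmetic behind the sheet is just a product over category cardinalities. The cardinalities below are hypothetical placeholders (the real counts live in the Google sheet); only the method mirrors the discussion.

```python
# Rough permutation arithmetic: total runs = product of per-category value
# counts; total runtime = permutations x minutes per permutation.
from math import prod

# Hypothetical cardinalities for illustration only:
categories = {
    "transport": 5,         # e.g. MSMQ, RabbitMQ, SQL, ASB, ASQ (assumed)
    "persistence": 4,
    "transaction_mode": 4,
    "environment": 3,
}

permutations = prod(categories.values())
minutes = permutations * 1  # assuming ~1 minute per permutation
print(permutations, f"~{minutes / 60:.1f} hours")  # prints 240 ~4.0 hours
```

Keeping this calculation in code next to the sheet makes it cheap to see how adding one more value to any category multiplies the total runtime.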
Based on the current variables it makes the most sense to run all benchmarks in Azure due to us supporting Azure Service Bus, Azure Storage Queues and indirectly also Azure SQL.
Running in Amazon EC2 makes sense later, especially when comparing to other benchmarks, as Amazon EC2 seems to be the de facto standard for benchmarking IaaS solutions.
> by default Azure limits the number of running Virtual CPU per subscription to 20
This can be increased through a request with support.
@SeanFeldman yep - that's what I meant when I said:
> We can ask for increases in the limits but we have to propose a figure, they won't just remove them.
> based on the current variables it makes the most sense to run all benchmarks in Azure due to us supporting Azure Service Bus, Azure Storage Queues and indirectly also Azure SQL.
Not sure I agree - there is nothing preventing us from using ASQ and ASB from Amazon, and the AWS tooling is better for multi-machine setups and teardowns.
If we are doing apples-to-apples comparisons then we are only interested in perf between V5 and V6, not the raw numbers. It would be interesting to see how much of an effect running the same test in Azure vs AWS has, given the huge pipes and low latency between these guys in the US.
@gbiellem Using ASB and ASQ from AWS would increase latency (assumption here..) compared to using them in the same data center, which would have a bad impact on throughput and/or average processing time.
Doesn't this highlight that we have to separate the "test administration" infrastructure from the actual running of the tests? (ie we should be able to run things in both AWS and Azure)
I.e. not run the tests on the build agents.
Taskforce: William (Lead) / Igal / Greg / Ramon / Dennis / Hadi @WilliamBZA @hmemcpy @gbiellem @ramonsmits @dvdstelt @HEskandari
Discussion: #tf-v6-perf-tests
We need to run end to end performance tests for v6 transports and persistence for all the supported transport transaction modes.
I have a hunch that we can reuse the code and infra from the outbox tests?
Plan of Attack
Action plan can be found here