aerospike / aerospike-client-csharp

Aerospike C# Client Library
70 stars 48 forks source link

Minimizing object allocations for less Heap 1 garbage collections and thus improving performance #34

Closed kuskmen closed 6 years ago

kuskmen commented 6 years ago

Notes: This is not still done for full framework, but first I want to see if it will be accepted between doing it there as well. Ran all the unit tests against local docker aerospike and all passed.(not sure if it is enough tho?) (Perhaps you have way of stress testing it? I would love to see how.) @BrianNichols , I guess you are Java Developer as I see the code :) , I can see a lot small changes that can be made in order to look like more .NET style client.

How Object Pool pattern works: Basically holds couple of default instances of some type so you can quickly re-use them.

Suggestions: If this pull request is accepted maybe next stop can be pooling sockets/connections (we will improve performance and our usage of bandwidth but this is for another pull request story.)

BrianNichols commented 6 years ago

I'm currently in the process of converting the .NET Core solution to Visual Studio 2017 and will not be accepting pull requests until that is complete.

Also, using the pool pattern does not always improve performance or minimize object allocation. For example, see this old code.

StringBuilder sb = new StringBuilder(100);
...
return sb.ToString();

The C# compiler is smart enough to recognize that sb is a local variable and will place sb's contents on the stack (no heap allocation). Replacing this code with a pool will actually perform worse than the original code. Also, your current pool implementation does not return the sb instance to the pool if an exception occurs between Allocate() and ReturnStringAndFree().

kuskmen commented 6 years ago

Everywhere in the Roslyn they are extensively using this pattern so I assume it is not something optional in case of performance and allocation (my opinion). What exception could happen between Allocate() and ReturnStringFree() ? Why would perform worse if you are only allocating needed number of times StringBuilder for the lifetime of the application? Could you show me where and how Roslyn recognize that SB is local and place it into the stack, this is actually interesting :) This is actually Just-In-Time Compilation that does this (first it needs to realize it then it compiles it) which is nice I didn't know about it, but I still think that Object Pool pattern is better approach.

khaf commented 6 years ago

@kuskmen Pools come with synchronization (locks), which in the case of highly concurrent systems are not desirable. We have already hit the limits of such patterns in our clients when benchmarking them on new CPUs with high core counts (64 and above).

Although I don't think StringBuilder is stack allocated (its contents are variable-size), I believe both the memory allocator and GC are smart enough to optimize it to insignificance.

kuskmen commented 6 years ago

But if this is true why C# compiler repo use this pattern extensively, I dont understand.. people working there must know the language at its finest :/

Doesnt matter, do you have some way to stress testing so I can make some benchmarks and supply some metrics if you dont I will try make some benchmarks with benchmarkdotNet but it will be good to share some scenarios, because I agree that suggesting some performance optimization without metrics is like suggesting nothing.

Regards, kuskmen

BrianNichols commented 6 years ago

I have benchmarked the StringBuilder code, old vs new (pooled). In single-threaded mode, the pooled method is indeed faster.

StringBuilder(threadCount, iterationsPerThread, old|new): ms

StringBuilder(1, 10000000, Old): 718 StringBuilder(1, 10000000, New): 577

With just 2 threads, the old method performs better.

StringBuilder(2, 10000000, Old): 936 StringBuilder(2, 10000000, New): 1357

The old method will outperform the pooled method for all thread counts > 1. In real world applications, the difference for any thread count is close to negligible since its takes 10 million iterations to produce a marginable difference and this code is rarely run in such a tight loop. Here is the benchmark code I used:

using System;
using System.Text;
using System.Threading;

namespace Test
{
    public class StringBuilderTest
    {
        public static void RunTests()
        {
            try
            {
                // Warmup caches.
                Benchmark(1, 1000, false, true);
                Benchmark(1, 1000, false, false);

                int threadCount = 2;
                int iterationsPerThread = 10000000;

                Benchmark(threadCount, iterationsPerThread, true, true);
                Benchmark(threadCount, iterationsPerThread, true, false);
            }
            catch (Exception e)
            {
                Console.WriteLine(e.ToString());
                Console.Write(e.StackTrace);
            }
        }

        public static void Benchmark(int threadCount, int iterationsPerThread, bool showResults, bool old)
        {
            BuilderThread[] threads = new BuilderThread[threadCount];

            for (int i = 0; i < threadCount; i++)
            {
                threads[i] = new BuilderThread(iterationsPerThread, old);
            }

            int begin = Environment.TickCount;

            foreach (BuilderThread thread in threads)
            {
                thread.Start();
            }

            foreach (BuilderThread thread in threads)
            {
                thread.Join();
            }

            int end = Environment.TickCount;
            int elapsed = end - begin;

            if (showResults)
            {
                string t = old ? "Old" : "New";
                System.Console.WriteLine("StringBuilder(" + threadCount + ", " + iterationsPerThread + ", " + t + "): " + elapsed);
            }
        }
    }

    public class BuilderThread
    {
        private static readonly StringBuilderPool _pool = new StringBuilderPool(() => new StringBuilder());

        private Thread thread;
        private int iterations;
        private bool old;

        public BuilderThread(int iterations, bool old)
        {
            this.iterations = iterations;
            this.old = old;
        }

        public void Start()
        {
            thread = new Thread(new ThreadStart(this.Run));
            thread.Start();
        }

        public void Join()
        {
            thread.Join();
            thread = null;
        }

        public void Run()
        {
            string ns = "test";
            string setName = "set";

            for (int i = 0; i < iterations; i++)
            {
                string s = old ? BuildStringOld(ns, setName) : BuildStringNew(ns, setName);

                // Keep optimizer honest by referencing the string.
                if (s.Length == 53)
                {
                    throw new Exception("S length is 53");
                }
            }
        }

        public string BuildStringOld(string ns, string setName)
        {
            StringBuilder sb = new StringBuilder(100);
            sb.Append("Begin");

            if (ns != null && ns.Length > 0)
            {
                sb.Append('.');
                sb.Append(ns);
            }

            if (setName != null && setName.Length > 0)
            {
                sb.Append('.');
                sb.Append(setName);
            }
            return sb.ToString();
        }

        public string BuildStringNew(string ns, string setName)
        {
            var sb = _pool.Allocate();
            sb.Append("Begin");

            if (!string.IsNullOrEmpty(ns))
            {
                sb.Append('.');
                sb.Append(ns);
            }

            if (!string.IsNullOrEmpty(setName))
            {
                sb.Append('.');
                sb.Append(setName);
            }
            return _pool.ReturnStringAndFree(sb);
        }
    }
}
kuskmen commented 6 years ago

Hmm nice, I will use your code to attach memory profiler as well and perhaps make the synchronizations and stress test it again, look I am not saying this is golden bullet but it can be used perhaps for other resource like sockets thing is I don’t know the domain yet very well but with stress testing and profiling I thing I can get to the point of doing something significant.

Thank you guys for being so responsive, I’ve learned a lot so far.

Regards, kuskmen