dotnet / BenchmarkDotNet

Powerful .NET library for benchmarking
https://benchmarkdotnet.org
MIT License

Suggestion: New BDN run mode that causes cache misses between iterations #553

Closed: damageboy closed this issue 4 years ago

damageboy commented 7 years ago

I got this "idea" from https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/

The basic premise of what @skarupke did is to get more realistic benchmark results by creating many thousands of identical test objects (dictionaries, in that case) and calling the benchmarked code on a different instance on every invocation.

In this way, every relevant cache (L1/L2/L3/TLB) exhibits its worst-case performance, because no cache locality is enforced. A nifty aspect of this approach is that it requires no privileged code to invalidate the caches; it simply overloads them.

I think this is an interesting concept to bring into the BDN framework, as it allows users to get a better sense of how their code will behave in the real world, where it is not run in a tight loop with everything comfortably fitting into the L1/L2/L3 caches.

As a framework feature, I think this would require a decent amount of work:

  1. Somehow track the total amount of memory allocated by the tested code, in order to aggregate it into "bytes allocated per tested object".
  2. Allocate as many objects as are required to saturate the L1/L2/L3/TLB caches, given the previously collected aggregation.
  3. Run each test iteration on a different instance, in sequence.
  4. Profit!
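The rotation idea in the steps above could be sketched roughly like this (a minimal illustration, not BDN code; the `Dictionary<int, int>` payload and all sizes are arbitrary assumptions):

```csharp
using System;
using System.Collections.Generic;

class CacheMissHarness
{
    static void Main()
    {
        // Step 2: allocate enough identical instances that their combined
        // footprint far exceeds a typical L3 cache (sizes are guesses here).
        const int instanceCount = 10_000;
        var dictionaries = new Dictionary<int, int>[instanceCount];
        for (int i = 0; i < instanceCount; i++)
        {
            var d = new Dictionary<int, int>();
            for (int k = 0; k < 100; k++) d[k] = k;
            dictionaries[i] = d;
        }

        // Step 3: each invocation hits a different instance, so its data
        // is unlikely to still be resident in any cache level.
        long sum = 0;
        for (int i = 0; i < instanceCount; i++)
        {
            sum += BenchmarkedLookup(dictionaries[i], i % 100);
        }
        Console.WriteLine(sum);
    }

    static int BenchmarkedLookup(Dictionary<int, int> d, int key) => d[key];
}
```
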
adamsitnik commented 7 years ago

@damageboy thanks for the idea, I really like it! The referenced blog post itself is really awesome!

@AndreyAkinshin what do you think?

I just wonder what would be the best way of invalidating the cache without any managed allocations (to keep the rest of our code, like MemoryDiagnoser, working).

btw, today it could be achieved by using the [IterationSetup] attribute (docs)

AndreyAkinshin commented 7 years ago

I also like this idea. I don't think we have to introduce another run strategy, but we can introduce another bool characteristic (like ColdCpuCache or InvalidateCpuCache). It doesn't make sense for throughput mode and microbenchmarks, but it can be very useful for ColdStart/Monitoring + macrobenchmarks.

And we don't need managed allocations for this feature: it's enough to preallocate an array (e.g. the same size as L3) and just "touch" all the elements several times (in fact, we don't need all the elements; it's enough to "touch" one byte from each 64B cache line).

Another tricky thing here is that each physical core typically has its own L1/L2 cache. So it's probably a good idea to create several threads with specified processor affinity, so that each thread invalidates its own core's caches.
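The "touch one byte per cache line" trick could look something like this (a sketch; the 64-byte line size and 8 MB L3 size are assumptions that a real implementation would detect at runtime):

```csharp
using System;

static class CacheEvictor
{
    // Assumed values; a real implementation would query the CPU.
    const int CacheLineSize = 64;
    const int L3SizeBytes = 8 * 1024 * 1024;

    // Preallocated once, so the eviction pass itself performs
    // no managed allocations.
    static readonly byte[] evictionBuffer = new byte[L3SizeBytes];

    public static void EvictCaches()
    {
        // Writing one byte per 64B line pulls every line of the buffer
        // into the cache hierarchy, displacing the benchmark's data.
        for (int pass = 0; pass < 2; pass++)
        {
            for (int i = 0; i < evictionBuffer.Length; i += CacheLineSize)
            {
                evictionBuffer[i]++;
            }
        }
    }
}
```
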

damageboy commented 7 years ago

@AndreyAkinshin I agree with the L1/L2/L3 ideas you presented, but one point worth mentioning is that your approach will not clear the TLB. That might be less important, but it's still worth mentioning.

For that, the only way is to allocate many pages and access them, much like the L3 idea you presented.
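Thrashing the TLB could be sketched in the same spirit (assuming 4 KB pages and an arbitrary page count; a real implementation would size this to the TLB actually being targeted):

```csharp
using System;

static class TlbEvictor
{
    const int PageSize = 4096;   // typical x86 page size (assumption)
    const int PageCount = 4096;  // enough pages to exceed a typical TLB

    // One large buffer spanning many pages, allocated once up front.
    static readonly byte[] pages = new byte[PageSize * PageCount];

    public static void EvictTlb()
    {
        // Touching one byte per page forces an address translation for
        // each page, pushing the benchmark's entries out of the TLB.
        for (int i = 0; i < pages.Length; i += PageSize)
        {
            pages[i]++;
        }
    }
}
```
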

@adamsitnik As for not doing ANY managed allocation: it is possible, but it will require some work.

Some cache behaviour control is limited to kernel mode/system calls.

For some cache control, x86 provides user-mode cache invalidation:

As for system calls:

redknightlois commented 7 years ago

@AndreyAkinshin It is important for micro-benchmarking too. Most of the new algorithms we are working on are cache-aware and vary their behavior over multiple iterations. So it still makes sense to monitor cold-start behavior even in micro-benchmark mode.

WojciechNagorski commented 5 years ago

I wanted to take on this task, but I have a problem.

The biggest problem is creating an appropriate test case. As @adamsitnik wrote:

btw, today it could be achieved by using the [IterationSetup] attribute

I read the documentation for [IterationSetup] and found this information:

It's not recommended to use this attribute in microbenchmarks because it can spoil the results. However, if you are writing a macrobenchmark (e.g. a benchmark which takes at least 100ms) and you want to prepare some data before each invocation, [IterationSetup] can be useful.

As @AndreyAkinshin added:

It doesn't make sense for throughput mode and microbenchmarks, but it can be very useful for ColdStart/Monitoring + macrobenchmarks.

The problem is creating the right test and comparing the results with and without cache clearing (or maybe I don't have enough knowledge).

My test class:

    [CsvMeasurementsExporter]
    [RPlotExporter]
    public class IntroColdCpuCache
    {
        [IterationSetup(Target = "WithoutCache")]
        public void IterationSetup()
        {
                //There I try to clear cache
        }

        private readonly int[] array0 = Enumerable.Range(1, 1024 * 1024).ToArray();

        [Benchmark]
        public int WithoutCache()
        {
            return ArraySum(array0); 
        }

        [Benchmark]
        public int WithCache()
        {
            return ArraySum(array0);
        }

        private static int ArraySum(int[] array)
        {
            int result = 0;
            for (int i = 0; i < array.Length; i++)
            {
                result += array[i];
            }

            return result;
        }
    }

If you want to see a difference between the results with and without the cache, and the algorithm should take at least 100ms, then it has to use a huge amount of data. I have a problem creating a test that meets these requirements.

If a test doesn't take 100ms, I shouldn't use [IterationSetup], and I shouldn't compare the results with and without the cache.

First solution

I create two threads per processor, and each one allocates and touches small arrays totaling roughly 6MB, repeated 10 times.

        [IterationSetup(Target = "WithoutCache")]
        public void IterationSetup()
        {
            var howManyProcessOnCore = 2;
            //Get the processor count of our machine.
            int cpuCount = Environment.ProcessorCount;

            //Get the our application's process.
            Process process = Process.GetCurrentProcess();

            //Since the application starts with a few threads, we have to record the offset.
            int offset = process.Threads.Count;
            Thread[] threads = new Thread[cpuCount * howManyProcessOnCore];

            //Create and start a number of threads.
            for (int i = 0; i < cpuCount * howManyProcessOnCore; ++i)
            {
                var t = new Thread(AllocateMemory) { IsBackground = true };
                t.Start();

                threads[i] = t;
            }

            // Refresh the process information in order to get the newest thread list.
            process.Refresh();

            //Set the affinity of newly created threads.
            var processThreads = process.Threads;
            for (int i = 0; i < cpuCount * howManyProcessOnCore; ++i)
            {
                if (processThreads.Count > i + offset) //check if process is still running.
                {
                    //distributes threads evenly on all processors.
                    process.Threads[i + offset].ProcessorAffinity = (IntPtr) (1L << (i % cpuCount));
                }
            }

            //Wait for all thread
            foreach (var thread in threads)
            {
                thread.Join();
            }
        }

        public void AllocateMemory()
        {
            var HowManyMB = 6;
            var HowManyPass = 10;
            var SizeOfOneArray = 64;
            for (int i = 0; i < (HowManyMB * 1024 * 1024 / SizeOfOneArray) * HowManyPass; i++)
            {
                var tmpArray = new int[SizeOfOneArray];

                for (int j = 0; j < SizeOfOneArray; j++)
                {
                    tmpArray[j] = i * j;
                }

                tmpArray[0] = 6;

                for (int j = 0; j < SizeOfOneArray; j++)
                {
                    Consumer(tmpArray[j]);
                }
            }
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        public void Consumer(int v)
        {

        }

I got these results (cumulative-mean plots for IntroColdCpuCache WithCache and WithoutCache):

I don't think it is a good solution 😄

Second solution

I use FlushInstructionCache:

        [DllImport("kernel32.dll")]
        static extern bool FlushInstructionCache(IntPtr hProcess, IntPtr lpBaseAddress,
            UIntPtr dwSize);

        [IterationSetup(Target = "WithoutCache")]
        public void IterationSetup()
        {
            Process process = Process.GetCurrentProcess();
            FlushInstructionCache(process.Handle, IntPtr.Zero, UIntPtr.Zero);
        }

I got these results (cumulative-mean plots for IntroColdCpuCache WithCache and WithoutCache):

It looks strange too. I can't understand why I got different results for WithCache in the first and second solutions.

If I had the right solution, I could implement this issue and make a PR.

AndreyAkinshin commented 5 years ago

I like the idea. Let's continue this discussion in a PR. A few recommendations:

  1. Instead of the cumulative mean plots, it's better to look at the summary tables with statistical tests enabled.
  2. IntroColdCpuCache should be executed on a fixed core (Job has an Environment.Affinity property).
  3. We can introduce an enum CacheClearingStrategy with values like None, Allocations, and Native. In this case, we can compare both approaches.
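The proposed characteristic might look roughly like this (a sketch; the namespace, exact member names, and placement in BDN's configuration API are all assumptions):

```csharp
namespace BenchmarkDotNet.Engines
{
    // Hypothetical characteristic controlling how CPU caches are
    // cleared between iterations, as proposed in this thread.
    public enum CacheClearingStrategy
    {
        None,        // current behavior: do nothing between iterations
        Allocations, // evict caches by touching a large managed buffer
        Native       // use an OS/native call such as FlushInstructionCache
    }
}
```
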

When you create a PR, I will try to improve IntroColdCpuCache.

WojciechNagorski commented 5 years ago

Great! I'll make a PR next week, but I have two questions:

  1. For now, should I use [IterationSetup], or should I move this code into BDN? Maybe here, after IterationSetupAction has run: https://github.com/dotnet/BenchmarkDotNet/blob/04a71586206a822bca56f0abdacefdc2e5fc1b01/src/BenchmarkDotNet/Engines/Engine.cs#L138-L139

  2. Should I create console parameters for CacheClearingStrategy, HowManyMB, ...?

AndreyAkinshin commented 5 years ago
  1. Let's include it in Engine.RunIteration.
  2. I guess we can start without console parameters (we don't actually need them for all the options that we have).
adamsitnik commented 4 years ago

The experiments run by @WojciechNagorski in #971 and #1058 have shown that it would not add too much value to BDN, so I am closing this issue. Thanks!