dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.02k stars 1.88k forks source link

Suggestions on CpuMath Enhancement #823

Open briancylui opened 6 years ago

briancylui commented 6 years ago

Listed below are CpuMath enhancement suggestions raised during PR reviews for SSE and AVX intrinsics but only documented for future follow-up.

The individual issue pages that expand each issue are https://github.com/dotnet/machinelearning/issues/824 and #827 - #836.

Style

Functionality

For some algorithms (like Sum), it is possible to “double-compute” a few elements in the beginning and end to have better overall performance. See the following pseudo-code:

if addr not aligned
              tmp = unaligned load from addr
              tmp &= mask which zero's elements after the first aligned address
              result = tmp
              move addr forward to the first aligned address 

while addr is aligned and remaining bits >= 128
              result += aligned load
              addr += 128-bits

if any remaining
              addr = endAddr - 128
              tmp = unaligned load from addr
              tmp &= mask which zero's elements already processed
              result += tmp

Sum the elements in result (using "horizontal add" or "shuffle and add")

So, your overall algorithm will probably look like:

if (Avx.IsSupported && (Length >= AvxLimit))
{
    // Process 256-bits, we have a limit since 256-bit 
    // AVX instructions can cause a downclock in the CPU
    // Algorithm would be similar to the SSE pseudo-code
}
else if (Sse.IsSupported && (Length >= SseLimit))
{
    // Pseudo-code algorithm given above

    // 128-bit instructions operate at full frequency
    // and don't downclock the CPU, we can only use
    // them for more than 128-bits so we don't AV
}
else
{
    // Software Implementation
}

If you can’t “double-compute” for some reason, then you generally do the “software” processing for the beginning (to become aligned) and end (to catch stray elements). • AvxLimit is generally a number that takes into account the “downclocking” that can occur for heavy 256-bit instruction usage • SseLimit is generally 128-bits for algorithms where you can “double-compute” and some profiled number for other algorithms

helloguo commented 6 years ago

Another possible optimization is to use Avx2.GatherVector256() to load values from sparse arrays.