OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License
6.36k stars 1.49k forks source link

Optimize rpcc function in ARMv8 arch #2281

Closed MacChen01 closed 5 years ago

MacChen01 commented 5 years ago

Enviroment: ARMv8, CentOS 7.6, spark 2.3.0, OpenBLAS-0.3.7

Using OpenBLAS-0.3.7 lib in Spark's kmeans test case, the CPU usage of the __kernel_clock_gettime function exceeds approximately 5% . rpcc

MacChen01 commented 5 years ago

By replacing the __kernel_clock_gettime function with the value of the armv8 register cntvct_el0, the hotspot function did not appear, and improved performance. Patch file: 0001-armv8-rpcc-optimize.patch.txt

brada4 commented 5 years ago

Could you make it into a PR so that it goes through CI roll? I am wondering whats the effect on iOS, latest droid, BSD and Windows/ARM (latest was never seen in real world, others are being used sometimes. Could you check with brand new CentOS8 and/or elrepo kernel - maybe problem is just in old versions? Sorry to burden you, just trying to (ab-)use situation while your keyboard stays warm.

brada4 commented 5 years ago

Also vdso=off and altering current_clocksource, then it is kernel regression that we can work around. In either case your patch looks sane, if you dont do anything we will copy it to a PR later.

martin-frbg commented 5 years ago

@MacChen01 interesting find, thanks. I have created pull request #2282 from it to give it a spin, but I only just committed the Travis change to add an IOS build job and have not read up on if/how if is possible to run tests in an IOS emulator that way. @brada4 even if it turns out to be a problem with somewhat older kernels, I guess there should be no drawbacks to reading the register directly (where that is supported) so no need for burdening the contributor with trying out other RH kernels or parameters.

brada4 commented 5 years ago

if taking out vdso or flipping clocksource does any good it should go to people writing linux kernel config files, though as long as proposed workaround works nicely it is indifferent...

martin-frbg commented 5 years ago

From what I could find out so far, the direct register read is probably not supported on IOS (https://github.com/yse/easy_profiler/issues/138 ), and there appears to be a hardware bug in CortexA73 cpus that makes it necessary to do the read twice (https://www.spinics.net/lists/stable/msg186807.html)

brada4 commented 5 years ago

vdso sort of eliminates real syscalls, given it turns up so heavy - it could be worth

martin-frbg commented 5 years ago

Err, yes, I am just noting findings here while I am busy with other work.

MacChen01 commented 5 years ago

@martin-frbg

  1. Not supported on IOS problem: adding the compile adaptation https://bitbucket.org/berkeleylab/gasnet/pull-requests/95/release-bupc-2242/diff
  2. Read twice problem: adding ISB instruction can solve it? https://github.com/AnYpku/cmssw10x/blob/2e4c8c64bfc4a47bcda01676733d4fbabd1623d7/FWCore/Utilities/interface/HRRealTime.h
  3. Testcase ,like this ? @brada4 https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/profile_utils/cpu_utils_test.cc

More and more PRs take it , tensorflow、GASNet...

brada4 commented 5 years ago

Cross-compilation should work, expensive tests have no place in released library. I ask to test on the device, not to extract problem code - saying spark kmeans already did it.

What I asked - if CPU usage by VDSO 1/ disappears if it is disabed 2/disappears in more recent Linux kernel And - if it is possible to add -g switch to perf to see where from __kernel_clock_gettime is called.

martin-frbg commented 5 years ago

@MacChen01 what is the "compile adaption" in that big diff ? If it is just the "see if we can access that register at all", it is simpler to use #ifndef __IOS__. And as I understand it, the ISB addresses a more general multithreading scenario, (so may be necessary,) but does nothing for the particular A73 hardware erratum where (it seems) any read may momentarily return an overflowed value. I have no idea how prevalent the affected version of A73 is, but it may be safer to #ifndef that target as well and rely on the hope that it runs a patched kernel and libc. (Though that would still leave builds for compatible targets vulnerable if they execute on A73). Or one needs to redo the benchmarks for the "ISB plus double read workaround" and see if there is still any performance improvement. What is your hardware ?

brada4 commented 5 years ago

HW must be tsv110, what I am into is to first see if it is kernel/vdso version specific problem (i.e reading timer with syscalls turns up same or faster than vdso optimisation), and if there is something avoidable in whole call chain to reduce/replace those calls. It seems a bit overkill to get 5% CPU spent on reading timers in code whose primary function is to compute, not to time computation.

martin-frbg commented 5 years ago

The tsv110 is obvious from the screenshot, I'm more interested in a rough idea of the overall size of the system - something big like the new Hi1620/Kunpeng 920 ? (My RasPi4 is too constrained by both memory and cooling to be useful for benchmarking in the 5% range - I think I see a small speedup with the original patch)

MacChen01 commented 5 years ago

@martin-frbg My device is Kunpeng 920.

martin-frbg commented 5 years ago

Applied as #2282 (but disabled on IOS and Android as it appears to be a privileged instruction on at least the former)