OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Denormal floats handling #1237

Open Arech opened 7 years ago

Arech commented 7 years ago

Dear @xianyi and everyone!

Is there a way to forbid denormals in OpenBLAS?

I tried to execute the following code (MSVC2015) before calling any OpenBLAS routines:

    unsigned int current_word = 0;
    _controlfp_s(&current_word, _DN_FLUSH, _MCW_DN);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

and it looked like it worked for some time (about a year)... Does it really affect how OpenBLAS handles denormals (especially inside worker threads)?

I started to use some additional OpenBLAS functions and noticed that denormals have started to reappear in the results, which slows down computations dramatically. I strongly suspect that OpenBLAS is responsible for them. Perhaps the worker threads still have denormals enabled?

How to get rid of them?

I'm using precompiled OpenBLAS-v0.2.19-Win64-int32.

Here is the complete list of used functions (if it helps):

UPD: indeed, cblas_ssyrk() is the first function that produces a denormalized float during execution... :-(

How can I change the denormals behaviour?

Maybe there is an (externally reachable) way to execute custom code in the context of OpenBLAS's worker threads, if there is no predefined way to control denormals?

brada4 commented 7 years ago

Care to mention processor ID? OpenBLAS is not compiled with MSVC, and MSVC macros are unlikely to have any influence whatsoever on an imported DLL file. You may use the CONSISTENT_FPCSR=1 build option to get consistent float handling between SSE and AVX.

Arech commented 7 years ago

Care to mention processor ID?

AMD Phenom II X6 1090, if it matters... It doesn't have AVX, only SSE. OS: Windows 7 SP1.

OpenBLAS is not compiled with MSVC, and MSVC macros are unlikely to have any influence whatsoever on an imported DLL file.

These functions and macros change processor state (bits in the processor's control registers), so they do affect how all subsequent code (including code in an imported DLL) is executed by the processor, don't they? I don't remember exactly whether Windows maintains separate copies of these control registers for different threads. In any case, code running on the same thread is affected regardless of which compiler produced it.

brada4 commented 7 years ago

I asked you if passing CONSISTENT_FPCSR compile option to OpenBLAS solves your problem.

Can you provide sample code that reproduces your problem?

Arech commented 7 years ago

I asked you if passing CONSISTENT_FPCSR compile option to OpenBLAS solves your problem.

I'm going to try it; however, it will take quite some time to set up a build environment for OpenBLAS.

Can you provide sample code that reproduces your problem?

The problem occurs during fairly complex neural network training using my nntl project. I'll try to isolate the source of the problem and turn it into a sample, stripping out as much unrelated code as possible.

What I have learned so far:

  1. The first denormals occur after a call to cblas_ssyrk(), which computes the symmetric matrix C = 1/ARowsCnt * A' A. Matrix A has a size of about 150x70 and may contain some columns with values of very small magnitude (still normalized numbers).

  2. After the first denormals appear in the matrix C described in the previous step, they start spreading through the other numeric data that depends on C, despite the fact that I explicitly forbid their use by changing the processor's control registers. This greatly affects performance: one training epoch takes about 22-23 seconds when there are no denormals, but once they appear, the time required to process a single epoch quickly rises to 50-80+ seconds, up to 200-210+ seconds and even more. So I get about a 10x slowdown with denormals, although the code would work perfectly fine (and fast!) if denormals were simply flushed to zero.

  3. If I accompany every call to OpenBLAS with a subsequent call to a function that explicitly disables denormals (like the one posted in my first post), then denormals do not spread through the numeric data nearly as much, and the worst epoch time I see is under 40-42s. That is quite good compared to 210s, but still about 2x more than it should be.

UPD: point 3 is incorrect, and OpenBLAS probably does NOT change how denormals are handled. I forgot that at the same time I added the accompanying calls around the OpenBLAS routines, I also added the same denormals-disabling code to my worker thread pool's startup routine (I had forgotten to do that earlier) - that is what actually helped control how denormals spread through the numeric data. My worker threads had been running with denormals enabled, even though the main thread was not. My bad.

All of this led me to the conclusion that OpenBLAS completely ignores the user's intentions regarding denormals and (always?) leaves them enabled, slowing down computations.

brada4 commented 7 years ago

The fastest build environment is to install some Linux in a virtual machine and cross-compile a DYNAMIC_ARCH=1 DLL.

You mentioned the problem appeared about a year ago - can you attribute it to some new OpenBLAS release, or did something change with your computers?

The Wikipedia page on subnormal floats lists some methods to avoid them - scaling, log+exp, increased precision, among others. A C sample would be very good, to see/understand if/where OpenBLAS (or BLAS in general) has some computation that tragically downscales interim values. Alternatively, you could show us an input to nntl that demonstrates the issue.

You can try lower-brain cores (see https://github.com/xianyi/OpenBLAS/tree/develop/kernel/x86_64 for the full list) via OPENBLAS_CORETYPE, to see if any of them makes your software act differently.

Arech commented 7 years ago

The fastest build environment is to install some Linux in a virtual machine and cross-compile a DYNAMIC_ARCH=1 DLL.

Thanks, I'll try this method.

You mentioned the problem appeared about a year ago - can you attribute it to some new OpenBLAS release, or did something change with your computers?

There were no changes in the computer hardware, and I don't think the problem is related to an OpenBLAS release change. I implemented some new algorithms that use functions which weren't used before: cblas_ssyrk() and cblas_ssymm(). It now looks like, in some special cases (which I've just stumbled onto), these algorithms produce very small numbers, and OpenBLAS simply tries to preserve computation precision and uses denormals...

The Wikipedia page on subnormal floats lists some methods to avoid them - scaling, log+exp, increased precision, among others.

None of this is an option. In a nutshell, neural network training is based on gradient descent, which works quite reliably even with low-precision arithmetic (much lower than 32 bits - it has been shown that even 8 bits of floating-point precision can be enough). This task needs performance far more than computational precision (and the same applies to the newly implemented algorithms from the previous paragraph), so simply dropping denormals is the best approach.

A C sample would be very good, to see/understand if/where OpenBLAS (or BLAS in general) has some computation that tragically downscales interim values. Alternatively, you could show us an input to nntl that demonstrates the issue.

I'll try to provide it, but I need some time...

You can try lower-brain cores...

Well... Even if these cores don't mess with the processor's control registers, they won't be as fast as the core best suited to my processor. So it might be trading bad for worse, am I right?

martin-frbg commented 7 years ago

Possibly (cross-)compiling OpenBLAS with either the gcc flag "-ffast-math" or the VS flag /fp:fast will force denormals to zero. See https://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x for some suggestions

Arech commented 7 years ago

@martin-frbg Thanks, I'll try to recompile OpenBLAS with this flag.

BTW, I don't know about gcc, but in VC this flag does nothing about denormals. Explicit code is still required to change the processor state.

brada4 commented 7 years ago

The FPCSR option (supposedly; maybe not always right) does exactly that on each thread. If you find an option that fixes your problem, it would be a nice proposal for the defaults used to build the binary DLL.

(edit) OpenBLAS worker threads are initialized before (win)main; for obvious reasons an FPCSR change in one thread does nothing to the others.

(edit2) BLAS pre-dates the IEEE FP spec, which means it is not fantastic or consistent regarding NaNs, denormals and other anomalies. Out-of-order processing also introduces slight accuracy changes (for a quick example on rounding: take the sum of 1000 random values, then sort the values one way and the other and compute the sums again).

martin-frbg commented 7 years ago

Could be a mingw issue with control word settings not carrying over to threads: https://github.com/numpy/numpy/issues/5194 (though I only glanced through that). If the mingw build does not assume -mfpmath=sse by default, it may be using the legacy FPU in 80-bit "extended precision" mode.

brada4 commented 7 years ago

The CSR is inherited (or not) well before the MSVC EXE can interfere. A measurement with dlopen() would be nice, to tell whether it is inherited or not (and whether it makes any difference with the test cases).

Arech commented 7 years ago

Here is a very simple code how to get denormals with cblas_ssyrk() using a normal source data: https://github.com/Arech/tmp/blob/master/DenormalsDemo/DenormalsDemo/DenormalsDemo.cpp

One may checkout the whole repository and try it live.

I need a way to instruct OpenBLAS to flush all denormals to zero...

BTW: this sample code fails to reproduce my previous claim that OpenBLAS changes the way denormals are handled by the processor. It just demonstrates how they appear in the data.

It is a perfectly real-world example: there is a quite useful algorithm for reducing cross-covariance between neuron activations in a neural network which (obviously) requires computing a covariance matrix. This is done by first computing de-meaned activation values (subtracting its mean activation from each neuron) and then computing the covariance matrix C = 1/rowsCnt * A' A, where A is the de-meaned matrix of neuron activations. Either the activation values can be very small, or almost all of them can be near their mean. In either case, the de-meaned matrix A will be full of very small but still normalized FP values, which will induce denormalized floats in matrix C.

Still investigating what is responsible for changing the denormals handling settings in the original code...

brada4 commented 7 years ago

You should scale the matrix to better-behaved numbers so that you do not hit the extremes (syrk has a parameter for that). Your input value range is quite limited.

Arech commented 7 years ago

This task does not require a precise answer for very small values; it requires a fast answer (it's fine if it is very approximate for small values). There is absolutely no point in wasting precious cycles scaling data that could perfectly well be automatically set to zero.

brada4 commented 7 years ago

Try 2 things:

  1. whether the FPCSR build works right;
  2. whether LoadLibrary with the default build inherits your FPCSR in threads (maybe not, because of the mingw issue mentioned earlier).

There is no instrumentation to pass the FP CSR to threads, nor is it automatic.

_SCAL does not take extra CPU cycles - it is memory-bound; you waste 100x more cycles by inducing denormals.

Arech commented 7 years ago

Whether LoadLibrary with the default build inherits your FPCSR in threads (maybe not, because of the mingw issue mentioned earlier)

No change from the base version. It still produces denormals.

There is no instrumentation to pass the FP CSR to threads, nor is it automatic.

I don't get your point. Could you please elaborate?

_SCAL does not take extra CPU cycles - it is memory-bound; you waste 100x more cycles by inducing denormals.

Same as the previous point - could you elaborate?

brada4 commented 7 years ago

The MinGW defect applies to you too. Try the rebuild option.

Arech commented 7 years ago

Still playing with the sample code. An update:

I've noticed that the run-time denormals settings seem to affect at least some part of the output matrix C. When denormals are disabled (#define DISABLE_DENORMALS 1 at line 20), the first denormal appears only at about position 288-289 of the C matrix, and most of the values of C are exact zeroes.

However, if denormals are enabled (#define DISABLE_DENORMALS 0 at line 20), the first denormal appears very early (i=1 or 2) and most of the values of C are indeed denormals or very small normals.

It seems that FpCsr & MxCsr are thread-local registers, and therefore every good thread-pool implementation must provide a way to apply the same denormals setting to every worker thread, or a way to execute custom code in each worker thread's context.

BTW: why are none of the OpenBLAS thread pool functions exported? If any of exec_blas() / exec_blas_async() or gotoblas_pthread() were exported, the problem would be solved easily... Not to mention that other users of OpenBLAS could use them for their own code (for example, I had to write my own thread pool for nntl, and so the app keeps twice as many threads (including OpenBLAS's) as is really necessary).

Arech commented 7 years ago

Important update: my claim that OpenBLAS might be changing how denormals are handled is probably false. I have updated the corresponding comment.

Probably the only thing that needs to be done is to make denormals handling coherent across threads. I wonder why Intel says that for MKL it is enough to change the MXCSR in the caller thread only... Do they pass the relevant parts of the caller's MXCSR to the worker threads on each call?

brada4 commented 7 years ago

Can you verify whether what is done with the CSR is correct AND whether it addresses your issue?

The MKL licence sort of prohibits reverse engineering, but from the description it looks like they scan the input for denormals (wasting CPU cycles) and use the CSR as an input to select the right routine; on finding one denormal they either zero it or waste even more cycles in a "software routine".

Arech commented 7 years ago

Can you verify whether what is done with the CSR is correct AND whether it addresses your issue?

I'm sorry, I don't understand the question. English is a foreign language for me. Could you please rephrase what you want me to verify?

brada4 commented 7 years ago

Can you make your new DLL as follows (on a Linux VM):

make -s DYNAMIC_ARCH=1 CONSISTENT_FPCSR=1 \ 
   CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran \
   HOSTCC=gcc NUM_THREADS=64

And test with that DLL...

It is the only option that actually enables OpenBLAS to tackle the FPU configuration. Indeed, IF adding this build option solves your problem, it hints that it would be a better default for the binary builds.

Do we understand each other now?

Arech commented 7 years ago

I think so, yes. I'm setting up Debian 9 in a VM now... I'll write more once I have new info...

brada4 commented 7 years ago

I can help if you get stuck trying.

Arech commented 7 years ago

Yep, CONSISTENT_FPCSR=1 was indeed helpful. Now the test passes and no denormals occur during NN training. The training time has also stabilized greatly.

Thank you!

Now I'll try to make an OpenBLAS build with the smallest possible run-time overhead and see if it helps drive the training time down a bit more... But that's another story, not related to this ticket.

Regarding making CONSISTENT_FPCSR=1 the default for builds... CONSISTENT_FPCSR=1 enables code that checks the state of FpCsr & MxCsr on each call and passes the desired settings to the worker threads, am I right? If so, I would prefer as little run-time overhead as possible. Passing the desired setting to OpenBLAS via, for example, an environment variable would be totally fine, as would a special mode-switching function similar to openblas_set_num_threads().

I think it is a very rare case that a user within a single app needs to perform computations with and without denormals almost in parallel (so that it would be hard/very inconvenient to switch modes manually by calling a special function) - only such users really need CONSISTENT_FPCSR=1. For the majority of others, CONSISTENT_FPCSR=0 plus some means to set the denormals mode is preferable, imho.

brada4 commented 7 years ago

It just sets the FPCSRs on the threads once when they are started, unknowingly working around the MinGW problems; see driver/others/blas_server*.c for the code under the respective ifdef. I think it is still missing for the main and single-threaded cases, and to be completely formally correct you need to copy that ifdef section(s) into your main(); it will also fix NetLIB BLAS/LAPACK if you ever use them.

@xianyi what do you think about making this the least-surprise default for the 0.2.20 binary DLL build?

Arech commented 7 years ago

Mmm... I got totally stuck modifying Andrew's @brada4 command line to suit some other needs... I wrote a post about it in the corresponding Google Group, but the group seems slightly... "unliving", so I decided to ask here whether somebody could help me with that issue.

Here is my problem description. Thanks in advance!

martin-frbg commented 7 years ago

You did run make clean between your various build attempts, I assume?

Arech commented 7 years ago

@martin-frbg, hmm...)))) Actually, I did use it sometimes, and it so happened that it didn't help - probably because I had used some invalid combination of flags. Thank you for the suggestion; I didn't know it was mandatory.

However, the problem still persists.

I've just executed

    make clean
    make CONSISTENT_FPCSR=1 CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran HOSTCC=gcc NUM_THREADS=64 TARGET=BARCELONA

and still get the same error:

...
make[1]: Leaving directory '/home/user/OpenBLAS_src/test'
make -j 4 -C utest all
make[1]: Entering directory '/home/user/OpenBLAS_src/utest'
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DUTEST_CHECK -DSANITY_CHECK -DREFNAME=utest_mainf_ -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=64 -DASMNAME=utest_main -DASMFNAME=utest_main_ -DNAME=utest_main_ -DCNAME=utest_main -DCHAR_NAME=\"utest_main_\" -DCHAR_CNAME=\"utest_main\" -DNO_AFFINITY -I..   -c -o utest_main.o utest_main.c
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DUTEST_CHECK -DSANITY_CHECK -DREFNAME=test_amaxf_ -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=64 -DASMNAME=test_amax -DASMFNAME=test_amax_ -DNAME=test_amax_ -DCNAME=test_amax -DCHAR_NAME=\"test_amax_\" -DCHAR_CNAME=\"test_amax\" -DNO_AFFINITY -I..   -c -o test_amax.o test_amax.c
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DUTEST_CHECK -DSANITY_CHECK -DREFNAME=test_potrsf_ -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=64 -DASMNAME=test_potrs -DASMFNAME=test_potrs_ -DNAME=test_potrs_ -DCNAME=test_potrs -DCHAR_NAME=\"test_potrs_\" -DCHAR_CNAME=\"test_potrs\" -DNO_AFFINITY -I..   -c -o test_potrs.o test_potrs.c
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DUTEST_CHECK -DSANITY_CHECK -DREFNAME=f_ -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=64 -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I.. -o openblas_utest utest_main.o test_amax.o test_potrs.o ../libopenblas_barcelonap-r0.2.19.a -defaultlib:advapi32 -lgfortran -defaultlib:advapi32 -lgfortran -L/usr/lib/gcc/x86_64-w64-mingw32/6.3-win32 -L/usr/lib/gcc/x86_64-w64-mingw32/6.3-win32/../../../../x86_64-w64-mingw32/lib  -lgfortran -lmingw32 -lmoldname -lmingwex -lmsvcrt -lquadmath -lm -lmingw32 -lmoldname -lmingwex -lmsvcrt -lmingw32 -lmoldname -lmingwex -lmsvcrt
../libopenblas_barcelonap-r0.2.19.a(cpotrf_U_single.obj):potrf_U_single.c:(.rdata$.refptr.gotoblas[.refptr.gotoblas]+0x0): undefined reference to `gotoblas'
collect2: error: ld returned 1 exit status
Makefile:21: recipe for target 'openblas_utest' failed
make[1]: *** [openblas_utest] Error 1
make[1]: Leaving directory '/home/user/OpenBLAS_src/utest'
Makefile:111: recipe for target 'tests' failed
make: *** [tests] Error 2
root@debian:/home/user/OpenBLAS_src#

UPD1

make DYNAMIC_ARCH=1 CONSISTENT_FPCSR=1 CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran HOSTCC=gcc NUM_THREADS=64 TARGET=BARCELONA had worked.

However, as far as I understand, DYNAMIC_ARCH=1 is exactly what I want to strip in order to better suit my purpose...


UPD2

    make clean
    make DYNAMIC_ARCH=0 CONSISTENT_FPCSR=1 CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran HOSTCC=gcc NUM_THREADS=64 TARGET=BARCELONA

still fails with the same error as I get without the DYNAMIC_ARCH=0 flag :(

I guess I'm doing something wrong, but I have no idea what it is... Could somebody please help?

brada4 commented 7 years ago

Cross-compilation should not run any tests; there is basically no chance they would ever succeed. You can find all the options and change the defaults in Makefile.rule.

Arech commented 7 years ago

Cross-compilation should not run any tests; there is basically no chance they would ever succeed.

mmm, great, but how do I skip them in the build process? I don't see any test-related options in Makefile.rule...

brada4 commented 7 years ago

Dynamic arch, just like FPCSR, is initialized once at library load. It will not speed up your library calls.

brada4 commented 7 years ago

Can we call it a bug? Try something like TARGET=Atom and see whether it fails at the same place or not.

martin-frbg commented 7 years ago

Try removing the make -C utest all line from the Makefile as a quick fix (attempt). I wonder whether the same still happens with the current develop snapshot rather than 0.2.19 (but it probably will; 3d50ccdc "allow building tests when CROSS compiling but don't run them" from a year ago may be related - note that it seems to be failing in the utest build already).

Arech commented 7 years ago

Try something like TARGET=Atom and see whether it fails at the same place or not.

@brada4, indeed it fails at exactly the same place.

Try removing the make -C utest all line from the Makefile as a quick fix (attempt).

@martin-frbg I have just tried it, but unfortunately it fails in a similar manner at another step:

make clean
make DYNAMIC_ARCH=0 CONSISTENT_FPCSR=1 CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran HOSTCC=gcc NUM_THREADS=6 TARGET=BARCELONA
...
make[1]: Leaving directory '/home/user/OpenBLAS_src/lapack-netlib'
touch libopenblas_barcelonap-r0.2.19.a
make[1]: Entering directory '/home/user/OpenBLAS_src/exports'
perl ./gensymbol win2k    x86_64 dummy 0 0 0 0 0 0 "" "" 1 > libopenblas.def
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME=dllinit -DASMFNAME=dllinit_ -DNAME=dllinit_ -DCNAME=dllinit -DCHAR_NAME=\"dllinit_\" -DCHAR_CNAME=\"dllinit\" -DNO_AFFINITY -I.. -c -o dllinit.obj -s dllinit.c
x86_64-w64-mingw32-ranlib ../libopenblas_barcelonap-r0.2.19.a
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I..  libopenblas.def dllinit.obj \
-shared -o ../libopenblas.dll -Wl,--out-implib,../libopenblas.dll.a \
-Wl,--whole-archive ../libopenblas_barcelonap-r0.2.19.a -Wl,--no-whole-archive -L/usr/lib/gcc/x86_64-w64-mingw32/6.3-win32 -L/usr/lib/gcc/x86_64-w64-mingw32/6.3-win32/../../../../x86_64-w64-mingw32/lib  -lgfortran -lmingw32 -lmoldname -lmingwex -lmsvcrt -lquadmath -lm -lmingw32 -lmoldname -lmingwex -lmsvcrt -lmingw32 -lmoldname -lmingwex -lmsvcrt   -defaultlib:advapi32 -lgfortran -defaultlib:advapi32 -lgfortran
../libopenblas_barcelonap-r0.2.19.a(sgetrf_single.obj):getrf_single.c:(.rdata$.refptr.gotoblas[.refptr.gotoblas]+0x0): undefined reference to `gotoblas'
collect2: error: ld returned 1 exit status
Makefile:93: recipe for target '../libopenblas.dll' failed
make[1]: *** [../libopenblas.dll] Error 1
make[1]: Leaving directory '/home/user/OpenBLAS_src/exports'
Makefile:102: recipe for target 'shared' failed
make: *** [shared] Error 2
root@debian:/home/user/OpenBLAS_src#

Any ideas?

brada4 commented 7 years ago

Please attach the full build log (collectable on Linux with the 'script' command). What you copied says sgetrf-something is missing, but it does not show the earlier error where it failed to build, or whatever else went wrong.

martin-frbg commented 7 years ago

Also, it may be worth trying with a current "develop" snapshot (or wait till 0.2.20 is out - it should not be long now, hopefully). At least your command line works fine for me with 0.2.20dev and x86_64-w64-mingw32.static-gcc/gfortran built by mxe.

Arech commented 7 years ago

@brada4 Here is the log of make clean && make DYNAMIC_ARCH=0 CONSISTENT_FPCSR=1 CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran HOSTCC=gcc NUM_THREADS=6 TARGET=BARCELONA https://raw.githubusercontent.com/Arech/tmp/master/OpenBLASv0.2.19_build_log.txt

Maybe it's not relevant anymore, because:

worth trying with a current "develop" snapshot

@martin-frbg works for me too!

Thanks!

brada4 commented 7 years ago

Thanks for confirming that 0.2.20 will solve this issue. The FPCSR question is still open, and you wish to make it as fast as possible too.

martin-frbg commented 7 years ago

So do I understand correctly that

Arech commented 7 years ago

The FPCSR question is still open, and you wish to make it as fast as possible too.

@brada4 , correct.

As you said, building with DYNAMIC_ARCH=0 doesn't seem to help run-time performance. Also, I've just tried building with -ffast-math and it doesn't seem to help significantly either (which is quite expected, given that most of the performance-sensitive code is in assembly).

Any thoughts on how to improve run-time performance (especially for cblas_sgemm(), cblas_ssyrk() and cblas_ssymm()) are welcome!

  • OpenBLAS normally will keep using the FP register settings defined in the user-supplied main program(??)
  • due to a mingw defect this does not work with mingw builds unless CONSISTENT_FPCSR is defined at build time

True. I am not sure it is due to a mingw defect, but I am not competent in that area, so maybe.

  • saving and restoring FP registers on each thread invocation may add noticeable overhead

I don't know how noticeable it is - probably not very. For example, if Windows maintains a separate floating-point environment for each thread (unfortunately, I don't remember whether that's true, but it probably is), then it has to switch that environment many times a second during thread context switches.

But regardless, what is the point of doing it? I can hardly imagine an app that mixes calculations with different FP settings in parallel. It seems as redundant as an empty loop for(i=0;i<10000;++i) inserted after every other line of code and not optimized away by the compiler. IMHO, in most use cases CONSISTENT_FPCSR=1 is just a delay that could be avoided, and nothing more...

  • a convenience function for disabling denormals would still be desirable to allow handling completely within OpenBLAS

Probably, yes. A function to disable denormals would be a great solution, but for me even a simple environment setting like OBLAS_DISABLE_DENORMALS, evaluated only once during the library's thread pool startup, would work (provided there is no other code that might unexpectedly turn denormals back on). Then I would be able to compile with CONSISTENT_FPCSR=0 to drop the FP-state saving/restoring code completely.

ADDED: please note that I am not voting to drop support for the CONSISTENT_FPCSR flag in its current form - it might be useful for some users. I am just voting for a (very easy to implement) improvement.

  • the current FPCSR handling appears to be specific to gcc/x86 ...

Yes. There is even a corresponding comment in Makefile.rule:

    # If you need to synchronize FP CSR between threads (for x86/x86_64 only).
    # CONSISTENT_FPCSR = 1

brada4 commented 7 years ago

You now have it 10x faster. Do you think 1% more speedup will change anything? Many newer processors do not suffer at all (time-wise) when handling denormals.

How big are your typical input dimensions? If they fit under the L3 cache size, you are better off with single-threaded OpenBLAS, doing the threading inside your own code.