OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

openblas and openmp #2265

Closed: bill-hager closed this issue 10 months ago

bill-hager commented 5 years ago

I have tried to use OpenBLAS with Tim Davis' SuiteSparse package. I downloaded OpenBLAS from Red Hat on my Dell desktop and from Ubuntu on my ThinkPad laptop; in either case, I have similar problems. The problem occurs when his software tries to perform a supernodal Cholesky factorization, which requires dgemm from BLAS. On my 32-processor desktop, the factorization is about 1000 times slower than it should be. On my 8-processor laptop, it is 7 times slower than it should be. Profiling shows that 57% of the time is spent in blas_thread_server and 35% in alloc_map.

If, after the factorization completes, I immediately perform the factorization again, the time drops to 0.1 seconds on either machine, which is the correct factorization time (on the 32-processor desktop, the initial factorization took 86 seconds). The current version of SuiteSparse uses OpenMP, so there seems to be some problem with the OpenMP code inside OpenBLAS. If I essentially turn off threading with "setenv OMP_NUM_THREADS 1", the factorization time is 0.2 seconds, and the huge run times disappear. Nonetheless, that time is still twice what it would be if threading worked. Is it possible to fix dgemm so that multiprocessor threading works with OpenMP? dgemm in OpenBLAS does work correctly with pthreads; it is with OpenMP threading that it does not seem to work.

To repeat: the initial factorization takes 86 seconds, and if I immediately refactor the matrix, it takes 0.1 seconds. On the other hand, if I factor the matrix, exit the routine where I factor it, do some work in other routines, and then return to the routine where I call the factorization, it takes another 86 seconds. The drop from 86 seconds to 0.1 seconds only happens when the second factorization immediately follows the first.
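In case it helps, here is the kind of minimal standalone reproducer for the pattern I am describing (a sketch only, not code from SuiteSparse; it assumes OpenBLAS's cblas.h and an OpenMP build, and the matrix size n is merely illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cblas.h>
    #include <omp.h>

    /* Time one n x n dgemm call: C = A*B. */
    static double time_dgemm(int n, const double *A, const double *B, double *C)
    {
        double t0 = omp_get_wtime();
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        return omp_get_wtime() - t0;
    }

    int main(void)
    {
        int n = 6000;               /* illustrative size, like my test matrix */
        size_t sz = (size_t)n * n;
        double *A = malloc(sz * sizeof(double));
        double *B = malloc(sz * sizeof(double));
        double *C = malloc(sz * sizeof(double));
        for (size_t i = 0; i < sz; i++) { A[i] = 1.0; B[i] = 1.0; }

        /* The first call pays any one-time cost (thread startup, buffer maps). */
        printf("first  dgemm: %.3f s\n", time_dgemm(n, A, B, C));
        /* An immediate second call should show steady-state performance. */
        printf("second dgemm: %.3f s\n", time_dgemm(n, A, B, C));

        free(A); free(B); free(C);
        return 0;
    }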

martin-frbg commented 5 years ago

Which version(s) of OpenBLAS? Slowness on (only) the first run makes it sound like some cache contention issue; what are your other OpenMP environment variables? (Could be related to #1653, which unfortunately has no clear resolution so far.)

brada4 commented 5 years ago

What CPU? 32 processors (that barely fit under the desk) XOR 32 cores XOR 32 hyperthreads? About the immediacy: are you loading the data from the hard drive?

EDIT: what do you mean by "from Red Hat"? They favor ATLAS, not OpenBLAS. You can get OpenBLAS 0.3.3 from Fedora EPEL, or better, do your own rpmbuild from Fedora's 0.3.7 SRPM.

brada4 commented 5 years ago

Please compare profiles of the threaded and single-threaded runs, i.e.

    perf record ./sample ; perf report

versus

    OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 perf record ./sample ; perf report

Try to piece together what happens inside the SuiteSparse calls and inside the OpenBLAS library.

bill-hager commented 5 years ago

Andrew,

The problem that I am describing occurs on multiple computer platforms. Below I paste the output of /proc/cpuinfo for two specific computers on which I have performed my experiments. As the number of processors increases, the problems become more apparent. The data is in memory, not being loaded from a hard drive (of course, it had to be brought into memory from a hard drive initially, but the experiments are done with data in memory), and there is plenty of memory. Also note that the results I get depend on the specific matrix being factored. The matrix I used for the experiments described above had about 6000 rows and columns, and dgemm would have been called many times during the factorization. Here is /proc/cpuinfo for a Dell T7910 computer that I have used; it seems to indicate 8 cores but 32 processors.

processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2687W v2 @ 3.40GHz
stepping        : 4
microcode       : 1064
cpu MHz         : 1200.000
cache size      : 25600 KB
physical id     : 1
siblings        : 16
core id         : 11
cpu cores       : 8
apicid          : 55
initial apicid  : 55
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 6782.69
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual

Here is the data for a recently purchased Lenovo X1 thinkpad:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 142
model name      : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
stepping        : 10
microcode       : 0xb4
cpu MHz         : 900.014
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips        : 4224.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:


martin-frbg commented 5 years ago

The E5-2687W v2 will use the Sandybridge target, 8 cores/16 threads per socket, and on a two-socket system an added problem could be tasks getting pushed from one socket to the other. The i7-8650U will use the Haswell target. The OpenBLAS version is still of interest, as early 0.3.x releases had performance issues due to unnecessary locking (though if pthreads performance is normal, it is probably not one of those).

brada4 commented 5 years ago

A first test with the Xeon would be to set it to 8 cores, so that it uses one side of the NUMA system without the HT pseudocores. Does it get close to 8x better than 1 core?
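From the C side that test could look like this (a sketch; openblas_set_num_threads and openblas_get_num_threads are OpenBLAS's own extensions declared in its cblas.h, and actual core placement still needs taskset or OMP_PLACES / OMP_PROC_BIND):

    #include <stdio.h>
    #include <cblas.h>   /* OpenBLAS's cblas.h declares the openblas_* extensions */

    int main(void)
    {
        /* Restrict OpenBLAS to the 8 physical cores of one socket, avoiding
           the HT pseudocores; pin the process with e.g. taskset -c 0-7 so
           the threads actually stay on that socket. */
        openblas_set_num_threads(8);
        printf("OpenBLAS threads: %d\n", openblas_get_num_threads());

        /* ... run the CHOLMOD factorization here and time it, then repeat
           with openblas_set_num_threads(1) and compare the two times ... */
        return 0;
    }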

brada4 commented 5 years ago

Include/cholmod_supernodal.h

 * BLAS routines:
 * dtrsv        solve Lx=b or L'x=b, L non-unit diagonal, x and b stride-1
 * dtrsm        solve LX=B or L'X=b, L non-unit diagonal
 * dgemv        y=y-A*x or y=y-A'*x (x and y stride-1)
 * dgemm        C=A*B', C=C-A*B, or C=C-A'*B
 * dsyrk        C=tril(A*A')

dtrsv is not parallel ... the rest are guarded ... except that dsyrk is not guarded against excess parallelism, as we had planned a while ago in #1886. It would be nice to re-confirm with the profiler that this is what is failing.
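The guard from #1886 is, roughly, a size threshold on the interface side: fall back to one thread when the update is too small to amortize waking the thread pool. A sketch of the idea (hypothetical helper and threshold, not actual OpenBLAS code):

    #include <cblas.h>

    #define SMALL_SYRK_LIMIT 10000.0   /* illustrative cutoff on n*k, not tuned */

    /* dsyrk with a guard against excess parallelism: C = A*A' + C,
     * lower triangle, A is n x k in column-major order. */
    static void guarded_dsyrk(int n, int k, const double *A, double *C)
    {
        int saved = openblas_get_num_threads();
        if ((double)n * (double)k < SMALL_SYRK_LIMIT)
            openblas_set_num_threads(1);   /* too small to pay thread startup */
        cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                    n, k, 1.0, A, n, 1.0, C, n);
        openblas_set_num_threads(saved);   /* restore for larger calls */
    }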

martin-frbg commented 10 months ago

Whatever went wrong there in 2019... with current OpenBLAS I get to within 5 percent of the speed of the 2024.0 MKL on comparable hardware when running SuiteSparse 7.5.1's CHOLMOD on large matrix problems from the SuiteSparse Matrix Collection. The speed difference is negligible when the (already suspect) multithreading threshold in GEMV is increased.