Closed: kindloaf closed this issue 7 years ago
Some of the more obvious things to check: Is the CPU clock speed constant, or subject to thermal throttling? Are all the cores of the same type, or is this some kind of big.LITTLE system where a thread may occasionally end up on one of the less powerful cores? How many cores/threads are you using? What else is running on the system (and might, e.g., occasionally flush a CPU cache)? Is the timer granularity sufficient to measure runtimes in the millisecond range? Do you see similar variations with the reference BLAS from Netlib?
BTW, if you can tell us what type of ARM CPU (and which version of OpenBLAS) you are using, perhaps someone can follow up with specific suggestions - not all CPU TARGETs in OpenBLAS are equally well optimized.
Here is my environment:
(1) The CPU clock speed (frequency) is fixed, by setting the minimum and maximum frequency of each core to the maximum value.
(2) 2 cores are slightly faster than the other 4, but they all run at the same frequency, and in single-core tests their cblas_sgemm performance is similar.
(3) I am using 6 threads, by setting OPENBLAS_NUM_THREADS=6. Observing through htop, I saw all 6 cores at 100%. I observed unstable run times with 3, 4, 5, or 6 threads; single-thread and 2-thread runs have been stable so far.
(4) The system is doing nothing else but running cblas_sgemm.
(5) I used gettimeofday before and after each sgemm call; each call to gettimeofday itself takes ~0.0001 ms.
(6) I haven't tried the reference BLAS from Netlib.
(1) Please change the governor to powersave to rule out thermal issues. (2) Please measure with taskset how much "slightly" actually is. (3) That is the big.LITTLE issue - actually, are 3 cores any faster than 2? (5) That's 0.1 ms; clock_gettime() can access higher-resolution clocks. (7) Does /proc/cpuinfo (attach it if unsure) reflect the variation between cores?
@brada4 I just figured out the issue - it is indeed due to the big.LITTLE cores. When using only the big cores or only the little cores, the run times are much more stable. Thanks. By the way, on big.LITTLE CPUs, is the common practice for OpenBLAS to use only the big cores or only the little cores?
Can you detect big cores and measure/compare with cpuset to answer your question? I am quite weak at remote sensing.
Will do. Thanks.
If you think there is a stable way to detect big cores, feel free to share the best approach (it was not possible a year ago).
@brada4 Here is how I did it: I am working with a specific CPU, so I read its spec, which says there are 2 big cores at one frequency and 4 little cores at another. Then I checked /proc/cpuinfo: 2 cores have exactly the same description, and the other 4 cores share a different description. So I assume the former 2 are the big cores.
Also, when I ran OpenBLAS with 2 threads, it always chose the two presumably "big" cores. When I used more than 2 threads, those 2 cores were always chosen, and the remaining cores were picked seemingly at random. This more or less confirmed my theory about which cores are which.
I am afraid this theorizing is not helpful. At least cpuinfo or something similarly substantial would help.
Can you share your Makefile? I have some issues.
@ctgushiwei I just used the default Makefile. What error did you see?
@kindloaf My cpuinfo is similar to yours. Which OpenBLAS version do you use - the develop branch or the arm_soft_fp_abi branch?
@ctgushiwei I used the develop branch.
@kindloaf I can compile the 0.2.19 release version successfully on ARMv7, but when I test cblas_sgemm I get a segmentation fault. I have found the reason: the code under openblas_0.2.19/kernel/ cannot be compiled into .o files, but I do not know how to fix it. Can you help me solve this, or share your Makefile and the parameters you passed to the 'make' command?
Hi, I am testing the run time of cblas_sgemm on a 6-core ARM CPU. To measure it, I ran cblas_sgemm 1000 times with the same arguments. Surprisingly, the average run times varied greatly between runs: for a matrix multiplication with m*n*k of ~30M, the average run time ranged from 1.5 ms to 9.6 ms. Is this much fluctuation reasonable?