OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

gemm execution time is unstable in multi process system #4651

Open pocarisweat257 opened 3 months ago

pocarisweat257 commented 3 months ago

Hello. We are using the NVIDIA Jetson Orin platform in a multi-process system. Each process is pinned to a specific CPU, and the processes take turns running a multithreaded cblas_sgemm() on the remaining cores. With only one process, the execution time of cblas_sgemm() is almost constant, but as the number of processes increases, the execution time becomes unstable. A single-threaded cblas_sgemm() takes about 25 ms, whereas with 6 processes using 6 threads each the running time varies from 10 ms to 60 ms.

Detailed description of the situation: We have 11 CPU cores, numbered 1 to 11. We create six processes, pin them to CPUs 1 to 6, synchronize their start times, and run them simultaneously. Partway through, a semaphore lock makes the processes take turns, so that each one in sequence runs cblas_sgemm() with a total of 6 threads, using the 5 remaining cores (CPUs 7 to 11) for the worker threads. In this case, cblas_sgemm() in the processes pinned to cores 1 to 5 takes about 11 ms, and the execution time is constant. However, in the last process, pinned to core 6, cblas_sgemm() varies erratically from 30 ms to 60 ms, which is slower than running with one thread. The same code is executed on cores 1 to 6.

The code currently in use is as follows.

openblas_thread = 6;
openblas_set_num_threads (openblas_thread); // Set the number of openblas threads to 6

// reallocate itself to the original assigned CPU number (1 to 6, respectively)
CPU_ZERO(&cpuset);
CPU_SET(sched_getcpu(), &cpuset);
pthread_setaffinity_np (pthread_self(), sizeof(cpuset), &cpuset);

// allocate openblas_thread to CPU cores (7 to 11), excluding the main process(thread)
for (int k = 0; k < openblas_thread-1; k++) {
    CPU_ZERO(&cpuset);
    CPU_SET(11 - k, &cpuset);
    openblas_setaffinity(k, sizeof(cpuset), &cpuset);
}

for(int i = 0; i < 30; i++) {
    gemm_wrapper_func();
}
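The pinning of the calling thread can be checked by reading the affinity mask back after setting it. The following is a minimal sketch, assuming Linux with glibc. It uses sched_setaffinity() and sched_getaffinity() with pid 0 (meaning the calling thread) in place of pthread_setaffinity_np(), and the helper names pin_to_current_cpu(), allowed_cpu_count(), and may_run_on() are hypothetical, not part of OpenBLAS.

```c
#define _GNU_SOURCE   /* needed for CPU_SET, sched_getcpu, sched_setaffinity */
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to the CPU it is currently running on.
   Returns that CPU number, or -1 on failure. */
int pin_to_current_cpu(void)
{
    int cpu = sched_getcpu();
    if (cpu < 0)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 applies the mask to the calling thread */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return cpu;
}

/* Number of CPUs the calling thread is allowed to run on, or -1 on failure. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Returns 1 if the calling thread may run on `cpu`, 0 if not, -1 on failure. */
int may_run_on(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);

    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_ISSET(cpu, &set) ? 1 : 0;
}
```

Note that this only inspects the calling thread. It says nothing about where the OpenBLAS worker threads run, so it cannot confirm the effect of openblas_setaffinity().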

When I measured the execution time inside the OpenBLAS code, I confirmed that the calculation time becomes unstable while processing the cblas_sgemm() call made in the wrapper function. Is the code above a correct way to use OpenBLAS?

[Attached image: Screenshot from 2024-04-16 23-14-44]
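One way to quantify the instability is to time each call separately and report the minimum, maximum, and mean over the 30 iterations. The following is only a minimal sketch: timing_stats, time_calls(), and dummy_work() are hypothetical names introduced for illustration, and dummy_work() is a stand-in for gemm_wrapper_func(), whose body is not shown in this issue.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime and CLOCK_MONOTONIC */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Summary of per-call wall-clock times, in milliseconds. */
typedef struct {
    double min_ms;
    double max_ms;
    double mean_ms;
} timing_stats;

/* Current monotonic time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Call fn() `runs` times and record the min, max and mean duration. */
timing_stats time_calls(void (*fn)(void), int runs)
{
    assert(fn != NULL && runs > 0);

    timing_stats s = { 0.0, 0.0, 0.0 };
    double total = 0.0;
    for (int i = 0; i < runs; i++) {
        double t0 = now_ms();
        fn();
        double dt = now_ms() - t0;
        if (i == 0 || dt < s.min_ms) s.min_ms = dt;
        if (i == 0 || dt > s.max_ms) s.max_ms = dt;
        total += dt;
    }
    s.mean_ms = total / runs;
    return s;
}

/* Hypothetical stand-in workload; replace with the real GEMM wrapper. */
void dummy_work(void)
{
    volatile double acc = 0.0;
    for (int i = 0; i < 100000; i++)
        acc += i * 0.5;
}
```

Assuming gemm_wrapper_func takes no arguments and returns void, as the loop above suggests, calling time_calls(gemm_wrapper_func, 30) in each process and printing max_ms - min_ms would show which process has the large spread.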

martin-frbg commented 3 months ago

I'm not sure if I understand your test setup correctly - "six threads each in six processes" reads a bit like the total number of threads exceeds the capacity of your CPU? That said, current OpenBLAS does not specifically recognize the Cortex A78(?) of the Jetson Orin (treating it as some generic ARMV8 CPU), so GEMM parameters tuned to cache size will probably be wrong.

pocarisweat257 commented 3 months ago

Thanks for your reply. We conducted the test with a different configuration.

Only two processes were created and pinned to cpu 1 and cpu 2, and the number of threads used for cblas_sgemm() in each process was set to two with openblas_set_num_threads (2). Even so, the execution time measured on cpu 2 was again longer than the execution time on cpu 1.

OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... Linux             
  Architecture     ... arm64               
  BINARY           ... 64bit                 
  C compiler       ... GCC  (cmd & version : cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0)
  Fortran compiler ... GFORTRAN  (cmd & version : GNU Fortran (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0)
  Library Name     ... libopenblas_armv8p-r0.3.23.dev.a (Multi-threading; Max num-threads is 11)

According to the build results, it seems to recognize ARM64.

lock();
openblas_thread = 2;
openblas_set_num_threads (openblas_thread);

CPU_ZERO(&cpuset);
CPU_SET(sched_getcpu(), &cpuset);
pthread_setaffinity_np (pthread_self(), sizeof(cpuset), &cpuset);

CPU_ZERO(&cpuset);
CPU_SET(11, &cpuset);
openblas_setaffinity(0, sizeof(cpuset), &cpuset);

for(int i = 0; i < 30; i++) {
    gemm_wrapper_func();
}
unlock();

openblas_set_num_threads (1);
CPU_ZERO(&cpuset);
CPU_SET(sched_getcpu(), &cpuset);
pthread_setaffinity_np (pthread_self(), sizeof(cpuset), &cpuset);

for(int i = 30; i < 60; i++) {
    gemm_wrapper_func();
}

In our current code, we change the number of threads repeatedly at runtime. The multithreaded gemm operations are processed first, and then we set the number of threads to 1 so that the remaining gemm operations are processed in a single thread. The lock prevents the 'parallel operation' phase of one process from overlapping with the 'single thread operation' phase of the other. The OpenBLAS worker thread used for the parallel operation is pinned to the same CPU, cpu 11, in both processes.

Will continuing to change the number of openblas threads in this way at runtime affect the execution time in certain processes?