Closed szx0112 closed 5 years ago
Hi @szx0112,
that is expected behavior: in one case you're using 10 threads in total (fewer than your physical or hyperthreaded cores), and in the other you're using 40 threads (4 outer × 10 inner). With 40 threads, more time is spent on context switching than on processing. If you use 4/4 (outer/inner) threads in the nested case, I could imagine things working about as well as with just the outer parallelization. You may have to play around a little with the thread usage, because many factors affect efficiency here (concurrent IO can be a big bottleneck, for example) and more is not always better.
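To make the arithmetic concrete, here is a minimal sketch of the thread budget. It assumes that nested parallel regions multiply (each outer thread starts its own set of inner threads); `thread_budget` is a hypothetical helper for illustration, not part of pymp.

```python
import os


def thread_budget(outer, inner, cores=None):
    """Return (total threads, oversubscribed?) for a nested parallel region.

    Assumption: every outer thread starts `inner` threads of its own,
    so the total is outer * inner.
    """
    cores = cores or os.cpu_count() or 1
    total = outer * inner
    return total, total > cores


# Numbers from this thread, on a 16-core machine:
print(thread_budget(4, 10, cores=16))  # (40, True)  -> oversubscribed
print(thread_budget(1, 10, cores=16))  # (10, False) -> fits
print(thread_budget(4, 4, cores=16))   # (16, False) -> fits exactly
```

Anything that reports `True` here is a candidate for spending its time on context switching rather than on the actual work.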
In general, I recommend using pymp at the 'uppermost' loop level for higher efficiency.
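As a rough sketch of what that looks like, the snippet below parallelizes only the outer loop and leaves the inner loop as a plain Python loop. `work`, `run`, and the loop sizes are hypothetical placeholders for your actual computation, and the serial fallback is only there so the sketch still runs where pymp is not installed.

```python
try:
    import pymp
except ImportError:
    # Fallback so the sketch stays runnable without pymp.
    pymp = None


def work(i, j):
    # Hypothetical stand-in for the real per-item computation.
    return i * 10 + j


def run(n_outer=8, n_inner=10, outer_threads=4):
    if pymp is None:
        # Serial fallback with the same result layout.
        return [[float(work(i, j)) for j in range(n_inner)]
                for i in range(n_outer)]

    # Shared array so that results written by worker processes
    # are visible to the parent after the parallel section.
    result = pymp.shared.array((n_outer, n_inner), dtype='float64')

    # Only the outermost loop is distributed across threads.
    with pymp.Parallel(outer_threads) as p:
        for i in p.range(n_outer):
            for j in range(n_inner):  # plain, serial inner loop
                result[i, j] = work(i, j)
    return result.tolist()


if __name__ == "__main__":
    print(run()[3][7])
```

The thread count passed to `pymp.Parallel` is then the total number of workers, which makes it easier to keep it at or below the number of cores.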
When I use nested for loops on my desktop (16 cores), with the outer loop using 4 cores and the inner loop using 10 cores, I found it is actually slower than using pymp only in the inner loop (10 cores) with a normal Python for loop as the outer loop.
Nested pymp loops: 50 sec
pymp on the inner loop only: 25 sec
I also found that when I run the nested loops, all 14 CPUs are working but only at 10-20% usage, while the single pymp loop reaches 100% usage on each of the 10 cores.
Any comments that could help with this problem are appreciated.