Closed tbalke closed 3 years ago
If you want physical vs. logical cores, use:

```python
from psutil import cpu_count
...
num_threads = cpu_count(logical=False)
```
With Brendt's advice, I used num_threads = cpu_count(logical=True) instead of num_threads = cpu_count(logical=False). On my computer, num_threads = cpu_count(logical=True) gives a faster result:
- num_threads=1 takes 17 s for a 20-iteration reconstruction.
- num_threads=4 takes 10 s for a 20-iteration reconstruction.
The optimal thread count will depend on the data, so we should not try to tune it for the demo data. I think cpu_count(logical=False) would be a reasonable default choice for most data.
I assume you have 4 cores on the machine you tested this on? Given that p=4 times as many threads were used, not even doubling the speed is understandable, but not especially impressive efficiency.
I assume you used a small 2D problem. With a large 3D problem, the efficiency E = T_1 / (p T_p) would likely be better.
But how about using 8 threads? Maybe that is faster. Whether 8 is faster may also depend on how large the problem is.
Don’t get me wrong: I think p = cpu_count is probably the right number, but we may want to double-check.
Thilo
I didn’t really understand the statement below. Can you give me an example?

> I assume you used a small 2D problem. With a large 3D problem, the efficiency E = T_1 / (p T_p) would likely be better.

I think we can summarize the solutions and test them.
Also, I think using p = cpu_count is better than just setting it to 1, so cpu_count as the default num_threads is fine. But you are right, we need to double-check.
Best,
Wenrui
Hello Soumendu,
Do you have any data I can use to test this?
Best,
Wenrui
Wenrui,
> I didn’t really understand the statement below. Can you give me an example?
> I assume you used a small 2D problem. With a large 3D problem, the efficiency E = T_1 / (p T_p) would likely be better.
In parallel computing, efficiency (https://www.quora.com/What-is-the-definition-of-efficiency-in-parallel-computing) is an often-used quantity for analyzing algorithm scalability. When choosing p, we probably want both a high speedup S and a high efficiency E = S / p = T_1 / (p T_p) over a large range of problem sizes, where T_p is the time it takes with p threads. Often E rises with larger problems. (Imagine using 24 cores to compute the sum of 24 numbers: that won't be much faster than using 1 core, and may be slower. With a large array, however, many cores will help.)
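As a concrete check, here is a small sketch that plugs the timings Wenrui reported (17 s with 1 thread, 10 s with 4 threads, for 20 iterations) into these definitions; the variable names are illustrative, not part of the codebase:

```python
# Speedup and efficiency from the reported timings (20 iterations, small 2D problem).
T_1 = 17.0  # seconds with num_threads=1
T_4 = 10.0  # seconds with num_threads=4
p = 4

S = T_1 / T_4  # speedup S = T_1 / T_p, about 1.7
E = S / p      # efficiency E = S / p = T_1 / (p * T_p), about 0.43
print(f"speedup S = {S:.2f}, efficiency E = {E:.2f}")
```

An efficiency well below 1 on a small 2D problem is consistent with the point above that E would likely improve on a larger 3D problem.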
> Do you have any data I can use to test this?

We have the phantom, right? We could create our own data by using the project function with a larger list of view angles. Also, stacking the 2D reconstruction multiple times, e.g. x_3D = np.stack([x_2D for i in range(num_slices)]), would give us a 3D volume.
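A quick sketch of building such a 3D test volume (x_2D and num_slices are illustrative stand-ins, not names from the codebase):

```python
import numpy as np

# Stand-in for a 2D phantom reconstruction; any (H, W) array works here.
num_slices = 200
x_2D = np.ones((200, 200), dtype=np.float32)

# Stack the same slice num_slices times to get a 3D volume of shape
# (num_slices, 200, 200), large enough for timing multi-threaded runs.
x_3D = np.stack([x_2D for i in range(num_slices)])
print(x_3D.shape)
```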
Thilo
I think we should not overthink the choice of the default number of threads. As long as the choice is not completely stupid, I think we are good.
The 'best' number of threads depends on the data size, which we don't know. I think cpu_count(logical=False) will be reasonable for most cases; expert users can tune it themselves.
If the default is too low (e.g., 1), a naive user may feel our code is too slow for large data. If it is too high (e.g., much more than the number of cores), the computer might freeze, and the user will not like that.
I agree with Soumendu. We could do a brief check on a large problem (say 200x200x200) to see whether cpu_count(logical=True) gives a better result. If not (most likely), we can just use cpu_count(logical=False); otherwise we can discuss the results.
Thilo
My expectation is that the results will be CPU-dependent, so I don't think that a single test will resolve this question. If the code really does consistently make good use of logical cores, then logical=True is a better choice, but otherwise I would view logical=False as a safer default. If using logical cores does end up looking consistently better, then I would consider using os.cpu_count(), which always counts logical cores, to avoid introducing the additional dependency on psutil.
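A hedged sketch of one way to reconcile the two options, treating psutil as an optional dependency (this is a suggestion for discussion, not the implemented behavior):

```python
import os

# Prefer physical cores via psutil when it is installed; otherwise fall back
# to os.cpu_count(), which always counts logical cores. psutil's
# cpu_count(logical=False) can return None on some platforms, hence the `or`.
try:
    from psutil import cpu_count
    num_threads = cpu_count(logical=False) or os.cpu_count()
except ImportError:
    num_threads = os.cpu_count()

print(num_threads)
```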
Keep in mind also, if you're testing the speed of this, to look at how fast it's converging. If you're running with the default settings of no stopping condition and a fixed number of iterations, this won't be obvious. Instead, set a stopping condition of around 0.1% and increase max_iterations to something large.
I suspect, but don't yet have evidence, that running with more threads than physical cores might slightly speed up the per-iteration time but could negatively affect convergence, because the changes to the error sinogram are buffered until the threads are synchronized. If there isn't a significant speedup from the additional threads, they effectively just delay the update of the error sinogram, which probably isn't helpful.
This raises the issue of standard data sets for testing the code. We need to identify some public data sets that we can put into numpy array format and post in public locations for testing the code.
Can we use one of the data sets on TomoBank (https://tomobank.readthedocs.io/en/latest/index.html) to set up a standard test of our code?
Also, we need to incorporate Travis CI to do this.
For a memory-bound application such as sv-mbir, it's usually a good idea to set the number of threads equal to the number of physical cores. Hyper-threading is usually not good for a memory-bound application because more threads compete for the limited memory bandwidth. Another issue is convergence: more threads lead to slower convergence, and hyper-threading can slow it down further.
Xiao, thanks for the expert advice. Let's just implement num_threads = cpu_count(logical=False) for now, and put this to bed.
(Not 1, but maybe Python's multiprocessing.cpu_count(), which counts hyper-threaded cores, or the number of nominal cores?)
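For reference, a small check of the standard-library counters mentioned above: multiprocessing.cpu_count() is built on os.cpu_count(), and both report logical (hyper-threaded) CPUs, so neither gives the physical core count that psutil's cpu_count(logical=False) does.

```python
import multiprocessing
import os

# Both of these count logical CPUs (including hyper-threaded ones),
# not physical cores, and they agree with each other.
print(multiprocessing.cpu_count(), os.cpu_count())
```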