cabouman / svmbir

Fast code for parallel or fan beam tomographic reconstruction
BSD 3-Clause "New" or "Revised" License
19 stars · 8 forks

num_threads should default to a "good" number #29

Closed tbalke closed 3 years ago

tbalke commented 3 years ago

(not 1)

but maybe Python's multiprocessing.cpu_count() (the number of hyper-threaded logical cores), or the number of physical cores?
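For reference, multiprocessing.cpu_count() reports the logical core count:

```python
import multiprocessing

# cpu_count() counts logical cores, i.e. it includes
# hyper-threaded siblings, not just physical cores.
print(multiprocessing.cpu_count())
```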

bwohlberg commented 3 years ago

If you want actual vs logical cores, use:

from psutil import cpu_count
...
num_threads = cpu_count(logical=False)
DamonLee5 commented 3 years ago

Following Brendt's advice, I compared num_threads = cpu_count(logical=True) with num_threads = cpu_count(logical=False).

On my computer, num_threads = cpu_count(logical=True) gives the faster result:

  • num_threads=1 takes 17 s for a 20-iteration reconstruction.
  • num_threads=4 takes 10 s for a 20-iteration reconstruction.
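A timing comparison like this can be scripted roughly as below; here `reconstruct` is a hypothetical stand-in (not the svmbir API) so the sketch is self-contained:

```python
import time

def reconstruct(num_threads):
    # Hypothetical stand-in for the real reconstruction call;
    # a cheap fixed workload in place of the actual solver.
    return sum(i * i for i in range(100_000))

timings = {}
for p in (1, 4):
    t0 = time.perf_counter()
    reconstruct(num_threads=p)
    timings[p] = time.perf_counter() - t0

for p, t in timings.items():
    print(f"num_threads={p}: {t:.3f} s")
```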

smajee commented 3 years ago

The optimal thread count will depend on the data so we should not try to tune it for the demo data.

I think cpu_count(logical=False) would be a reasonable default choice for most data.

tbalke commented 3 years ago

I assume you have 4 cores on the machine you tested this on?

Given that this is p = 4 times as many threads, not even doubling the speed is understandable, but it is not an especially impressive efficiency.

I assume you used a small 2D problem. With a large 3D problem, the efficiency E = T_1 / (p T_p) would likely be better.

But how about using 8 threads? Maybe that is faster.

Whether 8 is faster may also depend on how large the problem is.

Don't get me wrong, I think that p = cpu_count is probably the right number, but we may want to double-check.

Thilo


DamonLee5 commented 3 years ago

I didn't really understand the statement below. Can you give me an example?

I assume you used a small 2D problem. With a large 3D problem the efficiency E = T_1 / (p T_p) would likely be better.

I think we can summarize the solutions and test them.

Also, I think using p = cpu_count is better than just setting it to 1, so that is fine as the default num_threads. But you are right, we need to double-check.

Best Wenrui


DamonLee5 commented 3 years ago

Hello Soumendu,

Do you have any data I can use to test this?

Best Wenrui


tbalke commented 3 years ago

Wenrui,

I didn't really understand the statement below. Can you give me an example?

I assume you used a small 2D problem. With a large 3D problem the efficiency E = T_1 / (p T_p) would likely be better.

In parallel computing, efficiency (https://www.quora.com/What-is-the-definition-of-efficiency-in-parallel-computing) is an often-used quantity for analyzing algorithm scalability. When choosing p, we probably want both a high speedup S and a high efficiency E = S/p = T_1/(p T_p) over a large range of problem sizes, where T_p is the time taken with p threads. E often rises with larger problems. (Imagine using 24 cores to compute the sum of 24 numbers: that won't be much faster than using 1 core, and may be slower. For a large array, however, many cores will help.)
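Plugging the timings reported earlier (17 s with 1 thread, 10 s with 4) into these formulas as a worked example:

```python
# T1 = time with one thread, Tp = time with p threads.
T1, Tp, p = 17.0, 10.0, 4

S = T1 / Tp  # speedup
E = S / p    # efficiency E = T_1 / (p * T_p)
print(f"speedup S = {S:.2f}, efficiency E = {E:.3f}")
```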

Did you have any data I can use to test this?

We have the phantom, right? We could create our own data by using the project function with a larger list of view angles. Also, stacking the 2D reconstruction multiple times, like x_3D = np.stack([x_2D for i in range(num_slices)]), would give us a 3D volume.
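The stacking idea in code (the array sizes here are just illustrative):

```python
import numpy as np

# Hypothetical 2D reconstruction, e.g. a demo phantom slice.
x_2D = np.zeros((256, 256))
num_slices = 200

# Replicate the slice along a new leading axis to get a 3D test volume.
x_3D = np.stack([x_2D for i in range(num_slices)])
print(x_3D.shape)  # (200, 256, 256)
```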

Thilo


smajee commented 3 years ago

I think we should not overthink the choice of the default number of threads. As long as the choice is not 'completely stupid', we are good.

The 'best' number of threads depends on the data size, which we don't know in advance. Expert users can tune it themselves.

If the default is too low (e.g. 1), a naive user may feel our code is too slow for large data. If it is too high (e.g. much more than the number of cores), the computer might freeze, which the user won't like. I think cpu_count(logical=False) will be reasonable for most cases.

tbalke commented 3 years ago

I agree with Soumendu. We could run a brief test on a large problem (say 200x200x200) and see whether cpu_count(logical=True) gives a better result. If not (most likely), we just use cpu_count(logical=False).

Otherwise we can discuss the results.

Thilo


bwohlberg commented 3 years ago

My expectation is that the results will be CPU-dependent, so I don't think that a single test will resolve this question. If the code really does consistently make good use of logical cores, then logical=True is a better choice, but otherwise I would view logical=False as a safer default. If using logical cores does end up looking consistently better, then I would consider using os.cpu_count(), which always counts logical cores, to avoid introducing the additional dependency on psutil.
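A sketch of that dependency trade-off (not a committed implementation): prefer psutil's physical-core count when the package is installed, and otherwise fall back to os.cpu_count(), which always counts logical cores:

```python
import os

def default_num_threads():
    """Physical core count if psutil is available, else logical core count."""
    try:
        from psutil import cpu_count
        n = cpu_count(logical=False)  # may return None on some platforms
    except ImportError:
        n = None
    return n or os.cpu_count() or 1

print(default_num_threads())
```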

sjkisner commented 3 years ago

Keep in mind, if you're testing the speed of this, to also look at how fast it's converging. With the default settings of no stopping condition and a fixed number of iterations, this won't be obvious. Instead, set a stopping condition of around 0.1% and increase max_iterations to something large.

I suspect, but without evidence yet, that running with more threads than physical cores might slightly speed up the per-iteration time, but could negatively affect the convergence because the changes to the error sinogram are buffered until the threads are synchronized. If there isn't a significant speed up from the additional threads, it's effectively just delaying the update of the error sinogram which probably isn't helpful.
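The suggested setup, a tight stopping condition with a large iteration cap, can be illustrated with a toy fixed-point iteration. The names stop_threshold and max_iterations mirror the parameters discussed above, but the solver itself is only a placeholder, not the svmbir code:

```python
def run_to_convergence(update, x0, stop_threshold=1e-3, max_iterations=1000):
    """Iterate update() until the relative change falls below stop_threshold."""
    x = x0
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        x_new = update(x)
        rel_change = abs(x_new - x) / (abs(x_new) + 1e-12)
        x = x_new
        if rel_change < stop_threshold:
            break
    return x, iterations

# Toy contraction with fixed point 2.0; it converges long before the cap,
# so the iteration count (not a fixed budget) reflects convergence speed.
x, iterations = run_to_convergence(lambda x: 0.5 * x + 1.0, 10.0)
print(x, iterations)
```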

cabouman commented 3 years ago

This raises the issue of standard data sets for testing the code. We need to identify some public data sets that we can put into numpy array format and post in public locations for testing the code.

Can we use one of the data sets on TomoBank to set up a standard test of our code? https://tomobank.readthedocs.io/en/latest/index.html

Also, we need to incorporate Travis CI to do this.

XiaoWang-Github commented 3 years ago

For a memory-bound application such as svmbir, it is usually a good idea to set the number of threads equal to the number of physical cores. Hyper-threading is usually not good for a memory-bound application, because more threads compete for the limited memory bandwidth. Another issue is convergence: more threads lead to slower convergence, and hyper-threading can slow it down further.

cabouman commented 3 years ago

Xiao, thanks for the expert advice.

Let's just implement num_threads = cpu_count(logical=False) for now, and put this to bed.