doyubkim / fluid-engine-dev

Fluid simulation engine for computer graphics applications
https://fluidenginedevelopment.org/
MIT License

Threading taking time to setup, slowing down simulations. #202

Closed kentbarber closed 4 years ago

kentbarber commented 5 years ago

Firstly, let me say I haven't run any specific numbers on this yet. But when simulating a scene on a 4-core MacBook using only 2 of those cores (at 100%), it seems to run significantly faster than on my 10-core i7 Windows machine, which uses 8 of its cores at about 14% each.

I am using JET_TASKING_CPP11THREADS threads.

I believe the slowdown comes from the time required to split up the jobs: when a job is small, splitting it across threads takes longer than simply doing it on a single core.

In the code for parallelFor you are dividing up the work (slices) equally between all threads. But there is no threshold for the minimum number of tasks per thread. This means that if I have 10 jobs then each thread gets 1 job. So it starts up 10 threads, distributes the 10 jobs and then finishes.

It would be nice if parallelFor had a default limit (perhaps 100,000 items) that it would handle on a single thread before breaking the job up across threads. It could also use this same limit to decide how many cores to actually use, so that it only starts a new thread when it needs one. Optionally, we could pass in a different threshold if we wanted to use it for our own jobs.

For example, let's say the limit was 100,000 items. It would only use threading once the item count goes over this limit. With more than 100,000 items it would split the work across 2 threads, and past 200,000 items it would go to 3 threads.
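The proposal above could be sketched roughly as follows. This is not Jet's actual parallelFor; `threadCountFor` and `parallelForWithThreshold` are hypothetical names made up for illustration, the 100,000 default is just the figure suggested above (untested), and the sketch uses plain `std::thread` with no exception handling.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not part of Jet): how many threads to use for `count`
// items if each thread should own up to `threshold` items.
// 1..threshold items -> 1 thread, threshold+1..2*threshold -> 2 threads, etc.
inline std::size_t threadCountFor(std::size_t count, std::size_t threshold,
                                  std::size_t maxThreads) {
    if (threshold == 0) threshold = 1;
    if (maxThreads == 0) maxThreads = 1;
    const std::size_t wanted = (count + threshold - 1) / threshold;  // ceil
    return std::max<std::size_t>(1, std::min(wanted, maxThreads));
}

// Hypothetical threshold-aware loop: runs func(i) for i in [begin, end).
// Small ranges run inline on the calling thread with no thread startup cost.
template <typename Func>
void parallelForWithThreshold(std::size_t begin, std::size_t end,
                              const Func& func,
                              std::size_t threshold = 100000) {
    if (end <= begin) return;
    const std::size_t count = end - begin;
    // hardware_concurrency() may return 0; threadCountFor treats that as 1.
    const std::size_t numThreads =
        threadCountFor(count, threshold, std::thread::hardware_concurrency());

    if (numThreads == 1) {
        for (std::size_t i = begin; i < end; ++i) func(i);
        return;
    }

    const std::size_t chunk = (count + numThreads - 1) / numThreads;
    std::vector<std::thread> workers;
    workers.reserve(numThreads - 1);
    for (std::size_t t = 1; t < numThreads; ++t) {
        const std::size_t lo = begin + t * chunk;
        const std::size_t hi = std::min(end, lo + chunk);
        if (lo >= hi) break;  // nothing left for this thread
        workers.emplace_back([lo, hi, &func] {
            for (std::size_t i = lo; i < hi; ++i) func(i);
        });
    }
    // The calling thread processes the first chunk itself.
    for (std::size_t i = begin; i < std::min(end, begin + chunk); ++i) func(i);
    for (auto& w : workers) w.join();
}
```

With this rule, a call over 10 items never leaves the calling thread, while 250,000 items would use 3 threads on a machine that has at least 3 hardware threads.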

Some testing should be done to determine what a good number is before splitting up the job into threads.

I will try to find some time next week to get some real numbers for a scene on my 4-core MacBook vs. my 10-core i7. But right now the MacBook is winning on simulation speed, since it uses just 2 cores at a full 100% compared to my i7 using 8 of its cores at 14%.

So the MacBook doesn't spend as much time starting and stopping threads. It lets each core do more work at full capacity for longer before switching to a new batch of jobs (i.e. starting and stopping threads).

doyubkim commented 5 years ago

Thanks for the finding, @kentbarber !

As you already discovered, JET_TASKING_CPP11THREADS is the least efficient way to parallelize the tasks. It doesn't have any smart scheduling or job distribution mechanism; it is the most minimal form of parallelization. I recommend testing with the TBB option to see how it scales. Also, please take a look at PR #145, which shows earlier experiments done by @jeffamstutz.

Even with TBB/OpenMP, however, it may not fully utilize all the cores. The main challenge in numerical simulations like this, in general, is memory bandwidth: shallow for-loops such as a simple dot product or axpy operation are not really scalable. Carefully crafting the memory access pattern and optimizing the loops could help.

In summary, please try out the TBB or OpenMP option and let me know how it works on your machines.

kentbarber commented 5 years ago

I have been meaning to try out the TBB option for a while now. Does TBB work well on OSX? I will see if I can find time to test on Windows and OSX this weekend. I won't use OpenMP, since there have been known compatibility issues with C4D in the past, so I would rather not enable it.

Also thanks for getting back to me on this issue so quickly. The framework is great!

doyubkim commented 5 years ago

Thanks!

TBB works great on macOS. I normally install it via Homebrew (brew install tbb), and then CMake should detect it automatically.

I wasn't aware of OpenMP compat issue with C4D. Do you have any references/links to this issue? Is it a problem on both macOS and Windows?

kentbarber commented 5 years ago

I just switched it over to TBB on Windows. I still need to do a proper comparison, but the cores now jump up to 40% for a particular sim, so it's an improvement.

Regarding OpenMP, I don't remember the exact issue. But they moved completely away from OpenMP many years back and now have their own custom threading system. So to keep compatibility issues to a minimum, it's recommended not to use it. It may have been related to the Intel compiler being used for C4D while 3rd-party devs used the Visual Studio compiler.