CUDA version performance improvement

edwinLi-xuewei commented 1 year ago

Hello, thank you for your work on the model. In my experiments on a GTX 1050 Ti, the CUDA version did not improve performance. My env file may be too simple, but the run did not seem to fully use my graphics card. Under what circumstances can the performance improve by several times? If it depends on the environment, is there a specific env file that shows this?

lpisha commented 1 year ago

I apologize for not seeing this until now.

As explained in the readme, the performance of the CUDA version varies and often will be slower than the multithreaded C++ version. This is for two main reasons explained here: https://github.com/A-New-BellHope/bellhopcuda#speed

I think you'll have the best chance of getting a larger performance gain when you have a lot of rays and few receivers (for TL, arrivals, or eigenrays runs). Note that if you're just doing ray trace runs, the CUDA version does not even bother using CUDA for this, as the rays all have to be written serially to the output file, which often takes longer than tracing them.

A GTX 1050 Ti is not particularly powerful anyway--not sure what CPU you're comparing it to or whether this is a desktop or a laptop. What kind of speedup did you get using bellhopcxx multithreaded instead of the original BELLHOP?

Did you try USE_FLOAT? The results may not be reliable in this mode, but it's worth investigating if you care about speed.
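One way to investigate USE_FLOAT is to run the same environment with a double-precision build and a USE_FLOAT build, then compare the two TL fields. The sketch below assumes the TL values have already been loaded into plain Python lists of dB values; `compare_tl` and the 0.5 dB tolerance are hypothetical choices for illustration, not part of bellhopcxx / bellhopcuda.

```python
def compare_tl(reference_db, candidate_db, tolerance_db=0.5):
    """Compare two TL fields given as flat lists of dB values.

    Returns (worst_difference_db, within_tolerance).
    `tolerance_db` is an arbitrary illustrative threshold; pick one that
    matches the accuracy your application actually needs.
    """
    if len(reference_db) != len(candidate_db):
        raise ValueError("TL fields must have the same number of points")
    # Identical values (including two infinities) count as zero difference,
    # which avoids inf - inf producing NaN.
    diffs = [0.0 if r == c else abs(r - c)
             for r, c in zip(reference_db, candidate_db)]
    worst = max(diffs, default=0.0)
    return worst, worst <= tolerance_db


# Made-up values for illustration only, not results from a real run.
double_run = [60.0, 72.5, 85.0]
float_run = [60.1, 72.0, 85.0]
worst, ok = compare_tl(double_run, float_run)
print(f"worst difference: {worst:.2f} dB, within tolerance: {ok}")
```

Repeating this over a representative set of environments gives a rough idea of whether single precision is acceptable for a given use case.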

edwinLi-xuewei commented 1 year ago

Thank you for your patience in answering the question.

I only use the CUDA version for TL calculations. My problem is that each environment is small (few rays and few receivers), but I need to compute a large number of environments, so the CUDA version does not suit my case.

I used the multithreaded version of bellhopcxx and looked at the source code. The speedup scales with the number of CPU cores: the completion time is roughly the original BELLHOP time divided by the number of cores. My test results confirm this.
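The scaling described above can be checked with a small calculation. This is a minimal sketch; `parallel_speedup` is a hypothetical helper, and the timings are placeholders rather than measurements from this thread.

```python
def parallel_speedup(baseline_seconds, parallel_seconds, num_cores):
    """Return (speedup, efficiency) of a parallel run versus a baseline run.

    speedup    = baseline time / parallel time
    efficiency = speedup / number of cores (1.0 means ideal linear scaling)
    """
    if baseline_seconds <= 0 or parallel_seconds <= 0 or num_cores <= 0:
        raise ValueError("timings and core count must be positive")
    speedup = baseline_seconds / parallel_seconds
    efficiency = speedup / num_cores
    return speedup, efficiency


# Placeholder timings: original BELLHOP vs. multithreaded bellhopcxx on 8 cores.
speedup, efficiency = parallel_speedup(
    baseline_seconds=120.0, parallel_seconds=16.0, num_cores=8
)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

An efficiency close to 1.0 matches the "original time / number of cores" behaviour; a much lower value would suggest the run is limited by something other than ray tracing, such as file output.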

Because I care more about the correctness of the results, I have not tried USE_FLOAT. I will run some tests later, and if it does not affect correctness for most of my environments, I will probably use it.

Thank you again for your contribution to the code and your patient answer.

lpisha commented 1 year ago

We have done more testing on the CUDA version on a development branch which will hopefully be merged soon. This testing is on a Threadripper 12-core / 24-thread CPU and an RTX 3080. The results show that when there are more rays than GPU cores (in the 10,000s or more), the CUDA version can perform several times faster than the multithreaded CPU version, even in double-precision mode. However, when there are few rays (in the 100s), the CUDA version is slower than the CPU version, and sometimes even slower than the original Fortran BELLHOP(3D).
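As a rough rule of thumb, the observations in this thread could be turned into a small helper like the one below. It is only a sketch: `suggest_backend` is hypothetical, the thresholds are assumptions read off the "100s" and "10,000s" figures above, and the real crossover depends on the specific GPU, CPU, and run type.

```python
def suggest_backend(num_rays, ray_trace_only=False,
                    few_rays=1_000, many_rays=10_000):
    """Suggest which build to try first for a given run.

    Thresholds are illustrative assumptions, not measured crossover points.
    """
    if num_rays < 0:
        raise ValueError("num_rays must be non-negative")
    if ray_trace_only:
        # Ray trace runs are written serially to the output file, so the
        # CUDA version does not use the GPU for them.
        return "cpu"
    if num_rays >= many_rays:
        return "cuda"
    if num_rays < few_rays:
        return "cpu"
    # In between, the thread gives no data: measure on your own hardware.
    return "benchmark both"


for n in (300, 5_000, 50_000):
    print(n, "->", suggest_backend(n))
```

For workloads made of many small environments, this points to the multithreaded CPU version, which is consistent with the experience reported earlier in the thread.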

Also please note that if you are on Windows, there was a bug just discovered by one of our collaborators where TL results will often get corrupted due to a mistake in the binary file output setup. I am putting up a new version (v1.12) momentarily which has the fix.

oldstylejoe commented 5 months ago

Closing stale issue.

A-New-BellHope / bellhopcuda

CUDA version performance improvement #12