Closed CGLemon closed 3 years ago
Hi @CGLemon, thanks for your interest in improving sai. Have you tried executing sai with many threads to see if there is a noticeable benefit from using shared mutexes? Moreover, can you compare your implementation of a shared mutex against the std::shared_mutex class in C++17?
In my tests, my shared mutex is significantly faster than both std::mutex and std::shared_mutex. But I only tested it on my own computer, which has a dual-core Intel i5 CPU, so I cannot guarantee that it will perform well on other machines. I have uploaded my test code to Google Drive; you can check it on your computers and verify whether my shared mutex really is an improvement. https://drive.google.com/drive/folders/1NhOTcEdRWOovcA9G3xr8a-OrQtvVDjgt?usp=sharing
I only have an ordinary computer with an Intel i5 CPU, which is not powerful enough for many threads, so I have not tested sai with a large thread count. However, my shared mutex uses the same design as the UCTNode atomic lock. GCP claimed that the atomic lock can speed up the search at high thread counts by avoiding a big lock. I would guess the shared mutex gives a similar result with many threads, because it avoids the big NNCache lock.
I am trying to compile the commit locally, and there is a warning
SharedMutex.h:53:10: warning: unused variable ‘v’
Is it possible to remove it without deleting the assert completely?
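One common way to silence this kind of warning while keeping the assert is to mark the variable as used, since with NDEBUG the assert expands to nothing and the variable would otherwise be unreferenced. The helper below is a hypothetical stand-in, not the actual code at SharedMutex.h:53:

```cpp
#include <cassert>

// Hypothetical stand-in for the call whose result feeds the assert.
static int try_acquire() { return 0; }

void checked_acquire() {
    const auto v = try_acquire();
    assert(v == 0);  // still checked in debug builds
    (void)v;         // marks v as used, silencing -Wunused-variable
                     // when NDEBUG compiles the assert away
}
```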
I've tested the commit and everything seems to work fine. I have measured a marginal speed increase. I am very much in favor of pulling this commit.
@CGLemon there appears to be an occasional problem with this commit; see https://github.com/sai-dev/sai/issues/142. We asked Cabu to test different configurations, and the malfunction was restricted to the shared mutex commits. So for now we are reverting to the previous code with release 0.18.2, since the typical increase in speed is not very visible.
I don't understand multithreading well enough to make a real contribution, but maybe you or @amato-gianluca can come up with a workaround? I think that even if the shared mutex is not very important for self-play, it could be important for everyday use in game analysis.
@Vandertic Thanks very much. After I saw the issue, I quickly checked it. Yes, it would stall with many threads (threads >= 10; device: Intel i5-9400F with an RTX 2070S). I fixed the shared lock last night; now it looks better and no longer stalls with many threads. I will push it later. But this still confuses me: in my tests, the shared mutex stalled on only about 10% of moves, while in issue #142 it is greater than 50%. Do you have any idea why?
Update: after fixing the shared lock, the performance was better, but it still stalls occasionally with many threads (less than 5% of the time).
@CGLemon If you want to run some sai versions with logging/tracing that could help you pinpoint the problem, I would be glad to help.
@CGLemon I have some concerns about the implementation of shared mutexes in your library. When it fails to acquire the lock, it calls std::this_thread::sleep_for and waits for 100 microseconds, which seems highly inefficient.
I think the implementation should use some low-level OS primitive (such as futexes on Linux) to improve this behaviour. For this reason, I really think it would be better to use the standard C++17 implementation of shared mutexes, which (I assume) is optimized for the different operating systems.
@amato-gianluca The lock takes a break after each "test and set". The reason is that failed "test and set" attempts still consume CPU resources. The "test and set" itself is very quick, so if the thread does not sleep, it retries far too many times before the lock is released. Sleeping for a while keeps the waiting thread from occupying a CPU core; in that respect, sleeping has a similar effect to std::this_thread::yield. You may remove the sleep_for and test it yourself; you will find it performs worse. I do not fully understand the behavior of locks. Actually, I am not a computer science student, so this conclusion is just from my experience. Sorry about that.
Is the standard C++17 shared mutex better? I am not sure: it is bad if there are too many exclusive (x) locks, but very fast if most locks are shared (s) locks. I will test it in sai (0.18.2) tomorrow. A big problem is that I have no Windows machine, so I cannot compare across operating systems.
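The comparison described above could be measured with a rough micro-benchmark like the sketch below. The helper name and parameters are mine, and absolute numbers vary wildly across machines; this only illustrates the methodology of timing a read-mostly workload under each lock type.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Time n_threads threads each taking the lock iters times for a read.
// Instantiated once with std::mutex (always exclusive) and once with
// std::shared_mutex + std::shared_lock (readers do not exclude each other).
template <typename Mutex, typename ReadLock>
long long time_readers(int n_threads, int iters) {
    Mutex m;
    const int value = 1;                   // stand-in for cached data
    std::atomic<long long> total{0};       // keeps the reads from being optimized out
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> ts;
    for (int t = 0; t < n_threads; ++t) {
        ts.emplace_back([&] {
            long long local = 0;
            for (int i = 0; i < iters; ++i) {
                ReadLock lk(m);            // s-lock vs x-lock is the whole difference
                local += value;            // stand-in for a cache lookup
            }
            total += local;
        });
    }
    for (auto& t : ts) t.join();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
        .count();
}
```

With enough threads and a read-mostly workload, the std::shared_lock instantiation would be expected to scale better; with frequent writers the picture can invert, matching the x-lock caveat above.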
I do not suggest removing sleep_for or replacing it with yield, but using some OS synchronization primitive such as futexes. When a thread suspends on a futex, it is only woken when it can enter the critical region, which should be better than just waking from time to time and checking whether the mutex is available. However, these synchronization primitives change a lot from one OS to the other, so cross-platform code would need to handle a lot of special cases depending on the platform. Since this job has already been done in the C++17 standard library, I do not see any reason not to use the standard shared_mutex instead of custom code which is harder to maintain and not as thoroughly tested.
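For concreteness, here is a sketch of how an NNCache-style table could be guarded by the standard C++17 std::shared_mutex, as suggested above. The class and member names are illustrative assumptions, not the actual sai NNCache interface:

```cpp
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

// Illustrative NNCache-like table using std::shared_mutex (C++17).
class CacheSketch {
    mutable std::shared_mutex m_mutex;
    std::unordered_map<std::uint64_t, float> m_table;
public:
    bool lookup(std::uint64_t hash, float& out) const {
        std::shared_lock<std::shared_mutex> lk(m_mutex);  // many concurrent readers
        const auto it = m_table.find(hash);
        if (it == m_table.end()) {
            return false;
        }
        out = it->second;
        return true;
    }
    void insert(std::uint64_t hash, float value) {
        std::unique_lock<std::shared_mutex> lk(m_mutex);  // single writer
        m_table[hash] = value;
    }
};
```

The standard library picks the best available OS primitive (e.g. pthread read-write locks on POSIX systems) under the hood, which is exactly the maintainability argument made above.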
I am also trying to get some numbers on the effective overhead caused by the exclusive lock in NNCache, to understand whether an optimization is worth the pain. I have some preliminary results obtained with SystemTap. By tracing the execution of
sai -w 7197c04e4716eaf3d073630e8cae658f24de4a1c6b40627dd23b90a29561e8dc.gz -t 4
over four go commands, I obtained the following:
```
mutex 0x18f3d60 (0x18f3d60) contended 19 times, 11 avg usec, 150 max usec, 217 total usec, popup at
    __pthread_mutex_unlock_usercnt+0x89 [libpthread-2.32.so]
    _ZN7Network10get_outputEPK9GameStateNS_8EnsembleEibbb+0x17b [sai]
    _ZN7UCTNode15create_childrenER7NetworkRSt6atomicIiER9GameStateRfS7_S7_f+0xf6 [sai]
mutex was last contended at
    __lll_lock_wait+0x30 [libpthread-2.32.so]
    __pthread_mutex_lock+0xe3 [libpthread-2.32.so]
    _ZN7NNCache6insertEmRKNS_9NetresultE+0x2d [sai]
    _ZN7Network10get_outputEPK9GameStateNS_8EnsembleEibbb+0x17b [sai]
    _ZN7UCTNode15create_childrenER7NetworkRSt6atomicIiER9GameStateRfS7_S7_f+0xf6 [sai]
```
It says that the mutex used in NNCache::insert has been contended 19 times. The total time threads have been suspended waiting for the mutex is 217 usec. Note also that the average waiting time is 11 usec, much smaller than the 100 usec wait you use in sleep_for.
However, this is not a definitive answer, since Network::get_output using the same mutex as NNCache::insert is unexpected. To see a more important use of NNCache, try, after some go commands, giving undo a couple of times and then go again, or playing a reasonable alternative move and then go, so that the cache really needs to be used.
@amato-gianluca Following your suggestions, I optimized the shared mutex by removing the sleeping. Now it is better than std::mutex. I also added a C++17 shared mutex version, which likewise beats std::mutex. Here it is: #143.
These tests are not complete yet; I will run more.
Implement a shared mutex for NNCache. Maybe it will speed up the search with many threads?