
SAI: a fork of Leela Zero with variable komi.

Implement shared mutex #138

Closed: CGLemon closed this issue 3 years ago

CGLemon commented 3 years ago

Implement a shared mutex for NNCache. Maybe it will speed up the search when there are many threads?

amato-gianluca commented 3 years ago

Hi @CGLemon, thanks for your interest in improving sai. Have you tried running sai with many threads to see whether there is a noticeable benefit in using shared mutexes? Moreover, can you compare your implementation of a shared mutex against the std::shared_mutex class in C++17?

CGLemon commented 3 years ago

In my tests, my shared mutex is significantly faster than std::mutex and std::shared_mutex. But I only tested it on my own computer, an Intel i5 dual-core CPU, so I cannot guarantee it behaves as well on other machines. I uploaded my test code to Google Drive; you can run it on your computers and check whether my shared mutex is actually good or not. https://drive.google.com/drive/folders/1NhOTcEdRWOovcA9G3xr8a-OrQtvVDjgt?usp=sharing

I only have an ordinary computer with an Intel i5 CPU, which is not powerful enough for tests with many threads, so I have not tested sai with a large thread count. But my shared mutex follows the same design as the UCTNode atomic lock. GCP claimed that the atomic lock speeds up the search with many threads by avoiding a big lock; I guessed the shared mutex would give the same result with many threads, because it avoids the big NNCache lock.

Vandertic commented 3 years ago

I am trying to compile the commit locally, and there is a warning: SharedMutex.h:53:10: warning: unused variable 'v'. Is it possible to remove it without deleting the assert completely?
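The usual fix for this kind of warning, assuming v exists only to feed the assert (the actual SharedMutex.h code may differ), is to keep the check and explicitly discard the value, along these lines:

    #include <atomic>
    #include <cassert>

    // Hypothetical sketch of the pattern around SharedMutex.h:53; the
    // counter and the names are illustrative, not the actual sai code.
    void release(std::atomic<int>& counter) {
        auto v = counter.fetch_sub(1);
        assert(v > 0);   // checked only in debug builds
        (void)v;         // silences "unused variable 'v'" when NDEBUG disables the assert
    }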

Vandertic commented 3 years ago

I've tested the commit and everything seems to work fine. I have measured a marginal speed increase. I am very much in favor of merging this commit.

Vandertic commented 3 years ago

@CGLemon there appears to be an occasional problem with this commit, see https://github.com/sai-dev/sai/issues/142. We asked Cabu to test different configurations, and the malfunction was narrowed down to the shared mutex commits. So for now we are reverting to the previous code with release 0.18.2, since the typical speed increase is not very visible.

I don't understand enough about multithreading to make a real contribution, but maybe you or @amato-gianluca can come up with a workaround? I think that even if the shared mutex is not very important for self-play, it could be important for everyday use in game analysis.

CGLemon commented 3 years ago

@Vandertic Many thanks. After I saw the issue, I quickly checked it. Yes, it would stall with many threads (threads >= 10; device: Intel i5-9400F with RTX 2070S). I fixed the shared lock last night; now it looks better and does not stall any more with many threads. I will push it later. But this still confuses me: in my tests, the shared mutex stalled on only about 10% of moves, while in issue #142 it is greater than 50%. Do you have any idea why?

Update: after fixing the shared lock, performance is better, but it still stalls occasionally with many threads (less than 5% of the time).

Cabu commented 3 years ago

@CGLemon If you want to run some sai versions with some log/trace output that could help you pinpoint the problem, I would be glad to help you.

amato-gianluca commented 3 years ago

@CGLemon I have some doubts about the implementation of shared mutexes in your library. When the library is not able to acquire the lock, it calls std::this_thread::sleep_for and waits for 100 microseconds. This seems highly inefficient. In particular:

  1. even if the lock is freed immediately after, the thread cannot proceed before the time limit expires;
  2. after 100 microseconds the thread is woken, even if the lock is still not available.

I think the implementation should use some low-level OS primitive (such as futexes on Linux) to improve this behaviour. For this reason, I really think it would be better to use the standard C++17 implementation of shared mutexes, which (I assume) is optimized for each operating system.
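For reference, the C++17 pattern suggested here looks roughly as follows; the map and the key/value types are illustrative stand-ins, not sai's actual NNCache code:

    #include <cstdint>
    #include <mutex>
    #include <shared_mutex>
    #include <unordered_map>

    // Sketch of a cache guarded by std::shared_mutex: lookups take a
    // shared (s) lock and may overlap; inserts take an exclusive (x) lock.
    class CacheSketch {
        mutable std::shared_mutex m_mutex;
        std::unordered_map<std::uint64_t, int> m_map;  // stand-in for hash -> Netresult
    public:
        bool lookup(std::uint64_t hash, int& result) const {
            std::shared_lock<std::shared_mutex> lock(m_mutex);  // readers overlap
            auto it = m_map.find(hash);
            if (it == m_map.end()) return false;
            result = it->second;
            return true;
        }
        void insert(std::uint64_t hash, int result) {
            std::unique_lock<std::shared_mutex> lock(m_mutex);  // writers exclusive
            m_map[hash] = result;
        }
    };

The win, if any, comes when concurrent lookups dominate inserts; a write-heavy workload still serializes on the exclusive lock.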

CGLemon commented 3 years ago

@amato-gianluca The lock takes a break after doing a "test and set". The reason is that failed "test and set" operations still consume CPU resources. A "test and set" is very quick; if the thread does not sleep, it will retry far too many times before the lock is released. Sleeping for a while keeps the spinning thread from occupying a CPU core, and has much the same effect as std::this_thread::yield. You may remove the sleep_for and test it; you will find it is worse. I do not fully understand the behavior of locks. Actually, I am not a computer science student, so this conclusion is just my experience. Sorry for that.
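What is described here is essentially a test-and-set spin lock with a sleep as backoff. A minimal sketch of that pattern (not the code from the commit) would be:

    #include <atomic>
    #include <chrono>
    #include <thread>

    // Minimal sketch of a test-and-set spin lock with sleep backoff.
    class SpinLockSketch {
        std::atomic<bool> m_locked{false};
    public:
        void lock() {
            // exchange() is the quick "test and set": it returns the old value.
            while (m_locked.exchange(true, std::memory_order_acquire)) {
                // Back off instead of hammering the flag, so a waiting
                // thread does not keep a CPU core fully busy.
                std::this_thread::sleep_for(std::chrono::microseconds(100));
            }
        }
        void unlock() {
            m_locked.store(false, std::memory_order_release);
        }
    };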

Is the standard C++17 shared mutex better? I am not sure, because it is bad when there are too many x-locks, but very fast when most locks are s-locks. I will test it in sai (0.18.2) tomorrow. A big problem is that I have no Windows machine, so I can't compare across different operating systems.
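One illustrative way to measure that x-lock vs s-lock trade-off is a micro-benchmark along these lines, with the read/write mix as the knob; this is a sketch, not the test harness actually used in this thread:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <shared_mutex>
    #include <thread>
    #include <vector>

    // Illustrative micro-benchmark: THREADS workers, mostly reads
    // (s-locks), occasional writes (x-locks). Lower WRITE_EVERY to
    // make the workload more write-heavy and watch the timing change.
    int main() {
        constexpr int THREADS = 8, ITERS = 200000, WRITE_EVERY = 100;
        std::shared_mutex mtx;
        long shared_value = 0;
        std::atomic<long> sink{0};  // keeps the reads observable
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int t = 0; t < THREADS; ++t) {
            pool.emplace_back([&] {
                long local = 0;
                for (int i = 0; i < ITERS; ++i) {
                    if (i % WRITE_EVERY == 0) {
                        std::unique_lock<std::shared_mutex> lk(mtx);  // x-lock
                        ++shared_value;
                    } else {
                        std::shared_lock<std::shared_mutex> lk(mtx);  // s-lock
                        local += shared_value;
                    }
                }
                sink += local;
            });
        }
        for (auto& th : pool) th.join();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%d threads: %lld ms\n", THREADS, static_cast<long long>(ms));
        return 0;
    }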

amato-gianluca commented 3 years ago

I do not suggest removing sleep_for or replacing it with yield, but rather using OS synchronization primitives such as futexes. When a thread suspends on a futex, it is only woken when it can enter the critical region, which should be better than just waking from time to time and checking whether the mutex is available. However, these synchronization primitives change a lot from one OS to another, so cross-platform code would need to handle many special cases depending on the platform. Since this job has already been done in the C++17 standard library, I do not see any reason not to use the standard shared_mutex instead of custom code, which is harder to maintain and not as thoroughly tested.
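For completeness, the futex idea looks roughly like this on Linux; this is a condensed variant of the classic lock from Ulrich Drepper's "Futexes Are Tricky", Linux-only and only a sketch of the principle:

    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // Sketch of a futex-backed lock. A contended thread sleeps in the
    // kernel and is woken exactly when the lock is freed, instead of
    // polling on a timer.
    class FutexLockSketch {
        std::atomic<int> m_state{0};  // 0 free, 1 locked, 2 locked with waiters

        static long futex(std::atomic<int>* addr, int op, int val) {
            return syscall(SYS_futex, reinterpret_cast<int*>(addr),
                           op, val, nullptr, nullptr, 0);
        }
    public:
        void lock() {
            int c = 0;
            if (m_state.compare_exchange_strong(c, 1)) return;  // fast path
            if (c != 2) c = m_state.exchange(2);                // announce a waiter
            while (c != 0) {              // sleep until unlock() wakes us
                futex(&m_state, FUTEX_WAIT, 2);
                c = m_state.exchange(2);
            }
        }
        void unlock() {
            if (m_state.exchange(0) == 2)  // a waiter may be sleeping
                futex(&m_state, FUTEX_WAKE, 1);
        }
    };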

I am also trying to get some numbers on the effective overhead caused by the exclusive lock in NNCache, to understand whether an optimization is worth the pain. I have some preliminary results obtained with SystemTap. By tracing the execution of

sai -w 7197c04e4716eaf3d073630e8cae658f24de4a1c6b40627dd23b90a29561e8dc.gz -t 4

during four go commands, I obtained the following:

mutex 0x18f3d60 (0x18f3d60) contended 19 times, 11 avg usec, 150 max usec, 217 total usec, popup at
__pthread_mutex_unlock_usercnt+0x89 [libpthread-2.32.so]
_ZN7Network10get_outputEPK9GameStateNS_8EnsembleEibbb+0x17b [sai]
_ZN7UCTNode15create_childrenER7NetworkRSt6atomicIiER9GameStateRfS7_S7_f+0xf6 [sai]

mutex was last contended at
__lll_lock_wait+0x30 [libpthread-2.32.so]
__pthread_mutex_lock+0xe3 [libpthread-2.32.so]
_ZN7NNCache6insertEmRKNS_9NetresultE+0x2d [sai]
_ZN7Network10get_outputEPK9GameStateNS_8EnsembleEibbb+0x17b [sai]
_ZN7UCTNode15create_childrenER7NetworkRSt6atomicIiER9GameStateRfS7_S7_f+0xf6 [sai]

It says that the mutex used in NNCache::insert has been contended 19 times. The total time threads have been suspended waiting for the mutex is 217 usec. Note also that the average waiting time is 11 usec, much smaller than the 100 usec wait you use in sleep_for.

However, this is not a definitive answer, since:

  1. results vary a lot from one tracing attempt to the other;
  2. there is something fishy in the data, since Network::get_output using the same mutex as NNCache::insert is unexpected.

Vandertic commented 3 years ago

To see heavier use of NNCache, try this: after some go commands, give undo a couple of times and then go again, or play a reasonable alternative move and then go, so that the cache really gets used.
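Concretely, a session along these lines should exercise the cache on the final search; go and undo are the commands already mentioned in this thread, and the trailing comments are only annotations:

    go        # search and play a move; the results land in NNCache
    go
    undo      # take back the last two moves
    undo
    go        # search again: many positions should now be cache hits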

CGLemon commented 3 years ago

@amato-gianluca Following your suggestions, I optimized the shared mutex by removing the sleeping. Now it is better than std::mutex. In addition, I added the C++17 shared mutex; it is also better than std::mutex. Here it is: #143.

But these tests are not complete; I will do more testing.