db3108 / michi-c2

Michi-c2 --- development version for michi-c
25 stars 9 forks source link

Could you give me favor? #11

Closed y-ich closed 8 years ago

y-ich commented 8 years ago

Hi.

I will be happy if you give me any advice. I am porting your michi-c2 in Rust language. Porting is done, but its performance of tree_search is 9 times slower than yours even if I carefully removed allocations and memcpy that your code does not have. I checked that counts of heavily iterated parts are same and compared two by sampling profiler (Instruments on OS X) and found that heavily iterated parts in my code have more samplings then yours. I guess that it means mine fails to keep L1 cache and is stalled. What do you think? And if so, do you have any development tools to check whether your code keep executing in L1 cache?

Thanks in advance

db3108 commented 8 years ago

Hi y-ich,

Sorry I do not know rust language at all, so my advice could be wrong.

What I tried , in michi-c2, was to stay as close as possible to the michi.py algortihms (in plain C). Except that I kept tracks of the connections between stones (stones are connected if they are in the same block) and also of the liberties every block has. I was careful so that the data structure needed to register these connections is small enough so all the memory needed to play a random game can stay in the L1 cache of a standard Intel processor (32kbytes).

What is the real performance driver in your rust code ?

There are much more more improvements available to make random playouts faster in C. I will come back to this later, but for the moment I have no time to improve michi-c2. Sorry.

Denis

y-ich commented 8 years ago

Thank you for your comments in your busy time.

mcplayout is dominant part. Both (playout and tree update) are about 9 times slower than yours. An interesting thing is that struct copy (memmove) is 7 times samplings than yours though counts of copying struct Position are same! Actually Position in my code is 1.06 times larger than you because I make bit board struct for libertes and merged bit board and nbits. Araay alignment made Position larger.

I am also interested in improving code but my main interest is if Rust is competitive with C in performance.

I have one quick question. Do you have any penaltiy experience when you miss L1 cache?

db3108 commented 8 years ago

From what you write, if I correctly interpret it, it seems that the penalty of using rust is about a nine times slowdown for basic copying operation with respect to C. Am I wrong ?

If I, again, correctly understand what you write. It seems that our representation of the block liberties is the same (for the 19x19 board : a bitfield of 384 bits in my case, in order that it fits nicely in 3 x 128 bits XMM registers of the basic intel processors ... but this is not used today in the current michi-c2 code. I guess that what you use as bitboard is the same ... except that is is maybe not a multiple of 128 bits).

Sorry. I can't answer your "quick" question quickly. I certainly lack knowledge in tools for analyzing the low level performance of the code. I currently use the Linux perf tool, and I am not sure that I am able to correctly interpret all the results. I know that I should switch to Mac OS and use the "Intrument" provided with it but, for the moment, I didn't find enough time to dive into it.

In the C code, it seems that most of the time is lost in "branch mispredictions" (as it seems often to be the case for "integer code" according to eg. the classic Hennessy, Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach). This seems to be quite understandable according to the number of tests that are performed in the code (and most importantly, the difficulty to predict the results of these tests in the case of Go). The cache misses does not seem to be significant. I didn't have any experience with cache miss penalty as one of my primary priorities is to avoid them. I didn't check that this is the most important aspect of the code performance. I took that for granted ;-)) But you are right, this should be checked.

y-ich commented 8 years ago

Thank you for your sincere answer. Your understanding is right and what I wanted to emphasize is, both your C code and my Rust code use same memmove function in libc and calling counts are also same (though it may be not precise). I wonder difference of mispredictions may not exist in memmove. I have never experienced the impact of speed concretely when you miss L1 cache, so I asked just your feeling. Anyway,thank you for your michi-c2 project. Pachi is tool big for me and michi is slightly difficult to be modified because it treats board representation as string.