dendibakh / perf-book

The book "Performance Analysis and Tuning on Modern CPU"
https://book.easyperf.net/perf_book
Creative Commons Zero v1.0 Universal

Chapter 3 - ISA: X86 duality #44

Closed by pveentjer 2 months ago

pveentjer commented 3 months ago

Added a note about the duality of the X86: register-memory at the ISA level, load/store at the microarchitecture level.

dendibakh commented 3 months ago

Not sure what value it provides to the readers. If there is some implicit point that you're trying to make, then I suggest making it explicit.

pveentjer commented 3 months ago

Hi Denis,

thank you for the review.

The point I'm trying to make is that the initial section states that modern architectures are load-store. X86 is one of the most widely used ISAs, and it is a register-memory architecture, not a load-store one. As a consequence, a reader of the book could wrongly assume that the X86 ISA is a load-store architecture.

If you think this brings no value, I'll close the PR.

If you think this brings value, I'm happy to rewrite it. Perhaps add it as a footnote?

Regards,

Peter.

dendibakh commented 3 months ago

Ok, I changed the original paragraph: "register-based, load-store architectures" -> "register-memory architectures". Let me know if that's good now.

pveentjer commented 3 months ago

Hi Denis,

the original text was correct: modern ISAs like RISC-V and ARM are load-store architectures.

The X86 ISA is register-memory, but once instructions are converted to uops, the X86 microarchitecture effectively behaves like a load-store architecture as well.

dendibakh commented 3 months ago

Ooops, of course, you're right. I was implicitly thinking about x86 again. :) Will fix.

dendibakh commented 3 months ago

Please check now.

pveentjer commented 3 months ago

What is still missing is that the X86 microarchitecture becomes a load/store architecture as a result of the uops conversion.

Given the following pseudo-assembly:

add [C],[A],[B] ;; C=A+B

After uops conversion it could look like this:

load r1, [A]         ;; load [A] in r1
load r2, [B]         ;; load [B] in r2
add r3, r1, r2       ;; add r1 and r2 and write it to r3
store r3, [C]        ;; store r3 in [C]

I find this view very helpful: I need to think a lot less about the complex addressing modes, and it helps me see the performance opportunities. In the first example, it isn't immediately clear that the loads of [A] and [B] are independent and can be performed out of order; in the uops version, it is much more obvious.
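To make the independence of the two loads concrete, here is a minimal sketch in C of a toy interpreter for the cracked sequence. Everything in it is an assumption made for illustration: the `uop` struct, the `UOP_*` kinds, and the helper names `run_uops` and `add_via_uops` are hypothetical, and it does not try to reproduce how any real X86 core encodes or schedules uops. It only shows that swapping the two load uops leaves the stored result unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical micro-op kinds for a toy model (not real X86 uops). */
typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } uop_kind;

typedef struct {
    uop_kind kind;
    int dst;   /* destination register (LOAD, ADD) or memory slot (STORE) */
    int src1;  /* memory slot (LOAD) or source register (ADD, STORE)      */
    int src2;  /* second source register (ADD only)                       */
} uop;

/* Toy memory slots standing in for [A], [B] and [C]. */
enum { SLOT_A = 0, SLOT_B = 1, SLOT_C = 2, NUM_SLOTS = 3, NUM_REGS = 4 };

/* Execute the uops in the order given against the toy memory. */
void run_uops(const uop *ops, int n, int32_t mem[NUM_SLOTS]) {
    int32_t reg[NUM_REGS] = {0};
    for (int i = 0; i < n; i++) {
        const uop *u = &ops[i];
        switch (u->kind) {
        case UOP_LOAD:
            assert(u->src1 >= 0 && u->src1 < NUM_SLOTS);
            reg[u->dst] = mem[u->src1];
            break;
        case UOP_ADD:
            reg[u->dst] = reg[u->src1] + reg[u->src2];
            break;
        case UOP_STORE:
            assert(u->dst >= 0 && u->dst < NUM_SLOTS);
            mem[u->dst] = reg[u->src1];
            break;
        }
    }
}

/* Run load/load/add/store for C = A + B and return the stored value.
   A non-zero swap_loads issues the load of [B] before the load of [A]. */
int32_t add_via_uops(int32_t a, int32_t b, int swap_loads) {
    int32_t mem[NUM_SLOTS] = { a, b, 0 };
    uop load_a = { UOP_LOAD, 1, SLOT_A, 0 };  /* load r1, [A] */
    uop load_b = { UOP_LOAD, 2, SLOT_B, 0 };  /* load r2, [B] */
    uop ops[4];
    ops[0] = swap_loads ? load_b : load_a;
    ops[1] = swap_loads ? load_a : load_b;
    ops[2] = (uop){ UOP_ADD, 3, 1, 2 };       /* add r3, r1, r2 */
    ops[3] = (uop){ UOP_STORE, SLOT_C, 3, 0 }; /* store r3, [C] */
    run_uops(ops, 4, mem);
    return mem[SLOT_C];
}
```

Calling `add_via_uops(2, 3, 0)` and `add_via_uops(2, 3, 1)` gives the same value, because neither load reads a register the other one writes. Only the add has to wait for both.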

It could also help to prevent people from manually 'optimizing' code like this (C example):

register int a = A;
register int b = B;
C = a + b;

This is written by a 'smart' engineer who wants to help the CPU by giving it more opportunities for out-of-order execution, because he doesn't understand the uops version of add [C],[A],[B]. But he is just making the code bigger and more complex, and the CPU will already do this for him anyway.
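As a small sketch of that point, the two variants can be put side by side in C. The globals `A`, `B`, `C` and the function names `add_direct`, `add_with_locals` and `check_equivalent` are hypothetical, chosen only for this illustration. Both functions compute the same value; whether a given compiler emits identical machine code for them is an assumption worth checking, for example by comparing the output of `-O2 -S`.

```c
#include <assert.h>

/* Hypothetical globals standing in for the memory locations A, B and C. */
int A, B, C;

/* Straightforward version: let the compiler and the CPU do the work. */
void add_direct(void) {
    C = A + B;
}

/* Hand-'optimized' version with explicit register locals. */
void add_with_locals(void) {
    register int a = A;
    register int b = B;
    C = a + b;
}

/* Run both variants on the same inputs and return the common result. */
int check_equivalent(int x, int y) {
    A = x; B = y; C = 0;
    add_direct();
    int direct = C;

    A = x; B = y; C = 0;
    add_with_locals();
    int with_locals = C;

    assert(direct == with_locals);
    return direct;
}
```

The extra locals do not change what is computed; they only add lines to read.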

dendibakh commented 3 months ago

Ok, I agree, but I think this section is not the best place to discuss uops. I have a section for this: 4-4 UOP.md. There I talk about uops cracking. Please check it and let me know if you have any comments.

dendibakh commented 2 months ago

Thank you @pveentjer!