0xProject / OpenZKP

OpenZKP - pure Rust implementations of Zero-Knowledge Proof systems.

Adding comments to assembly.rs #657

Open jon-chuang opened 4 years ago

jon-chuang commented 4 years ago

Hi, I am looking to reference your assembly code: https://github.com/0xProject/OpenZKP/blob/b46dfbb13180cc5f4a7ba82cc8c24d3b7bee776a/algebra/u256/src/algorithms/assembly.rs#L5. However, most functions do not have a description of what they are supposed to do; `mul_1_asm` helpfully does (`// Computes r[0..5] = a * b[0..4]`). Would it be possible to add some of these comments? Thanks.

recmo commented 4 years ago

The assembly routines are currently unused and a work in progress. I hope to get back to them soon. For now, they are resting in master, with the understanding that they are not finished and not up to the same standard as the in-use code.

Most of these functions are different versions of 256-bit multiplication with Montgomery reduction using the ADX instructions. Is there a specific function you have questions about?
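For readers who want the algorithm without the asm: below is a hedged pure-Rust sketch of CIOS ("no-carry") Montgomery multiplication, the algorithm these routines implement. The names (`mont_mul`, `mac`, `adc`, `mont_inv`) and structure are mine, not OpenZKP's; `inv` plays the role of the `inverse` operand in the assembly.

```rust
/// Multiply-accumulate: a + b*c + carry, returned as (low, high) 64-bit halves.
fn mac(a: u64, b: u64, c: u64, carry: u64) -> (u64, u64) {
    let t = (a as u128) + (b as u128) * (c as u128) + (carry as u128);
    (t as u64, (t >> 64) as u64)
}

/// a + b + carry, returned as (low, high).
fn adc(a: u64, b: u64, carry: u64) -> (u64, u64) {
    let t = (a as u128) + (b as u128) + (carry as u128);
    (t as u64, (t >> 64) as u64)
}

/// -modulus[0]^{-1} mod 2^64 (the `inverse` parameter), by Newton's iteration.
fn mont_inv(m0: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(m0.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// CIOS Montgomery multiplication: a * b * R^{-1} mod m, with R = 2^(64*N).
fn mont_mul<const N: usize>(a: &[u64; N], b: &[u64; N], m: &[u64; N], inv: u64) -> [u64; N] {
    let mut t = vec![0u64; N + 2];
    for i in 0..N {
        // t += a * b[i]
        let mut carry = 0;
        for j in 0..N {
            let (lo, hi) = mac(t[j], a[j], b[i], carry);
            t[j] = lo;
            carry = hi;
        }
        let (lo, hi) = adc(t[N], carry, 0);
        t[N] = lo;
        t[N + 1] = hi;
        // k makes the low limb of t + k*m vanish, so we can shift down a limb
        let k = t[0].wrapping_mul(inv); // the `wrapping_mul` in the asm comments
        let (_, mut carry) = mac(t[0], k, m[0], 0);
        for j in 1..N {
            let (lo, hi) = mac(t[j], k, m[j], carry);
            t[j - 1] = lo;
            carry = hi;
        }
        let (lo, hi) = adc(t[N], carry, 0);
        t[N - 1] = lo;
        t[N] = t[N + 1] + hi;
    }
    // the result is in [0, 2m); one conditional subtraction normalizes it
    let mut r = [0u64; N];
    r.copy_from_slice(&t[..N]);
    let mut ge = t[N] != 0;
    if !ge {
        ge = true; // r == m also gets subtracted down to zero
        for i in (0..N).rev() {
            if r[i] != m[i] {
                ge = r[i] > m[i];
                break;
            }
        }
    }
    if ge {
        let mut borrow = 0u64;
        for i in 0..N {
            let (d1, b1) = r[i].overflowing_sub(m[i]);
            let (d2, b2) = d1.overflowing_sub(borrow);
            r[i] = d2;
            borrow = (b1 | b2) as u64;
        }
    }
    r
}

fn main() {
    let m0: u64 = 0xFFFF_FFFF_FFFF_FFC5; // 2^64 - 59, an odd modulus for testing
    let inv = mont_inv(m0);
    let r = mont_mul(&[3u64], &[5u64], &[m0], inv);
    println!("3 * 5 * R^-1 mod m = {:#x}", r[0]);
}
```

The assembly versions fuse the two inner loops into one interleaved pass per outer iteration, which is where the dual ADCX/ADOX carry chains come in.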

jon-chuang commented 4 years ago

Actually, for now I have some basic questions.

What are "+r" and "=&r"? I understand "rm", "=r" (output), and the curly braces, but not these.

Further, why are the mulx operands flipped vis-à-vis the Intel documentation?

recmo commented 4 years ago

"+r" means the operand is both an input and an output register. "=&r" means this output register may not overlap with input registers (overlap is normally allowed).

By default Rust uses AT&T syntax, which has different operand order from Intel.

Documentation for Rust's inline assembly can be found here: https://doc.rust-lang.org/1.2.0/book/inline-assembly.html

But it's really a thin layer over LLVM's inline assembly syntax which is documented here: https://llvm.org/docs/LangRef.html#inline-assembler-expressions

The "=&r" syntax is in the LLVM manual, but for some reason I cannot find a source for the "+r" syntax. Inline assembly is pretty badly documented and has some serious limitations. Fortunately, there is a plan to replace it with something much more pleasant in Rust.

GCC at least has a list of all the modifiers (including +, which also seems supported by LLVM):

https://gcc.gnu.org/onlinedocs/gcc/Modifiers.html#Modifiers
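The replacement mentioned above later stabilized as the new `asm!` macro (Rust 1.59). As a rough sketch of how the old constraints map, assuming that macro: `"+r"` becomes `inout(reg)`, `"=&r"` becomes plain `out(reg)` (outputs never share a register with inputs unless you ask for `lateout`), and Intel operand order (destination first) is the default instead of AT&T:

```rust
#[cfg(target_arch = "x86_64")]
fn demo(a: u64, b: u64) -> (u64, u64) {
    use std::arch::asm;
    let mut sum = a; // old "+r": read and written by the asm
    let copy: u64;   // old "=&r": write-only, guaranteed not to alias inputs
    unsafe {
        asm!(
            "add {sum}, {b}",    // Intel order: destination first
            "mov {copy}, {sum}",
            sum = inout(reg) sum,
            b = in(reg) b,       // old plain "r": input only
            copy = out(reg) copy,
        );
    }
    (sum, copy)
}

#[cfg(not(target_arch = "x86_64"))]
fn demo(a: u64, b: u64) -> (u64, u64) {
    // fallback so the sketch runs anywhere
    let s = a.wrapping_add(b);
    (s, s)
}

fn main() {
    println!("{:?}", demo(2, 3));
}
```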

jon-chuang commented 4 years ago

Hi Remco, I am having some issues trying to rewrite the code to be more generic. In particular, I am getting several segfaults. I wonder if you have encountered similar issues.

Here is the assembly code I generate with a `build.rs`-based codegen (build.rs).

assembly function

```rust
macro_rules! generate_asm_mul {
    ($limbs:expr) => {
        fn mul_asm(
            a: [u64; $limbs],
            b: [u64; $limbs],
            modulus: [u64; $limbs],
            inverse: u64
        ) -> [u64; $limbs] {
            let result = match $limbs {
                2 => {
                    const ZERO: u64 = 0;
                    let mut result = MaybeUninit::<[u64; $limbs]>::uninit();
                    unsafe {
                        asm!("
                            movq 0($1), %rdx
                            movq $5, %rcx
                            xorq %rdi, %rdi
                            mulxq 0($2), %r8, %r9
                            mulxq 8($2), %rcx, %rdi
                            adcxq %rcx, %r10
                            adcxq $4, %rdi // %rdi is carry1
                            movq $5, %rdx
                            mulxq %r8, %rdx, %rcx // wrapping_mul
                            mulxq 0($3), %rcx, %rbx
                            adcxq %r9, %rcx // put junk in rax
                            adoxq %rbx, %r8
                            mulxq 8($3), %rcx, %r8
                            adcxq %rcx, %r9
                            adcxq $4, %r8
                            adoxq %rdi, %r8
                            movq 8($1), %rdx
                            mulxq 0($2), %rcx, %rbx
                            adcxq %rcx, %r8
                            adoxq %rbx, %r9
                            mulxq 8($2), %rcx, %rdi
                            adcxq %rcx, %r9
                            adcxq $4, %rdi
                            adoxq $4, %rdi
                            movq $5, %rdx
                            mulxq %r9, %rdx, %rcx // wrapping_mul
                            mulxq 0($3), %rcx, %rbx
                            adcxq %r8, %rcx // put junk in rax
                            adoxq %rbx, %r9
                            mulxq 8($3), %rcx, %r9
                            adcxq %rcx, %r8
                            adcxq $4, %r9
                            adoxq %rdi, %r9
                            movq %r8, 0($0)
                            movq %r9, 8($0)"
                            :
                            : "r"(result.as_mut_ptr()), // $0
                              "r"(&a),                  // $1
                              "r"(&b),                  // $2
                              "r"(&modulus),            // $3
                              "m"(ZERO),                // $4
                              "r"(inverse)              // $5
                            : "rcx", "rbx", "rdx", "rdi", "r8", "r9", "cc", "memory"
                        );
                    }
                    let r = unsafe { result.assume_init() };
                    r
                },
                3 => {
                    const ZERO: u64 = 0;
                    let mut result = MaybeUninit::<[u64; $limbs]>::uninit();
                    unsafe {
                        asm!("
                            movq 0($1), %rdx
                            movq $5, %rcx
                            xorq %rdi, %rdi
                            mulxq 0($2), %r8, %r9
                            mulxq 8($2), %rcx, %r10
                            adcxq %rcx, %r9
                            mulxq 16($2), %rcx, %rdi
                            adcxq %rcx, %r11
                            adcxq $4, %rdi // %rdi is carry1
                            movq $5, %rdx
                            mulxq %r8, %rdx, %rcx // wrapping_mul
                            mulxq 0($3), %rcx, %rbx
                            adcxq %r9, %rcx // put junk in rax
                            adoxq %rbx, %r10
                            mulxq 8($3), %rcx, %rbx
                            adcxq %rcx, %r10
                            adoxq %rbx, %r8
                            mulxq 16($3), %rcx, %r8
                            adcxq %rcx, %r10
                            adcxq $4, %r8
                            adoxq %rdi, %r8
                            movq 8($1), %rdx
                            mulxq 0($2), %rcx, %rbx
                            adcxq %rcx, %r10
                            adoxq %rbx, %r8
                            mulxq 8($2), %rcx, %rbx
                            adcxq %rcx, %r8
                            adoxq %rbx, %r9
                            mulxq 16($2), %rcx, %rdi
                            adcxq %rcx, %r9
                            adcxq $4, %rdi
                            adoxq $4, %rdi
                            movq $5, %rdx
                            mulxq %r9, %rdx, %rcx // wrapping_mul
                            mulxq 0($3), %rcx, %rbx
                            adcxq %r10, %rcx // put junk in rax
                            adoxq %rbx, %r8
                            mulxq 8($3), %rcx, %rbx
                            adcxq %rcx, %r8
                            adoxq %rbx, %r9
                            mulxq 16($3), %rcx, %r9
                            adcxq %rcx, %r8
                            adcxq $4, %r9
                            adoxq %rdi, %r9
                            movq 16($1), %rdx
                            mulxq 0($2), %rcx, %rbx
                            adcxq %rcx, %r8
                            adoxq %rbx, %r9
                            mulxq 8($2), %rcx, %rbx
                            adcxq %rcx, %r9
                            adoxq %rbx, %r10
                            mulxq 16($2), %rcx, %rdi
                            adcxq %rcx, %r10
                            adcxq $4, %rdi
                            adoxq $4, %rdi
                            movq $5, %rdx
                            mulxq %r10, %rdx, %rcx // wrapping_mul
                            mulxq 0($3), %rcx, %rbx
                            adcxq %r8, %rcx // put junk in rax
                            adoxq %rbx, %r9
                            mulxq 8($3), %rcx, %rbx
                            adcxq %rcx, %r9
                            adoxq %rbx, %r10
                            mulxq 16($3), %rcx, %r10
                            adcxq %rcx, %r9
                            adcxq $4, %r10
                            adoxq %rdi, %r10
                            movq %r8, 0($0)
                            movq %r9, 8($0)
                            movq %r10, 16($0)"
                            :
                            : "r"(result.as_mut_ptr()), // $0
                              "r"(&a),                  // $1
                              "r"(&b),                  // $2
                              "r"(&modulus),            // $3
                              "m"(ZERO),                // $4
                              "r"(inverse)              // $5
                            : "rcx", "rbx", "rdx", "rdi", "r8", "r9", "r10", "cc", "memory"
                        );
                    }
                    let r = unsafe { result.assume_init() };
                    r
                },
                4 => {
                    const ZERO: u64 = 0;
                    let mut result = MaybeUninit::<[u64; $limbs]>::uninit();
                    unsafe {
                        asm!("
                            movq 0($1), %rdx
                            movq $5, %rcx
                            xorq %rdi, %rdi
                            mulxq 0($2), %r8, %r9
                            mulxq 8($2), %rcx, %r10
                            adcxq %rcx, %r9
                            mulxq 16($2), %rcx, %r11
                            adcxq %rcx, %r10
                            mulxq 24($2), %rcx, %rdi
                            adcxq %rcx, %r12
                            adcxq $4, %rdi // %rdi is carry1
                            movq $5, %rdx
                            mulxq %r8, %rdx, %rcx // wrapping_mul
                            mulxq 0($3), %rcx, %rbx
                            adcxq %r9, %rcx // put junk in rax
                            adoxq %rbx, %r10
                            mulxq 8($3), %rcx, %rbx
                            adcxq %rcx, %r10
                            adoxq %rbx, %r11
                            mulxq 16($3), %rcx, %rbx
                            adcxq %rcx, %r11
                            adoxq %rbx, %r8
                            mulxq 24($3), %rcx, %r8
                            adcxq %rcx, %r11
                            adcxq $4, %r8
                            adoxq %rdi, %r8
                            movq 8($1), %rdx
                            mulxq 0($2), %rcx, %rbx
                            adcxq %rcx, %r10
                            adoxq %rbx, %r11
                            mulxq 8($2), %rcx, %rbx
                            adcxq %rcx, %r11
                            adoxq %rbx, %r8
                            mulxq 16($2), %rcx, %rbx
                            adcxq %rcx, %r8
                            adoxq %rbx, %r9
                            mulxq 24($2), %rcx, %rdi
                            adcxq %rcx, %r9
                            adcxq $4, %rdi
                            adoxq $4, %rdi
                            movq $5, %rdx
                            mulxq %r9, %rdx, %rcx // wrapping_mul
                            mulxq 0($3), %rcx, %rbx
                            adcxq %r10, %rcx // put junk in rax
                            adoxq %rbx, %r11
                            mulxq 8($3), %rcx, %rbx
                            adcxq %rcx, %r11
                            adoxq %rbx, %r8
                            mulxq 16($3), %rcx, %rbx
                            adcxq %rcx, %r8
                            adoxq %rbx, %r9
                            mulxq 24($3), %rcx, %r9
                            adcxq %rcx, %r8
                            adcxq $4, %r9
                            adoxq %rdi, %r9
                            movq 16($1), %rdx
                            mulxq 0($2), %rcx, %rbx
                            adcxq %rcx, %r11
                            adoxq %rbx, %r8
                            mulxq 8($2), %rcx, %rbx
                            adcxq %rcx, %r8
                            adoxq %rbx, %r9
                            mulxq 16($2), %rcx, %rbx
                            adcxq %rcx, %r9
                            adoxq %rbx, %r10
                            mulxq 24($2), %rcx, %rdi
                            adcxq %rcx, %r10
                            adcxq $4, %rdi
                            adoxq $4, %rdi
                            movq $5, %rdx
                            mulxq %r10, %rdx, %rcx // wrapping_mul
                            mulxq 0($3), %rcx, %rbx
                            adcxq %r11, %rcx // put junk in rax
                            adoxq %rbx, %r8
                            mulxq 8($3), %rcx, %rbx
                            adcxq %rcx, %r8
                            adoxq %rbx, %r9
                            mulxq 16($3), %rcx, %rbx
                            adcxq %rcx, %r9
                            adoxq %rbx, %r10
                            mulxq 24($3), %rcx, %r10
                            adcxq %rcx, %r9
                            adcxq $4, %r10
                            adoxq %rdi, %r10
                            movq 24($1), %rdx
                            mulxq 0($2), %rcx, %rbx
                            adcxq %rcx, %r8
                            adoxq %rbx, %r9
                            mulxq 8($2), %rcx, %rbx
                            adcxq %rcx, %r9
                            adoxq %rbx, %r10
                            mulxq 16($2), %rcx, %rbx
                            adcxq %rcx, %r10
                            adoxq %rbx, %r11
                            mulxq 24($2), %rcx, %rdi
                            adcxq %rcx, %r11
                            adcxq $4, %rdi
                            adoxq $4, %rdi
                            movq $5, %rdx
                            mulxq %r11, %rdx, %rcx // wrapping_mul
                            mulxq 0($3), %rcx, %rbx
                            adcxq %r8, %rcx // put junk in rax
                            adoxq %rbx, %r9
                            mulxq 8($3), %rcx, %rbx
                            adcxq %rcx, %r9
                            adoxq %rbx, %r10
                            mulxq 16($3), %rcx, %rbx
                            adcxq %rcx, %r10
                            adoxq %rbx, %r11
                            mulxq 24($3), %rcx, %r11
                            adcxq %rcx, %r10
                            adcxq $4, %r11
                            adoxq %rdi, %r11
                            movq %r8, 0($0)
                            movq %r9, 8($0)
                            movq %r10, 16($0)
                            movq %r11, 24($0)"
                            :
                            : "r"(result.as_mut_ptr()), // $0
                              "r"(&a),                  // $1
                              "r"(&b),                  // $2
                              "r"(&modulus),            // $3
                              "m"(ZERO),                // $4
                              "r"(inverse)              // $5
                            : "rcx", "rbx", "rdx", "rdi", "r8", "r9", "r10", "r11", "cc", "memory"
                        );
                    }
                    let r = unsafe { result.assume_init() };
                    r
                },
                x => panic!(
                    "asm_mul (no-carry): number of limbs supported is 2 up to 8. You had {}",
                    x
                ),
            };
            result
        }
    };
}
```
jon-chuang commented 4 years ago

By the way, would you happen to know how to get the output of the asm! macro? Do you know if cargo-expand can do this?
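For what it's worth: cargo-expand only expands macros, so it shows the `asm!` invocation itself rather than the generated machine code. One way (a hedged sketch, `mycrate::mul_asm` is a placeholder path) to see what rustc actually emits:

```shell
# Emit the generated assembly for the whole crate;
# the .s files land in target/release/deps/
cargo rustc --release -- --emit asm

# Or, with the third-party cargo-asm subcommand, show a single function:
cargo install cargo-asm
cargo asm mycrate::mul_asm
```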

jon-chuang commented 4 years ago

I would also like to ask if you have any timing/clock-cycle measurements for full_mul_asm and full_mul_asm2. One is implemented in a modular fashion while the other is written as pure assembly. This matters when trying to write generic code (up to 12 limbs, or more, if we include data movement).

According to Jack Fransham (http://troubles.md/the-power-of-compilers/), it is better to leave more things to the compiler. At least, that was true given the state of the asm! macro at the time, which seems truly horrible. However, I don't know how much the asm! macro locks in certain data-movement patterns.
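For comparison, the "leave it to the compiler" version of a schoolbook full multiply can be written entirely with u128 arithmetic; on x86_64, LLVM lowers this to mul/add-with-carry chains on its own. A hedged sketch (names mine, not from the repo):

```rust
/// Schoolbook full multiplication of two N-limb numbers into 2N limbs,
/// little-endian limb order, carries handled via u128 intermediates.
fn full_mul<const N: usize>(a: &[u64; N], b: &[u64; N]) -> Vec<u64> {
    let mut r = vec![0u64; 2 * N];
    for i in 0..N {
        let mut carry = 0u64;
        for j in 0..N {
            // r[i+j] + a[j]*b[i] + carry cannot overflow 128 bits
            let t = (r[i + j] as u128) + (a[j] as u128) * (b[i] as u128) + (carry as u128);
            r[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        r[i + N] = carry;
    }
    r
}

fn main() {
    // (2^128 - 1)^2, i.e. the largest 2-limb square
    println!("{:x?}", full_mul(&[u64::MAX, u64::MAX], &[u64::MAX, u64::MAX]));
}
```

Without ADX the compiler has to serialize on the single carry flag, which is exactly the gap the hand-written adcx/adox interleaving tries to close.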

jon-chuang commented 4 years ago

By the way, is the thing that is "more pleasant in Rust" the std::arch intrinsics?

As you mentioned in a separate issue, it does not seem to be emitting ADX/MULX, and this is an LLVM issue. Or, do you mean something else entirely?

recmo commented 4 years ago

The ADX/MULX opcodes have complicated data dependencies on the flags register which LLVM cannot reason about, and this prevents it from generating optimal code using intrinsics. From the relevant issues it seems that fixing this requires serious architectural changes, so for now inline assembly is the only way to use those opcodes.
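To illustrate the flags point: `adcx` reads and writes only CF, while `adox` uses only OF, so two carry chains can be interleaved without serializing on a single flag. A minimal hedged sketch using the modern `asm!` macro (not code from this repo), with a runtime ADX check and a plain-Rust fallback:

```rust
#[cfg(target_arch = "x86_64")]
fn dual_add(a: u64, b: u64, c: u64, d: u64) -> (u64, u64) {
    use std::arch::asm;
    if !std::is_x86_feature_detected!("adx") {
        return (a.wrapping_add(b), c.wrapping_add(d));
    }
    let mut x = a;
    let mut y = c;
    unsafe {
        asm!(
            "xor {t}, {t}",    // clears both CF and OF
            "adcx {x}, {b}",   // x += b + CF: touches only the CF chain
            "adox {y}, {d}",   // y += d + OF: touches only the OF chain
            x = inout(reg) x,
            y = inout(reg) y,
            b = in(reg) b,
            d = in(reg) d,
            t = out(reg) _,
        );
    }
    (x, y)
}

#[cfg(not(target_arch = "x86_64"))]
fn dual_add(a: u64, b: u64, c: u64, d: u64) -> (u64, u64) {
    (a.wrapping_add(b), c.wrapping_add(d))
}

fn main() {
    println!("{:?}", dual_add(1, 2, 3, 4));
}
```

In a real Montgomery loop the two chains run the product accumulation and the reduction side by side, which is why a plain `adc`-only lowering is slower.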

recmo commented 4 years ago

cargo perf will run a criterion benchmark which is pretty accurate for these small pieces of code. I don't think it currently tests any asm routines, but that can be added easily. I remember them being only slightly better (like 14ns per multiply instead of 20ns). For our Proth primes the difference was even less, which made it less interesting to work on.

recmo commented 4 years ago

The 256-bit case is special because it just barely fits in the available general-purpose registers. Generalizing it to higher numbers of limbs will require some changes in design, but it seems like you have that covered.