Open jon-chuang opened 4 years ago
The assembly routines are currently unused and a work in progress. I hope to get back to them soon. For now, they are resting in master, with the understanding that they are not finished and not up to the same standard as the in-use code.
Most of these functions are different versions of 256-bit multiplication with Montgomery reduction using the ADX instructions. Is there a specific function you have questions about?
Actually, for now I have some basic questions.
What are "+r" and "&=r"? I understand what "rm", "=r" (output) and the curly braces are, but not these.
Further, why are the mulx operands flipped relative to the Intel documentation?
`+r` means it is both an input and output register. `=&r` means that this output register may not overlap with input registers (which is normally allowed).
By default Rust uses AT&T syntax, which has different operand order from Intel.
Documentation for Rust's inline assembly can be found here: https://doc.rust-lang.org/1.2.0/book/inline-assembly.html
But it's really a thin layer over LLVM's inline assembly syntax which is documented here: https://llvm.org/docs/LangRef.html#inline-assembler-expressions
The `=&r` syntax is in the LLVM manual, but for some reason I cannot find a source for the `+r` syntax. Inline assembly is pretty badly documented and has some serious limitations. Fortunately, there is a plan to replace it with something much more pleasant in Rust.
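As a point of comparison, here is a minimal sketch of how these constraints look in the `asm!` syntax that Rust later stabilized (1.59), assuming an x86-64 target. The function names are made up for illustration and are not from OpenZKP. In that syntax `inout(reg)` plays the role of `+r`, `out(reg)` is never allocated to the same register as an input (like `=&r`), and `lateout(reg)` corresponds to a plain `=r`. Intel operand order is the default there, with AT&T available through `options(att_syntax)`.

```rust
use std::arch::asm;

/// a - b in Intel operand order: destination first.
/// `inout(reg)` marks `acc` as both read and written, like "+r".
fn sub_intel(a: u64, b: u64) -> u64 {
    let mut acc = a;
    unsafe {
        asm!(
            "sub {acc}, {rhs}",
            acc = inout(reg) acc,
            rhs = in(reg) b,
            options(pure, nomem, nostack),
        );
    }
    acc
}

/// The same subtraction in AT&T operand order: source first, destination last.
fn sub_att(a: u64, b: u64) -> u64 {
    let mut acc = a;
    unsafe {
        asm!(
            "subq {rhs}, {acc}",
            acc = inout(reg) acc,
            rhs = in(reg) b,
            options(att_syntax, pure, nomem, nostack),
        );
    }
    acc
}

/// 2 * a + b (wrapping). `res` is written before `b` is read, so it must
/// not share a register with an input. `out(reg)` guarantees that, like "=&r".
/// With `lateout(reg)` the allocator could reuse `b`'s register and the
/// result would be wrong.
fn double_plus(a: u64, b: u64) -> u64 {
    let res: u64;
    unsafe {
        asm!(
            "mov {res}, {a}",
            "add {res}, {res}",
            "add {res}, {b}",
            res = out(reg) res,
            a = in(reg) a,
            b = in(reg) b,
            options(pure, nomem, nostack),
        );
    }
    res
}
```

Both subtraction variants should return the same value for the same arguments; only the textual operand order in the template differs.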
GCC at least has a list of all the modifiers (including `+`, which also seems supported by LLVM):
Hi Remco, I am having some issues trying to rewrite the code to be more generic. In particular, I am getting several segfaults. I wonder if you have encountered similar issues.
Here is the assembly code I generate with a `build.rs`-based codegen (build.rs).
By the way, would you happen to know how to get the output of the `asm!` macro? Do you know if cargo-expand can do this?
I would also like to ask if you have any timing/clock cycle measurements for `full_mul_asm2` and `full_mul_asm2`. One is implemented in a modular fashion while the other is written as pure assembly. This is important when we are trying to write generic code (up to 12 limbs, or more, if we include data movement).
According to Jack Fransham (http://troubles.md/the-power-of-compilers/), it is better to leave more things to the compiler. At least, that was true given the state of the `asm!` macro at the time, which seems truly horrible. However, I don't know how much the `asm!` macro locks in certain data movement patterns.
By the way, is the thing that is "more pleasant in Rust" the `std_arch` intrinsics?
As you mentioned in a separate issue, it does not seem to be emitting ADX/MULX, and this is an LLVM issue. Or, do you mean something else entirely?
The ADX/MULX opcodes have complicated data dependencies on the flags register which LLVM cannot reason about, and this prevents it from generating optimal code from intrinsics. From the relevant issues it seems that fixing this requires serious architectural changes, so for now inline assembly seems to be the only way to use those opcodes.
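To make the intrinsics route concrete, here is a small sketch of a carry chain written with the `_addcarry_u64` intrinsic from `std::arch::x86_64`, assuming an x86-64 target. The function `add_256` is a made-up example, not something from this repository. Each call consumes the carry produced by the previous one, which is the flag dependency discussed above. Which instructions the compiler actually emits for it depends on the LLVM version and is not verified here.

```rust
use std::arch::x86_64::_addcarry_u64;

/// Adds two 256-bit numbers stored as little-endian 64-bit limbs.
/// Returns the 256-bit sum and the final carry-out (0 or 1).
#[allow(unused_unsafe)] // the intrinsic is a safe fn on newer toolchains
fn add_256(a: &[u64; 4], b: &[u64; 4]) -> ([u64; 4], u8) {
    let mut r = [0u64; 4];
    let mut carry = 0u8;
    for i in 0..4 {
        // Each step depends on the carry of the previous step.
        carry = unsafe { _addcarry_u64(carry, a[i], b[i], &mut r[i]) };
    }
    (r, carry)
}
```

A multiplication with ADCX/ADOX interleaves two such chains (one through the carry flag, one through the overflow flag), which is the part that is hard to express through intrinsics.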
`cargo perf` will run a criterion benchmark which is pretty accurate for these small pieces of code. I don't think it currently tests any asm routines, but that can be added easily. I remember them being only slightly better (like 14ns per multiply instead of 20ns). For our Proth primes the difference was even less, which made it less interesting to work on.
The 256-bit case is special because it just barely fits in the available general purpose registers. If you want to generalize it to higher numbers of limbs it will require some changes in design, but it seems like you have that covered.
Hi, I am looking to reference your assembly code: https://github.com/0xProject/OpenZKP/blob/b46dfbb13180cc5f4a7ba82cc8c24d3b7bee776a/algebra/u256/src/algorithms/assembly.rs#L5 However, most functions do not have a description of what they are supposed to do, whereas it is helpful that `mul_1_asm` does (`// Computes r[0..5] = a * b[0..4]`). Would it be possible to add some of these comments? Thanks.
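For readers trying to interpret that comment, one plausible reading of `r[0..5] = a * b[0..4]` is "multiply the four-limb number `b` by the single limb `a` and store the five-limb result in `r`". The portable sketch below follows that reading. It is a hypothetical reference (`mul_1_ref` is an invented name), assuming little-endian 64-bit limbs, and not the repository's implementation.

```rust
/// Hypothetical portable reference for "r[0..5] = a * b[0..4]":
/// multiplies the 256-bit number `b` (little-endian limbs) by the
/// single limb `a`, producing a 320-bit result.
fn mul_1_ref(a: u64, b: &[u64; 4]) -> [u64; 5] {
    let mut r = [0u64; 5];
    let mut carry = 0u64;
    for i in 0..4 {
        // 64x64 -> 128-bit product plus the incoming carry.
        // This cannot overflow u128: (2^64 - 1)^2 + (2^64 - 1) < 2^128.
        let wide = (a as u128) * (b[i] as u128) + (carry as u128);
        r[i] = wide as u64; // low 64 bits
        carry = (wide >> 64) as u64; // high 64 bits go to the next limb
    }
    r[4] = carry;
    r
}
```

A reference like this can also serve as an oracle in tests when comparing against an assembly version.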