alanefl opened 5 years ago
Ok again we'll go through this review in stages. Here's the biggest change I'd like to see:
Great idea to pre-allocate all of the intermediate calculations into ClassCtx. What you've done is essentially the first step of writing a custom allocator. I think the code could be cleaned up by going one step further down this path.
The change I propose is to change
struct ClassCtx {
    var1: Mpz,
    var2: Mpz,
    var3: Mpz,
    // ...
    sctx: LinCongruenceCtx,
}
to
struct ClassCtx {
    // N is big enough that there is enough scratch space for the largest class group fn.
    scratch: [Mpz; N],
}
with functions consuming it like this:
pub fn op(&mut self, x: &ClassElem, y: &ClassElem) -> ClassElem {
    // "Take" scratch space and assign variable names.
    // This could possibly be made more concise with tuple-destructuring and/or a macro.
    let g = &self.scratch[0];
    let h = &self.scratch[1];
    let w = &self.scratch[2];
    // ...
    // logic
}
LinCongruenceCtx should be replaced by a &[Mpz; 5] in a similar manner. It may make sense to move solve_linear_congruence into class/ctx.rs to avoid code gymnastics.
The biggest benefit of making this change is a cleaner separation of business logic from memory management. It will also let you shrink the context object and potentially be more cache-efficient, but these effects will be minor at best.
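One way to "take" several scratch names at once without fighting the borrow checker, if the references need to be mutable, is split_at_mut. A minimal sketch (the take3 helper name is illustrative):

// Carve three disjoint &mut names out of one scratch buffer.
// split_at_mut returns non-overlapping halves, so the borrows never alias.
// (Panics if scratch has fewer than three elements.)
fn take3<T>(scratch: &mut [T]) -> (&mut T, &mut T, &mut T) {
    let (a, rest) = scratch.split_at_mut(1);
    let (b, c) = rest.split_at_mut(1);
    (&mut a[0], &mut b[0], &mut c[0])
}

// e.g., inside op: let (g, h, w) = take3(&mut self.scratch);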
Good idea. Agree this makes things neater to read and reason about.
Once you finish that, the next thing will be to look into thread safety for the context object. Currently, as well as after the above change, access to the context object must be locked between threads in order to be thread-safe. A good solution might be to automatically clone the context object if it gets shared between threads, but I'm less sure about how that might be done. See what you can find out.
So about thread safety: AFAIK the thread_local! macro ensures that every thread gets its own new ClassCtx right now. Do you think it's worth introducing locks so that even n threads are operating on the same context? Off the top of my head, that does not seem like it would get us a big performance improvement, since all each thread needs is some memory that's been preallocated once when the thread was instantiated. (@mstraka100 is more knowledgeable about the thread_local! macro -- in any case, we'll take another look to make sure.)
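For reference, this is roughly the shape of that setup (a minimal sketch; the CTX name and with_ctx helper are illustrative, and it assumes ClassCtx: Default):

use std::cell::RefCell;

// Stand-in for the real struct.
#[derive(Default)]
struct ClassCtx { /* scratch: [Mpz; N] */ }

thread_local! {
    // Each thread lazily initializes its own context, so no locks are needed:
    // the scratch memory is never shared between threads.
    static CTX: RefCell<ClassCtx> = RefCell::new(ClassCtx::default());
}

fn with_ctx<R>(f: impl FnOnce(&mut ClassCtx) -> R) -> R {
    CTX.with(|cell| f(&mut *cell.borrow_mut()))
}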
Ah if that's the case then it should be good.
The next step after updating the context will be to move the logic for the group operations back into class/mod.rs, such that the context is only a thread-local static buffer of Mpz elements and contains no logic. This removes one level of nesting from the business logic with little other impact.
So instead of
fn op_(_: &Mpz, x: &ClassElem, y: &ClassElem) -> ClassElem {
with_context!(op, x, y)
}
write
fn op_(_: &Mpz, x: &ClassElem, y: &ClassElem) -> ClassElem {
with_context!(|ctx| {
// logic
})
}
It may be good to rename ctx to scratch or similar at this point. You may also be able to eliminate the ctx.rs file.
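Roughly, the closure-style macro could be as small as this (a sketch; it assumes a thread-local CTX: RefCell<ClassCtx> like the one discussed above):

macro_rules! with_context {
    (|$ctx:ident| $body:expr) => {
        CTX.with(|cell| {
            // Hand the body a &mut to this thread's scratch buffer.
            let $ctx = &mut *cell.borrow_mut();
            $body
        })
    };
}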
Sure, this makes sense as well. It's possible there might be some unintended consequences of doing things this way -- we can discuss if we hit any.
Also, just updated the rsa and class benches to have comparable ops (e.g., do exp with the same exponent).
Here's a summary of benchmark results:
// RSA ops
group_rsa_op_large time: [1.7021 us 1.7119 us 1.7276 us]
group_rsa_exp time: [390.64 us 392.59 us 394.76 us]
group_rsa_inv time: [1.0678 us 1.0807 us 1.0935 us]
group_rsa_square time: [240.01 ns 243.12 ns 246.62 ns]
// Unoptimized class groups
group_class_op time: [8.9037 us 9.1472 us 9.4274 us]
group_class_exp time: [182.04 ms 183.61 ms 185.43 ms]
group_class_inv time: [271.48 ns 273.70 ns 276.14 ns]
group_class_square time: [2.8983 us 2.9119 us 2.9280 us]
group_class_normalize time: [798.14 ns 808.68 ns 818.88 ns]
group_class_reduce time: [1.6044 us 1.6139 us 1.6250 us]
// Class groups w/ Mpz CTX but without NUDULP and FLINT
group_class_op time: [1.7028 us 1.7142 us 1.7280 us]
group_class_exp time: [43.543 ms 45.305 ms 47.568 ms]
group_class_inv time: [422.27 ns 453.23 ns 494.03 ns]
group_class_square time: [763.98 ns 785.25 ns 817.89 ns]
group_class_normalize time: [273.87 ns 286.48 ns 301.48 ns]
group_class_reduce time: [625.57 ns 688.28 ns 763.48 ns]
// Class groups w/ Mpz CTX and NUDULP/FLINT
group_class_op time: [1.7733 us 1.7963 us 1.8203 us]
group_class_exp time: [25.157 ms 25.427 ms 25.733 ms]
group_class_inv time: [400.08 ns 413.91 ns 429.36 ns]
group_class_square time: [747.48 ns 837.03 ns 949.23 ns]
group_class_normalize time: [280.35 ns 291.05 ns 304.79 ns]
group_class_reduce time: [507.16 ns 517.52 ns 528.47 ns]
@mstraka100 nice. It would be interesting to see if any of the optimization techniques you use could apply to the RSA group implementation as well. But we can address that in a separate branch/PR.
Agree on RSA groups; I'll take a look later this week. Comments look good, and moving group operations into mod.rs should work. I like your idea of having the scratch space return tuples, i.e. if N = 5:

struct ClassCtx {
    scratch: [Mpz; 5],
}

fn foo() {
    with_context!(|ctx| {
        let (g, s, e) = ctx.get_mpz_vars(0, 3); // returns 3 elements starting at index 0
        let (x, y) = ctx.get_mpz_vars(4, 2); // throws an out-of-bounds error
    })
}
Ok, I addressed all the comments brought up -- sorry this took some time; I ran into a good number of Rust-related issues before landing on with_ctx and mut_tuple_elems as written here. Keep the feedback coming!
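For anyone following along, this is roughly the shape such a macro can take (a hedged sketch, not necessarily the exact code in the PR; soundness hinges on the caller passing distinct, in-bounds indices):

macro_rules! mut_tuple_elems {
    ($ctx:expr, $($idx:expr),+) => {{
        let base = $ctx.scratch.as_mut_ptr();
        (
            $(
                // SAFETY: indices must be distinct and in bounds, so the
                // resulting &mut references never alias.
                unsafe { &mut *base.add($idx) },
            )+
        )
    }};
}

// e.g., let (g, h, w) = mut_tuple_elems!(self, 0, 1, 2);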
Instead of passing individual integers into mut_tuple_elems!, I think it would be better to pass in ranges, i.e. mut_tuple_elems!(self, 0, 4) instead of mut_tuple_elems!(self, 0, 1, 2, 3, 4).
It would also be good to make a parallel FMpz type, analogous to the Mpz type, for FLINT operations: a wrapper struct whose methods around the FLINT bindings take on the burden of being "unsafe" themselves, instead of having an unsafe block in the squaring operation.
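Something along these lines might work (a sketch: the extern declarations mirror FLINT's C fmpz API, but the FMpz surface shown here is illustrative):

// FLINT represents an fmpz as a signed word that either holds a small
// value inline or references a heap-allocated bignum.
#[allow(non_camel_case_types)]
type fmpz = i64;

extern "C" {
    fn fmpz_init(f: *mut fmpz);
    fn fmpz_clear(f: *mut fmpz);
    fn fmpz_add(res: *mut fmpz, a: *const fmpz, b: *const fmpz);
}

// Safe wrapper: every unsafe FFI call is confined to these methods,
// so call sites like the squaring operation stay free of unsafe blocks.
pub struct FMpz(fmpz);

impl FMpz {
    pub fn new() -> Self {
        let mut raw: fmpz = 0;
        unsafe { fmpz_init(&mut raw) };
        FMpz(raw)
    }

    // self = a + b
    pub fn add_from(&mut self, a: &FMpz, b: &FMpz) {
        unsafe { fmpz_add(&mut self.0, &a.0, &b.0) };
    }
}

impl Drop for FMpz {
    fn drop(&mut self) {
        unsafe { fmpz_clear(&mut self.0) };
    }
}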
First glance looks great otherwise.
@mstraka100 we may be able to hack together a macro to support ranges in mut_tuple_elems!, but it will be a hack. Check out: https://stackoverflow.com/questions/33751796/is-there-a-way-to-count-with-macros. I'm deferring this for now.
Hi all,

This seems to be a pretty exciting improvement! I checked out the corresponding branch and ran the benchmarks, but the class group accumulator add_{} operations seem to actually be a touch slower than with the code in master. Am I missing some contributions? Also, are there current plans to finish this PR, or would this need to be finished up by someone else?

Thanks!
@pgrinaway I'll look into this some more, but as a sanity check, did you compile with the external dependencies (NUDULP and FLINT)?
Regarding this PR:
No one is actively working on this repo at the moment, and I'm just fielding questions and issues as they arise. If people are interested in getting this merged, @alanefl or @mstraka100 would be the best developers to talk to.
Ideally, someone would sign on as a regular maintainer, so please send me a DM if you (or anyone else) is interested in taking on that role!
Thanks for the reply!
> I'll look into this some more, but as a sanity check, did you compile with the external dependencies (NUDULP and FLINT)?
I realized I didn't, so that is likely the problem. However, I can't seem to get FLINT to build -- I am looking for where it might be (I enabled the feature, but that leads to an error that the configure file doesn't exist, so I assume I need to find it manually elsewhere). Is there some extra step I should follow, or is there a place where I can find the source for FLINT?
> No one is actively working on this repo at the moment, and I'm just fielding questions and issues as they arise. If people are interested in getting this merged, @alanefl or @mstraka100 would be the best developers to talk to.
Got it, thanks. We're evaluating the class group stuff now, so I will keep you posted.
Actually, I think I've fixed the FLINT issue. Benchmarking now.
OK, I am seeing about the same speed (~400 ms to add 10 elements) with this branch vs. master in the class group.

EDIT: I do see a 2x speedup on the exponentiation operation by including NUDULP and FLINT.
What is NUDULP? A typo for NUDUPL, or a different algorithm?
This PR is a WIP, but I thought I'd put it up for visibility since it's large. We still need to update the benches so that we get a clean comparison between different optimizations and between class groups and RSA groups. Will notify when that's ready.
This PR is big -- let's start having a discussion about it.
Here are the main changes/additions:
- Split the class.rs file into a file that contains ClassGroup, a file that contains ClassElem, a file that defines the discriminant, and a file that defines and implements ClassCtx.

Optimizations

1. Moved the intermediate Mpz variables out of ClassGroup and into ClassCtx, a thread-local struct of Mpz variables that is allocated only once and then reused throughout all class group operations (bye bye, clones). We implemented mpz.rs as a Rust wrapper around a handful of gmp-mpfr-sys calls for better control over memory allocation (@mstraka100 can comment here). This also means we re-wrote the previous implementation using this interface. The class group modules look like this now: [module listing omitted]
2. Optional NUDULP and FLINT optimizations, enabled with --features nudulp or --features nudulp,flint.
Adding Flint as a Dependency

Getting the additional 2x speedup from optimization 2 requires a user to have gmp and mpfr installed on their system (can be done with brew/apt). It also requires building and binding to the FLINT library. The decision in this PR was to include the entire source code for FLINT 2.5.2 under a new ext/ directory (this PR omits the source code dump for clarity) and build it with cargo via the build.rs file -- in fact, this is what gmp-mpfr-sys does for gmp. Feedback welcome.
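For context, the build.rs can be as simple as shelling out to FLINT's own build system (a hedged sketch; the paths and flags are illustrative, not the exact script in this PR):

// build.rs
use std::env;
use std::process::Command;

fn main() {
    let out_dir = env::var("OUT_DIR").unwrap();

    // Configure the vendored FLINT sources, installing into OUT_DIR.
    let status = Command::new("./configure")
        .current_dir("ext/flint-2.5.2")
        .arg(format!("--prefix={}", out_dir))
        .status()
        .expect("failed to run configure");
    assert!(status.success());

    // Build and install the library.
    let status = Command::new("make")
        .current_dir("ext/flint-2.5.2")
        .arg("install")
        .status()
        .expect("failed to run make");
    assert!(status.success());

    // Tell cargo where to find the freshly built library.
    println!("cargo:rustc-link-search=native={}/lib", out_dir);
    println!("cargo:rustc-link-lib=flint");
}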
Summary of Benchmark Results for Group Ops