cambrian / accumulator

Cryptographic accumulators in Rust.
https://cambrian.github.io/accumulator
MIT License

Class group optimization #35

alanefl opened this PR 5 years ago

alanefl commented 5 years ago

This PR is a WIP, but I thought I'd put it up for visibility since it's large. We still need to update the benches so that we get a clean comparison between different optimizations and between class groups and RSA groups. Will notify when that's ready.


This PR is big -- let's start having a discussion about it.

Here are the main changes/additions:

Optimizations

  1. Mpz context for no-reallocation class group operations (~4-6x speedup). All class group operations are delegated from ClassGroup into a ClassCtx, a thread-local struct of Mpz variables that is allocated once and then reused across all class group operations (bye bye, clones). We implemented mpz.rs as a Rust wrapper around a handful of gmp-mpfr-sys calls for better control over memory allocation (@mstraka100 can comment here). This also means we re-wrote the previous implementation on top of this interface; see the sketch after this list. The classgroup modules look like this now:
[screenshot of the reorganized classgroup module layout omitted]
  2. Fast squaring with NUDULP and FLINT (~2x speedup on top of opt 1). We implemented a fast ClassGroup squaring algorithm (NUDULP) from the literature and used the technique from the top submission to Chia's VDF competition a few weeks back (a single external call to FLINT replaces some steps in NUDULP). Since running these optimizations involves an external dependency with a longer build time and extra installation steps (see below), we make them opt-in with --features nudulp or --features nudulp,flint.
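
As a rough sketch of the no-reallocation pattern in item 1 -- using rug::Integer as a stand-in for the PR's custom Mpz wrapper, with illustrative names and placeholder arithmetic:

use rug::{Assign, Integer};

// Every temporary a group operation needs, allocated once and reused.
struct ClassCtx {
    g: Integer,
    h: Integer,
    w: Integer,
}

impl ClassCtx {
    fn new() -> Self {
        ClassCtx { g: Integer::new(), h: Integer::new(), w: Integer::new() }
    }

    // Placeholder arithmetic: the point is that assign() writes into
    // already-allocated limbs instead of cloning or allocating anew.
    fn op_demo(&mut self, a: &Integer, b: &Integer) -> Integer {
        self.g.assign(a + b);
        self.h.assign(a * b);
        self.w.assign(&self.g + &self.h);
        self.w.clone() // one allocation, for the returned result only
    }
}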

Adding Flint as a Dependency

Getting the additional 2x speedup from optimization 2 requires a user to have gmp and mpfr installed on their system (doable with brew/apt). It also requires building and binding to the FLINT library. The decision in this PR was to vendor the entire source code for flint 2.5.2 under a new ext/ directory (this PR omits the source dump for clarity) and build it with cargo using the build.rs file -- in fact, this is what gmp-mpfr-sys does for gmp. Feedback welcome.
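
A rough sketch of what a build.rs along these lines could do when the flint feature is enabled (directory names and configure flags here are illustrative, not necessarily what this PR ships):

// build.rs (sketch)
use std::env;
use std::process::Command;

fn main() {
    // Cargo exports CARGO_FEATURE_<NAME> for each enabled feature.
    if env::var("CARGO_FEATURE_FLINT").is_err() {
        return;
    }
    let out_dir = env::var("OUT_DIR").unwrap();

    // Configure and build the vendored FLINT sources under ext/.
    let ok = Command::new("./configure")
        .arg(format!("--prefix={}", out_dir))
        .current_dir("ext/flint-2.5.2")
        .status()
        .expect("failed to run FLINT configure")
        .success();
    assert!(ok, "FLINT configure failed");
    let ok = Command::new("make")
        .arg("install")
        .current_dir("ext/flint-2.5.2")
        .status()
        .expect("failed to run make")
        .success();
    assert!(ok, "FLINT build failed");

    // Tell rustc where the built library lives and how to link it.
    println!("cargo:rustc-link-search=native={}/lib", out_dir);
    println!("cargo:rustc-link-lib=flint");
}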

eddiew commented 5 years ago

OK, again we'll go through this review in stages. Here's the biggest change I'd like to see:

Great idea to pre-allocate all of the intermediate calculations into ClassCtx. What you've done is essentially the first step of writing a custom allocator. I think the code could be cleaned up by going one step further down this path.

The change I propose is to change

ClassCtx {
  var1: Mpz,
  var2: Mpz,
  var3: Mpz,
  ...
  sctx: LinCongruenceCtx,
}

to

ClassCtx {
  scratch: [Mpz; N], // N is big enough such that there is enough scratch space for the largest class group fn.
}

with functions consuming it like this:

pub fn op(&mut self, x: &ClassElem, y: &ClassElem) -> ClassElem {
  // "Take" scratch space and assign variable names
  // This could possibly be made more concise with tuple-destructuring and/or a macro
  let g = &self.scratch[0];
  let h = &self.scratch[1];
  let w = &self.scratch[2];
  ...
  // logic
}

LinCongruenceCtx should be replaced by a &[Mpz; 5] in a similar manner. It may make sense to move solve_linear_congruence into class/ctx.rs to avoid code gymnastics.

The biggest benefit from making this change is a cleaner separation of business logic from memory management. It will also allow you to shrink the context object and potentially be more cache-efficient, but these effects will be minor at best.
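
One way to hand out those scratch names without overlapping mutable borrows is a slice pattern, sketched here (the thread later converges on a mut_tuple_elems! macro for the same job):

pub fn op(&mut self, x: &ClassElem, y: &ClassElem) -> ClassElem {
  // Destructure the scratch array into disjoint &mut borrows in one
  // step; the borrow checker rejects taking &mut self.scratch[i]
  // three separate times, but accepts this pattern.
  let [g, h, w, ..] = &mut self.scratch;
  // ... logic using g, h, w ...
}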

alanefl commented 5 years ago

Good idea. Agree this makes things neater to read and reason about.

eddiew commented 5 years ago

Once you finish that, the next thing will be to look into thread safety for the context object. Currently, as well as after the above change, access to the context object must be locked between threads in order to be thread-safe. A good solution might be to automatically clone the context object if it gets shared between threads, but I'm less sure about how that might be done. See what you can find out.

alanefl commented 5 years ago

So about thread safety: AFAIK the thread_local! macro ensures that every thread gets its own new ClassCtx right now. Do you think it's worth introducing locks so that all n threads operate on the same context? Off the top of my head, that doesn't seem like it would get us a big performance improvement, since all each thread needs is some memory that was preallocated once when the thread was instantiated. (@mstraka100 is more knowledgeable about the thread_local! macro -- in any case, we'll take another look to make sure.)
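
For reference, the shape of the pattern being described: each thread lazily constructs its own context on first use, so no locking is needed. (A sketch assuming a RefCell-wrapped context with a ClassCtx::new constructor; the PR's details may differ.)

use std::cell::RefCell;

thread_local! {
    // One ClassCtx per thread, created on first access. Contexts are
    // never shared across threads, so no locks are required.
    static CTX: RefCell<ClassCtx> = RefCell::new(ClassCtx::new());
}

fn with_context<T>(f: impl FnOnce(&mut ClassCtx) -> T) -> T {
    CTX.with(|cell| f(&mut cell.borrow_mut()))
}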

eddiew commented 5 years ago

Ah, if that's the case then it should be good.

The next step after updating the context will be to move the logic for the group operations back into class/mod.rs, such that the context is only a thread-local static buffer of Mpz elements and contains no logic. This removes one level of nesting from the business logic with little other impact.

So instead of

  fn op_(_: &Mpz, x: &ClassElem, y: &ClassElem) -> ClassElem {
    with_context!(op, x, y)
  }

write

  fn op_(_: &Mpz, x: &ClassElem, y: &ClassElem) -> ClassElem {
    with_context!(|ctx| {
     // logic
    })
  }

It may be good to rename ctx to scratch or similar at this point. You may also be able to eliminate the ctx.rs file.
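
As a concrete guess at the shape (not the PR's actual macro), a with_context! that accepts a closure-style argument could expand against a thread-local CTX like the one sketched earlier:

macro_rules! with_context {
    (|$ctx:ident| $body:block) => {
        CTX.with(|cell| {
            let $ctx = &mut *cell.borrow_mut();
            $body
        })
    };
}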

alanefl commented 5 years ago

Sure, this makes sense as well. It's possible there might be some unintended consequences of doing things this way -- we can discuss if we hit any.

Also, just updated the rsa and class benches to have comparable ops (e.g., do exp with the same exponent).

Here's a summary of benchmark results:

// RSA ops
group_rsa_op_large      time:   [1.7021 us 1.7119 us 1.7276 us]                                
group_rsa_exp           time:   [390.64 us 392.59 us 394.76 us]                          
group_rsa_inv           time:   [1.0678 us 1.0807 us 1.0935 us]                           
group_rsa_square        time:   [240.01 ns 243.12 ns 246.62 ns]                             

// Unoptimized class groups
group_class_op          time:   [8.9037 us 9.1472 us 9.4274 us]                            
group_class_exp         time:   [182.04 ms 183.61 ms 185.43 ms]                             
group_class_inv         time:   [271.48 ns 273.70 ns 276.14 ns]    
group_class_square      time:   [2.8983 us 2.9119 us 2.9280 us]                           
group_class_normalize   time:   [798.14 ns 808.68 ns 818.88 ns]                                   
group_class_reduce      time:   [1.6044 us 1.6139 us 1.6250 us]                                                             

// Class groups w/ Mpz CTX but without NUDULP and FLINT
group_class_op          time:   [1.7028 us 1.7142 us 1.7280 us]                            
group_class_exp         time:   [43.543 ms 45.305 ms 47.568 ms]                             
group_class_inv         time:   [422.27 ns 453.23 ns 494.03 ns]     
group_class_square      time:   [763.98 ns 785.25 ns 817.89 ns]                        
group_class_normalize   time:   [273.87 ns 286.48 ns 301.48 ns]                                   
group_class_reduce      time:   [625.57 ns 688.28 ns 763.48 ns]                                                               

// Class groups w/ Mpz CTX and NUDULP/FLINT
group_class_op          time:   [1.7733 us 1.7963 us 1.8203 us]                            
group_class_exp         time:   [25.157 ms 25.427 ms 25.733 ms]                             
group_class_inv         time:   [400.08 ns 413.91 ns 429.36 ns]   
group_class_square      time:   [747.48 ns 837.03 ns 949.23 ns]                             
group_class_normalize   time:   [280.35 ns 291.05 ns 304.79 ns]                                   
group_class_reduce      time:   [507.16 ns 517.52 ns 528.47 ns]                                                            

@mstraka100

eddiew commented 5 years ago

Nice. It would be interesting to see if any of the optimization techniques you used could apply to the rsa group implementation as well. But we can address that in a separate branch/PR.

mstraka100 commented 5 years ago

Agree on rsa groups; I'll take a look later this week. Comments look good, and moving group operations into mod.rs should work. I like your idea of having the scratch space return tuples, e.g. if N = 5:

ClassCtx {
  scratch: [Mpz; 5], 
}

fn foo() {
  with_context!(|ctx| {
    let (g, s, e) = ctx.get_mpz_vars(0, 3); // returns 3 elements starting at index 0
    let (x, y) = ctx.get_mpz_vars(4, 2); // throws an out-of-bounds error: would need indices 4 and 5
  })
}

alanefl commented 5 years ago

OK, I addressed all the comments brought up -- sorry this took some time; I ran into a good number of Rust-related issues before landing on with_ctx and mut_tuple_elems as written here.

Keep the feedback coming
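
(For readers following along: a macro in this spirit can hand out pairwise-disjoint &mut borrows by indexing through a raw pointer. The sketch below shows the general technique, not the PR's exact code, and is sound only when the passed indices are all distinct.)

macro_rules! mut_tuple_elems {
    ($ctx:expr, $($i:expr),+) => {{
        let ptr = $ctx.scratch.as_mut_ptr();
        // SAFETY: sound only if the indices are pairwise distinct, so
        // the resulting &mut references never alias.
        unsafe { ($(&mut *ptr.add($i),)+) }
    }};
}

// Usage: let (g, h, w) = mut_tuple_elems!(self, 0, 1, 2);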

mstraka100 commented 5 years ago

Instead of passing individual integers to mut_tuple_elems!, I think it would be better to pass ranges, i.e. mut_tuple_elems!(self, 0, 4) instead of mut_tuple_elems!(self, 0, 1, 2, 3, 4).

It would also be good to make a parallel FMpz type, analogous to the Mpz type, for the FLINT operations: a wrapper struct whose methods for the FLINT bindings take on the burden of being "unsafe" themselves, instead of our having an unsafe block in the squaring operation.
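
A minimal sketch of such a wrapper, with simplified, illustrative FFI declarations rather than real FLINT bindings -- the point is that unsafe lives in one module and callers get a safe API:

use std::os::raw::c_long;

// FLINT's fmpz is a machine word (simplified; real bindings differ).
type RawFmpz = c_long;

extern "C" {
    fn fmpz_init(f: *mut RawFmpz);
    fn fmpz_clear(f: *mut RawFmpz);
    fn fmpz_mul(res: *mut RawFmpz, a: *const RawFmpz, b: *const RawFmpz);
}

pub struct FMpz(RawFmpz);

impl FMpz {
    pub fn new() -> Self {
        let mut raw: RawFmpz = 0;
        // SAFETY: fmpz_init initializes the word we hand it.
        unsafe { fmpz_init(&mut raw) };
        FMpz(raw)
    }

    // Safe method; the unsafe FFI call is fully encapsulated here.
    pub fn mul_from(&mut self, a: &FMpz, b: &FMpz) {
        unsafe { fmpz_mul(&mut self.0, &a.0, &b.0) };
    }
}

impl Drop for FMpz {
    fn drop(&mut self) {
        // SAFETY: self.0 was initialized in new() and not yet cleared.
        unsafe { fmpz_clear(&mut self.0) };
    }
}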

First glance looks great otherwise.

alanefl commented 5 years ago

@mstraka100 we may be able to hack together a macro that supports ranges in mut_tuple_elems!, but it will be a hack. Check out: https://stackoverflow.com/questions/33751796/is-there-a-way-to-count-with-macros. I'm deferring this for now.

pgrinaway commented 5 years ago

Hi all,

This seems like a pretty exciting improvement! I checked out the corresponding branch and ran the benchmarks, but the class group accumulator add_{} operations actually seem to be a touch slower than with the code in master. Am I missing some contributions? Also, are there current plans to finish this PR, or would this need to be finished up by someone else?

Thanks!

whaatt commented 5 years ago

@pgrinaway I'll look into this some more, but as a sanity check, did you compile with the external dependencies (NUDULP and FLINT)?

Regarding this PR:

No one is actively working on this repo at the moment, and I'm just fielding questions and issues as they arise. If people are interested in getting this merged, @alanefl or @mstraka100 would be the best developers to talk to.

Ideally, someone would sign on as a regular maintainer, so please send me a DM if you (or anyone else) would like to take on that role!

pgrinaway commented 5 years ago

Thanks for the reply!

I'll look into this some more, but as a sanity check, did you compile with the external dependencies (NUDULP and FLINT)?

I realized I didn't, so that is likely the problem. However, I can't seem to get FLINT to build -- I am looking for where it might be (I enabled the feature, but that leads to an error that the configure file doesn't exist, so I assume I need to find it manually elsewhere). Is there some extra step I should follow, or is there a place where I can find the source for FLINT?

No one is actively working on this repo at the moment, and I'm just fielding questions and issues as they arise. If people are interested in getting this merged, @alanefl or @mstraka100 would be the best developers to talk to.

Got it, thanks. We're evaluating the class group stuff now, so I will keep you posted.

pgrinaway commented 5 years ago

Actually, I think I've fixed the FLINT issue. Benchmarking now.

pgrinaway commented 5 years ago

OK, I am seeing about the same speed (~400ms to add 10 elements) with this branch vs. master in the class group.

EDIT: I do see a 2x speedup on the exponentiation operation by including NUDULP and FLINT.

daira commented 4 years ago

What is NUDULP? A typo for NUDUPL, or a different algorithm?