Finalize implementation of the division and remainder routines

AaronKutch commented 6 years ago

I have been working on an algorithm, and I made a version of it for 128 bit integers. On my computer, the benchmarks are showing it to be about 4.5 times faster than what rust currently uses (except for the mid and lo benchmarks which still show improvement). EDIT I now have the algorithm here https://github.com/AaronKutch/specialized-div-rem/blob/master/src/all_all.rs

AaronKutch commented 6 years ago

Also I should note that it should only be faster if the cpu this is run on has 64 bit division capability.

Robbepop commented 6 years ago

Hey, wow this is awesome! I am very time constraint today but will definitely look into this tomorrow and give you feedback about it. :)

Thank you very much and keep up the good work!!

AaronKutch commented 6 years ago

I just found out some more problems with my algorithm. I will fix it and clean it up today

Robbepop commented 6 years ago

Okay tell me when you are done and I will review the cleaned up code. :)

AaronKutch commented 6 years ago

I edited my initial comment with the new code. The rand benchmark unfortunately regressed from about 5x to 2.7x improvement due to needing some extra computation to prevent the off-by-one error, but everything else looks good. I had another regression that I fixed by adding the carrying_mul part and I might look at the generated assembly a bit and improve some things.

Robbepop commented 6 years ago

I have downloaded the code and benchmarked it. Added some minor tests, especially for some edge cases but couldn't find any bugs. So this is the long-division algorithm as stated in the comments, right? Can this implementation of it simply be generalized to work with bit widths greater than 128 bits?

Feedback

Signed division support - I think this can be achieved by a simple wrapper around the unsigned variant - not sure though if I am missing some edge cases here.
Many tests besides the benches.
I like the use of future-to-come APIs such as carrying_mul - very future driven development! :)
Maybe it is a good idea to also make the division_by_const publicly available if it is a decent performance factor and offers some nice use cases - but we could make this happen in later revisions as well.
For performance reasons we could install some fast checks, as a < b => return 0 that should improve performance in some cases for large numbers.
I understand why you are using a macro for the implementation but I think we should take a more generalized way that may not target the best performance on all plattforms but maybe a good-enough performance. I am actually a bit concerned about the largest integer that can still be divided fast on the hardware stuff.

That's what I got so far.

And not to forget: Thank you for the hard work as always! :)

AaronKutch commented 6 years ago

The reason the macro was made was because it can potentially be used for a better general apint implementation. The original general algorithm involves dividing a DoubleDigit by a Digit followed by many Digit multiplies, but the 128 bit division provided by rust was so slow that it was faster to halve all of the parts and use a 64 bit division (at least in many cases). I still need to test if my 128 bit division algorithm is fast enough to go back to using full Digits, but either way everything is sped up a little. Unit tests are not very useful for an algorithm with many branches, so instead I might make a formal proof that all inputs work. Do you think this has a good chance at replacing Rust's current division algorithm? I made it a macro because it should make a faster 64 bit division algorithm for cpus without hardware 64 bit division, and so on. I know that Rust has a way of building for specific cpus (e.g. the actual bit width of usize can change), and it can switch between algorithms depending on that.

Robbepop commented 6 years ago

I understand the reason for the macro in the case you want to file this algorithm for the standard library. I can imagine that you see performance regressions towards the standard lib since your benchmark hardware might be too old to support 64-bit and 128-bit hardware-division and thus it is a soft-division as is yours. You could at least try to show this algorithm to some people that have knowledge about this stuff. I think the best plattform for this would be a dedicated IRC channel. :) Try it!

AaronKutch commented 6 years ago

I improved on it some more but now the function, test, and bench marking code is over 400 lines so I won't post it here. I think I should make a crate for it and some other functions, so I can point at it and others can look at it and use it. What should I call it?

Robbepop commented 6 years ago

An external crate for this might be a good idea if the algorithms are in a very general form and are customizable while remaining to be efficient for general and special purposes. A name could be simply the techical name of the algorithms you implemented, e.g. long_division.

AaronKutch commented 6 years ago

I am almost ready to publish my crate specialized-div-rem. I think I will add a dependency to apint that depends on this crate. We can add a feature that turns this on and use the standard library 128 bit division otherwise.

Robbepop commented 6 years ago

It is nice that you have found a general enough way to implement this feature as a standalone crate. :) I think feature opt-in as you have described would be the preferred option here. :) Thank you again for the hard work!

AaronKutch commented 6 years ago

I have it on crates.io and need to fix some things, but how do I get it onto Github? Do I just create a repository and push to that?

Robbepop commented 6 years ago

Yeah you have to create a new repo at GitHub. GitHub displays very concrete and nice steps how to continue if you already have a repo on your hard disk. :)

AaronKutch commented 6 years ago

Could you glance over everything to check if anything is wrong? https://crates.io/search?q=specialized-div-rem. The only problem I see is very long lines. I can't find what rust recommends for maximum line length. I ran rustfmt but it does not fix the comment lines

AaronKutch commented 6 years ago

I found out that when I add a dependency to the crate, cargo gives an error when trying to compile: error: no matching package named specialized-div-rem found. I think I configured something wrong. Also, do I need to put some kind of licensing thing in the README.md like you did for apint?

Robbepop commented 6 years ago

I have reviewed your crate and have the following feedback for you.

The crate's README should be formatted properly with markdown code segments, especially for the benchmark statistics.
The provided functions are all missing Rust documentation.
Since the crate is a low-level library you should try to keep the interface (quantity of provided APIs) as low as possible. I don't think that #[inline(always)] variants are a huge gain for the library and I am also confused by the _long suffix since all functions have it. In my honest opinion you should also even avoid having u128_div and u128_div_rem and only provide the API that covers more functionality: u128_div_rem as well as u64_div_rem and u32_div_rem.

Another question: What is with i32, i64 and i128? Why is it not supported as well to provide a proper coverage for all primitive types?

So to be precise the proposed API by me is:

fn u128_div_rem(divident: u128, divisor: u128) -> (u128, u128);
fn u64_div_rem(divident: u64, divisor: u64) -> (u64, u64);
fn u32_div_rem(divident: u32, divisor: u32) -> (u32, u32);

All in all it is a really good job you have done here but needs some refinement. ;)

AaronKutch commented 6 years ago

I am finally implementing an efficient division implementation for your crate. Because division algorithms for computing the quotient also give the remainder as an (almost) free side effect, I am adding a function called aarons_algorithm (which is pub(crate)) which makes the quotient and remainder. All of the division and remainder functions become wrappers around this function. I have an #[inline(always)] attribute on just the aarons_algorithm (sometimes the compiler does not inline when it really should be). The compiler will then throw away as much code as it can when it knows that only the quotient or remainder is used, except for a wrapping_udiv_urem_assign function which I added.

When writing the algorithm I discovered that I could avoid allocations in more places than normal if I made the function signature for aarons_algorithm and wrapping_udiv_urem_assign into (lhs: &mut ApInt, rhs: &mut ApInt), and assigned the quotient to lhs and the remainder to rhs. This means that I have to do some cloning when wrapping aarons_algorithm with a function that assigns to only lhs, but the same allocation would still have to happen anyway if I did not.

To allow this, I think I should rename zip_access_data_mut to zip_access_data_mut_self and add a zip_access_data_mut_both.

Finally, do you think it is a good idea to inline(always) most things that are pub(crate) or private, especially the zip_access_data_mut related functions which I think will help with my strategy above?

AaronKutch commented 6 years ago

Just to clarify, I can avoid more allocations by both not needing to initialize an ApInt for the remainder, and because I can use rhs as a holder for intermediate computations before the remainder is assigned to it.

AaronKutch commented 6 years ago

I was almost done with the final part of the algorithm and I realized that it was more efficient to have lhs become the remainder and rhs become the quotient. This is due to the fact that lhs is subtracted from until it becomes the remainder. I did not realize this with the other algorithms and my work on specialized-div-rem because on the simpler cases there was room for shifting around stuff to make the problem mostly disappear. I started out with div before rem because we usually think of "divisor and remainder" instead of "remainder and divisor", but now it is obvious in hindsight to me that it is more natural to use the other. I think it will slightly benefit most of my quotient and remainder algorithms performance wise if I changed from a div_rem signature to a rem_div signature. Do you think it is worth it to change all these functions? EDIT: I figured out that I need both a div_rem and a rem_div function

Robbepop commented 6 years ago

We should hide this performance aspect of your implementation detail under the surface of our interface and should continue to call it div_rem as it is done in other APIs. So internally you could simply perform the swap which should be inlined away by the compiler anyway.

I am sorry that I do not respond so quickly at the moment. But my free time is very constraint at the moment. This will be better starting in July.

Please just continue with your work and it will be merged in the end! :) Thanks a lot for all the work!

AaronKutch commented 6 years ago

Do you have any comments about my comment from 3 days ago?

AaronKutch commented 6 years ago

I did a fresh clone of apint and I noticed a warning that from_hi_lo is not being used (but my future commits will use it). Can I modify that function to be from_lo_hi to make it little endian?

AaronKutch commented 6 years ago

I really want to expose a wrapping_udiv_urem_assign, a wrapping_urem_udiv_assign function (I have 2 specialized strategies for allocations now), and their signed versions. Because both arguments are modified, I don't feel that calling them with the usual way, apint0.wrapping_udiv_urem_assign(apint1), is appropriate and instead of self and other the signature uses lhs and rhs and the way to call the function becomes ApInt::wrapping_udiv_urem_assign(apint0,apint1). I am not adding into_ versions of these functions because it raises a lot of complexity, and just cloning instead makes it obvious where and how many allocations are happening. The only problem I have now is that ZipDataAccessMut and zip_access_data_mut does not supply both arguments as mutable. I think that I should rename zip_access_data_mut to zip_access_data_mut_self, ZipDataAccessMut to ZipDataAccessMutSelf, and add ZipDataAccessMutLhsRhs and zip_access_data_mut_lhs_rhs.

AaronKutch commented 6 years ago

Actually, can I change all the _hi_lo functions to _lo_hi?

AaronKutch commented 6 years ago

Right now I have almost no free time. In three or four weeks I will be able to finish my work on apint in about two weeks.

AaronKutch commented 6 years ago

@Robbepop I haven't heard from you in a month and I haven't received any responses back after I tried to email you.

Robbepop commented 6 years ago

Hey Aaron,

@Robbepop I haven't heard from you in a month and I haven't received any responses back after I tried to email you.

sorry I was very busy and couldn't focus on my open source projects. Also I haven't gotten any e-mail from you as far as I can see. Maybe it went accidentally to the spam or trash folder?

I am also not available next week but will hopefully and finally find time again for my open source projects at August the 18th. :)

I did a fresh clone of apint and I noticed a warning that from_hi_lo is not being used (but my future commits will use it). Can I modify that function to be from_lo_hi to make it little endian?

I am currently not sure about handling of little and big endianess in apint but I think such a decision is good.

Actually, can I change all the _hi_lo functions to _lo_hi?

You can and I think it is saner.

I really want to expose a wrapping_udiv_urem_assign, a wrapping_urem_udiv_assign function (I have 2 specialized strategies for allocations now), and their signed versions.

Naming them wrapping_udivrem_assign is more appropriate I think. But that's just taste.

I think that I should rename zip_access_data_mut to zip_access_data_mut_self, ZipDataAccessMut to ZipDataAccessMutSelf, and add ZipDataAccessMutLhsRhs and zip_access_data_mut_lhs_rhs.

I am more with ZipDataAccessMutBoth. ;)

Sorry for the short answers but as you already know my knowledge and perception about the apint code base is not up to date since I haven't put eyes on it for a longer time. But this will change in about 2 weeks I guess.

AaronKutch commented 6 years ago

I stopped getting comment notifications. I checked my Spam folders and sure enough I have also been missing a bunch of important github activity. I decided a while ago to email you directly but the only email I could find associated with you was robbepop@web.de from your github.io page. I just published a new specialized-div-rem version with some algorithm changes I developed when working on the long division implementation for apint. I will rename the functions according to your suggestions. I am a few days away from finishing my work on apint.

AaronKutch commented 6 years ago

Also I should mention that arithmetic.rs is becoming very long and it might be a good idea to split it up in the future. Maybe once Rust 2018 is stabilized the entire library can be refactored some (note that the mod.rs file will not be needed in Rust 2018).

AaronKutch commented 6 years ago

while writing debug code I found out that println!("{:?}",ApInt::from([0u64,1,0,0]).into_wrapping_sub(&ApInt::from([0u64,1,0,0]))); produces a nonzero answer. I fixed the subtraction function and hopefully I can split up my megacommit into smaller ones before submitting a pull request.

Robbepop commented 6 years ago

robbepop@web.de is my spam email so I normally do not look into it very much. The new specialized-div-rem version sounds promising. :)

Also I should mention that arithmetic.rs is becoming very long and it might be a good idea to split it up in  the future. Maybe once Rust 2018 is stabilized the entire library can be refactored some (note that the mod.rs file will not be needed in Rust 2018).

This sounds like a decent idea. I also do not like files that are getting too big and apint has a lot of really big files - luckily most of the time those files do not contain lots of logic. :)

while writing debug code I found out that println!("{:?}",ApInt::from([0u64,1,0,0]).into_wrapping_sub(&ApInt::from([0u64,1,0,0]))); produces a nonzero answer. I fixed the subtraction function and hopefully I can split up my megacommit into smaller ones before submitting a pull request.

This is terrible (that this bug really existed) and awesome (your fix) at the same time. Thank you very much for that!

AaronKutch commented 6 years ago

The algorithm is finally working. Tomorrow I will try to wrap everything up and push all my progress to my apint fork.

AaronKutch commented 6 years ago

I just pushed my progress to the apint fork for you to check. The test code is not complete (and some commented out printlns I will remove in the future), I will optimize the algorithm some more (such as seeing what impact on performance using smallvec and inlining has), and the signed divisions are not working yet. I will fix any problems you find and fix the commit situation before I submit a pull request.

AaronKutch commented 6 years ago

There will be 12 division functions when I am finished, but I chose to make the signed ones wrappers around the unsigned ones instead of inlining into all of them, so it shouldn't bloat compilation times.

Robbepop commented 6 years ago

What will the 12 division functions be? I am curious! I wish you good luck fixing the remaining signed versions. :) It's true smallvec can have extreme performance implications. I have experienced several in other crates. Nonetheless, I'd say its an expert feature for performance critical execution paths.

AaronKutch commented 6 years ago

My free time is about to become restricted again. I encountered some corner cases involving ApInts with bitwidths of 1, and I realized too late that my rotation function only works when the ApInts have no excess bits. I am mostly confident however that all the other functions work with no edge cases detected by the regular tests and a fuzzing test. I will postpone the rotation function for another release. There is nothing else that I see that should be changed in apints functionality from the outside, except for the one bit ApInt problems. Should I do a pull request with what I have now and you can update dependencies and release 0.3.0?

Robbepop commented 6 years ago

I would like you to wait for a PR until your implementations are correct. But for now I am okay with you doing some very simple fixes for the affected algorithms even though they might negatively impact performance: Plus tests that cover also the 1-bit width cases for example. Later we can easily improve performance with additional changes. But correctness is of most importance.

As always thank you very much for all the work on apint!! :)

AaronKutch commented 6 years ago

After implementing the fuzz tester I have encountered about 10 off-by-one errors and had to rewrite the core division loop twice lol. ApInts with bitwidth 1 haven't been giving me trouble though (earlier I had some bug with the testing code itself).

Robbepop commented 6 years ago

The division implementation seems to be really hard! Good to know that 1-bit width ApInt instances no longer trouble your implementation! Do you write regression unit tests for the failing fuzz tests you encounter? These could be very handy in the future if one is about to rewrite those functions. I hope you work is soon stable enough for usage in the main repo. :)

AaronKutch commented 6 years ago

I just got my algorithm to completely pass the fuzz tester for the first time, and all tests are working. I am now working a little on performance and will hopefully wrap up in the next two weeks. However, I am concerned about the casting::tests group which are marked as ignored when I run cargo test. Why is this to begin with and how do I test these?

Robbepop commented 6 years ago

Sounds awesome to me and I am very much looking forward to your work when everything is done! :)

I have to tell you something very terrible ... there are no tests for the casting operations, yet. There are just stubs that I created once I figured out a way I could test the casting operations in a nice way but I wasn't too happy about it and so it was never done. So yeah ... the entire set of casting operations (besides a few very special ones) are completely untested.

This is bad and I feel bad for it. However, since many operations rely on a working casting operation I think they are indirectly tested at least.

AaronKutch commented 6 years ago

Another thing I realized is that Rust 2018 is closer than we think. What if after I submit the pull request you install the RC1 beta, cargo fix, cargo test --release (the fuzz tester is quite expensive), refactor the file locations of the library to split up arithmetic.rs and mod.rs, review the library while doing so, and then do the little endian switch?

Robbepop commented 6 years ago

I am not sure when is the best time to convert crates to the new 2018 edition. Please let us discuss these changes in another issue. Besides that I think your recommended changes are decent. The arithmetic.rs file is really huge. I have also thought about Digit and DoubleDigit and their usefulness. In the num::bigint create these types are merely type aliases and now I do understand why they chose that design. The Digit and DoubleDigit wrappers are nice but add too much friction in the code in their current design - at least to my understanding. But yeah, please let us discuss those future changes in another issue. ;)

AaronKutch commented 6 years ago

I like the Digit and DoubleDigit wrappers because they allow me to directly control the way certain methods work. If I decide to put more work into apint in the future, it will probably involve making more special-purpose methods for them instead of macro and indexing spaghetti. In addition, if somebody wants to make the underlying types different and forks the repository they can relatively easily modify it. I will check the library for hard coded things and replace them with digit::BITS or other methods like I have my own code. On the other hand, I am not sure about Bit which I think could be replaced by bool.

I just realized that fixing up uint and int is going to be a pain now and in the future. I have a large amount of experience with macros and doing stuff like

macro_rules! impl {
    (
        ...,
        $($add_attr:meta),*;
        $($sub_attr:meta),*
    ) => {
        $(
            #[$add_attr]
        )*
        pub fn $add_name(...) -> ...

that way when calling the macro, I can inject doc = "..." and other things to customize. In cases where signed and unsigned ways of doing things are strictly different, we can also inject token trees and stuff. Maybe we should just bite the bullet instead and Ctrl-C Ctrl-V away.

I strongly suggest that we convert to Rust 2018 very soon - it will probably take until around stabilization that this crate finally gets relatively stable like it was before I found this crate. I already see things that could benefit from Rust 2018 and I think it would be inefficient to make two file refactoring passes instead of just one with Rust 2018 making certain things easier.

AaronKutch commented 6 years ago

Can I implement a function from_vec because I am making a corner case tester and it would make it much easier, or is there something else I could do?

Robbepop commented 6 years ago

There is already a more generic way to create Apint instances from iterators (and thus vectors). https://github.com/Robbepop/apint/blob/master/src/apint/constructors.rs#L162

I also dislike the fact that many parts especially in the int and uint module are Ctrl-C + Ctrl-V'ed but I am always critical about macro explosion in code since many times this leads to unreadable library code in the end that is hard to maintain. As of now the code in those modules is very easy to maintain but a bit unfortunate to modify. So it is a balance we have to take here.

The crate will surely convert to the Rust 2018 edition as soon as possible and it makes sense to undergo those huge modifications during that process - as you have said.

Maybe you are right about the usefulness of the digit wrappers. I am currently just a bit out-of-sync for the code base and have to work myself into it again some time.

AaronKutch commented 6 years ago

In order to use from_iter I have to do ApInt::from_iter(vec![32u64,234,23].iter().map(|x| Digit(*x))).unwrap() and there are two things here that are not public. Making Digit implement From<u64> would involve breaking the ability to change the type of Digit in the case of u128. Maybe we could have a function with a signature of Vec<Into<Digit>> but we would still run into problems with u128 and higher. I think there should exist a public From<Vec<u64>> function that takes a plain Vec<u64>. In fact, I just remembered one of my problems with bigint is the lack of an easy way for outside callers of the library to get and set parts of the ints via vectors of machine words. So, we should also have a to_vec_u64 or `Into<Vec> function.

Edit:

Actually, if the length of the Vec is zero then the function has to panic, so instead a from_vec_u64 function that returns a result should be used instead

AaronKutch commented 6 years ago

I was doing some compilation shenanigans, but I needed to pass a --no-default-features flag and the compiler is giving the error error: cannot find derive macro Serialize in this scope. It should compile with all flag combinations, right? I presume there is supposed to be

#[cfg_attr(serde_derive, Serialize)]
#[cfg_attr(serde_derive, Deserialize)]

which is what I am typing down.

AaronKutch commented 5 years ago

I should mention that I have significantly simplified the code for the function (there were some obvious duplications and rough edges that I missed) but it doesn't seem to affect performance any, so I will wait on doing a pull request for fixing up the function.

Robbepop / apint

Finalize implementation of the division and remainder routines #27