LiarPrincess commented 1 year ago

Please read the #242 Using tests from “Violet - Python VM written in Swift” before.

=== DO NOT MERGE! Discussion only. ===

🐰 Discussion

Those are the all of the basic operations. In the future we may add some more targeted tests for interesting operations (mul/div), but for now this should be enough.

It may also be a good idea to move all of the performance tests to a separate test target. Currently I am using PERFORMANCE_TEST compilation flag.

You can use Violet as a baseline measure, thought Violet was heavily optimized for Int32 range (see documentation). I could easily write something faster, but this is not my use-case.

String parsing

That said, I have to say that I find the String parsing performance quite bad. From my tests Violet is ~30 times faster.

Violet secret:

Instead of using a single BigInt and multiplying it by radix, we will group scalars into words-sized chunks. Then we will raise those chunks to appropriate power and add together.

For example: 1_2345_6789 = (1 * 10^8) + (2345 * 10^4) + (6789 * 10^0)

So, we are doing most of our calculations in fast Word, and then we switch to slow BigInt for a few final operations.

Implemented here. Should I create an issue for this?

Mac

I have 2014 rMBP -> mac 11.7 (Big Sur), Xcode 13.2.1, Intel. I assume that 9 years old machines are not exactly a priority for 🍎. Executing all of those tests take ~30 min (serial execution, DEBUG), and I can't be bothered to re-run them properly since it will throttle after a few minutes anyway…

Linux

Normally it would fail to compile:

PerformanceTests.generated.swift:27:23: error: cannot find type 'XCTMetric' in scope
private let metrics: [XCTMetric] = [XCTClockMetric()] // XCTMemoryMetric()?
                      ^~~~~~~~~

PerformanceTests.generated.swift:28:23: error: cannot find 'XCTMeasureOptions' in scope
private let options = XCTMeasureOptions.default
                      ^~~~~~~~~~~~~~~~~

But I wrote my own thing. Results below.

LiarPrincess commented 1 year ago

I wrote a test script that runs multiple BigInt implementations:

swift_numerics - branch from this PR
Violet - this branch which is a main branch with all swift_numerics tests from all PRs
- optimized for Int32 range. In this tests only 0, -1 and 1 fall into this range, so it should not matter.
Violet XsProMax - this branch. Violet implementation with following changes:
- no small inlined integer - magnitude is always stored on the heap
- no restrictions on the size - isNegative is stored in-line (and not on the heap like in Violet); count and capacity are on the heap because I don't want to stray too much from Violet.
```
struct BigInt {
  struct Header {
    var count: UInt32
    var capacity: UInt32
  }

  var flags: UInt8
  var buffer: ManagedBufferPointer<Header, UInt>
}
```
attaswift - this branch which is a main branch with added performance tests

Platform:

🐧 Ubuntu 22.04.1 LTS 64-bit
Intel Pentium(R) CPU G4600 @ 3.60GHz × 4
Swift 5.7.3 (swift-5.7.3-RELEASE) from Swift.org
Since XCTestCase.measure does not exist on Linux I wrote my own thing.

Important:

numbers up to 6400 bits (100 * UInt64.bitWidth in magnitude)
images, because GitHub does not allow colors; text available here (with relative standard deviation).
inout tests start with copying a given BigInt, so they may be slower than normal operation.
values in parens mean relative performance improvement. For example:

🐧 swift_numerics 🐧 Violet 🐧 attaswift

test_string_fromRadix10 0.2386572415 0.011443098625 (20.9x) 0.025662900375 (9.3x)

Means that Violet is 20.9 times faster than swift_numerics and attaswift is 9.3 times faster.

	🐧 swift_numerics	🐧 Violet	🐧 attaswift
test_string_fromRadix10	0.2386572415	0.011443098625 (20.9x)	0.025662900375 (9.3x)

string

1_string

This test involved String operations on 1203 BigInts.

Parsing from string:

attaswift is ~10x faster
Violet is 190x faster with radix 10 and 345x faster when radix is a power of 2
- Violet has an additional specialization for powers of 2
- implementation available here
already discussed in initial post (see above)

To string:

attaswift is 12.5x faster for radix 10 and 152x faster for radix 16
Violet is 21x faster for radix 10 and >2500x faster when radix is power of 2 (up to 3210x (!) faster for radix 16)
- Violet has an additional specialization for powers of 2.
- implementation available here
in swift_numerics radix 16 takes more time than radix 10! Normally bigger radix -> faster calculation (because we divide by bigger number). This is not normal.
divide-and-conquer algorithm exists, but none of the implementations has it

equal/compare

2_eq_cmp

This test involved ==/< operations on 1_447_209 BigInt pairs.

swift_numerics is the fastest - it just uses Swift.Array compare
attaswift/Violet XsProMax store sign inline, so very often you can get the result without even looking at the heap.
Violet is the slowest, though even a small difference would show here (small difference repeated 1_500_000 times becomes a big difference):
- handling Smi (inlined Int32) has a certain cost
- if we do not fit into Smi then we have to fetch sign from the heap
those operations are extremely fast, we called them ~1.5 million times! In Instruments we see swift_release and swift_retain with >10% each. I think that all of the implementations have sufficient performance.

unary

3_unary

This test involved +-~ operations on 100_203 BigInts.

Plus:

all of the projects use the default implementation from protocol
- if we wrote it by hand then it would be ~1000 slower (tested on Violet)
it is so fast that the relative standard deviation (not shown here) is basically a random number
nothing to improve here

Minus:

attaswift/Violet XsProMax are ~4x faster than swift_numerics/Violet
- in attaswift/Violet XsProMax unary '-' flips the sign and retains magnitude -> no memcpy
```
struct BigInt {
  struct Header {
    var count: UInt32
    var capacity: UInt32
  }

  // Bool and Array in case of 'attaswift'
  var flags: UInt8
  var buffer: ManagedBufferPointer<Header, UInt>
}
```
- Violet stores both sign and magnitude on the heap -> unary '-' has to memcpy and flip bit.
- I wanted to store the sign in the pointer to get the same behavior as attaswift, but I already use the tagged pointer to differentiate between small inlined Int32 and heap allocated storage. I didn't want to complicate it too much.
- swift_numerics uses 2-complement -> memcpy and then a few bit operations.

Invert:

all of the implementations have comparable performance
Violet is 0.653x in int test, but we are talking about 0.00803803s which is basically nothing
Violet XsProMax is 1.31x in big, but again fighting over 0.00922301s does not make a lot of sense
in Violet it is implemented as ~a = negate(a+1). Doing the a+1 by specific code is faster than general purpose add operation.

add, sub

4_add_sub

This test involved +- operations on 162_409 BigInt pairs.

Add:

Violet is 3x faster than swift_numerics
Violet XsProMax is 3.5x faster
attaswift is 10% slower

Sub:

Violet is 3.6x faster than swift_numerics
Violet XsProMax is 4.4x faster than swift_numerics
Violet is written by hand
swift_numerics uses a-b = a+(-b) which adds an additional operation (negation). It also means that - is slower than + which is weird.

mul, div, mod

5_mul_div_mod

This test involved */% operations on 91_809 BigInt pairs.

Mul:

Violet is 1.45x faster than swift_numerics.
school-book method (used by all of the implementations) is a pure ALU + write dependent, there is nothing 'fancy' to make it better.
- I tried the V8 method but it was 2x slower than the current Violet implementation. It may be because the implement more advanced algorithms, so their requirements/goals for the school-book method are a bit different.
- the code generated by the compiler in Violet is 'ok'. You could write something better by hand, but that is not the name of the game.
attaswift implements Karatsuba, but only applies it above directMultiplicationLimit: Int = 1024, so it was never used. Their base implementation 6x slower than Violet.

I am not sure how much time/effort they spend to get the directMultiplicationLimit value. 1024 seems kind of 'round' (because it is a power of 2). In most of the implementations I have seen ~40 words (Word = UInt) which would show in our performance tests.

If I was tasked with finding the correct value then I would look for a 'breaking point' where Karatsuba becomes faster than 'normal'. Similar to what the go lang did. Though attaswift 'normal' implementation is quite slow, so that may have affected their measurements (somehow).
we will probably need more detailed tests once we start implementing more advanced algorithms.

Div/mod:

Violet is 2.4x faster than swift_numerics.
- Knuth 'Algorithm D' with 3 Word quotient approximation
attaswift 4x slower than Violet.
we will probably need more detailed tests once we start implementing more advanced algorithms.

and, or, xor

6_and_or_xor

This test involved &|^ operations on 162_409 BigInt pairs.

Violet is 2x faster than swift_numerics; on paper swift_numerics should have an advantage:
- Violet stores sign + magnitude - the implementation was taken from GMP - it adds xor, add and overflow check to every word.
- swift_numerics stores 2 complement which is the best representation for this operation
small operations (int range) are weirdly slow in swift_numerics. Violet is 3.2x faster. Is there some costly preparation/initialization phase?
attaswift 7x slower than Violet.

shift

7_shift

This test involved shifting 20_203 BigInts by 5 values, in total 101_015 shifts per test.

Left:

Violet is 17x faster than swift_numerics
Violet XsProMax is 20x faster
attaswift is 10x faster
- there is something really weird in test_shiftLeft_big, the inout is fast, but the normal version is much slower. Ultra repeatable, occurs every time.

Right:

Violet is 20x faster than swift_numerics
Violet XsProMax is 23x faster
attaswift is 12x faster
- if I remember correctly attaswift rounding mode is different than Swift stdlib

pi

This test was suggested by Xiaodi Wu (@xwu) in BigInt #120, so we may ask them if we have any questions (maybe?). In theory it looks interesting, because it contains +-*/ operations.

In practice count 5000 has following input distribution (more detailed results here):

Operation	Count	Notes
add	39_280	Inputs of similar size up to 3500 words (`3500*UInt64` in magnitude)
sub	5_000	Mostly inputs of similar size up to 3500 words Some inputs where `rhs` is a single word (`UInt64`)
mul	104_086	`lhs` goes up to 3522 words `rhs` is always a single word (!!!)
div	22_678	Inputs of similar size up to 3500 words
compare (`>`)	16_602	Inputs of similar size up to 3500 words So fast that it does not matter

Anyway, here are the results:

8_pi

Results (looking at the 5000 test):

swift_numerics is slower than python (1.33 vs 0.56)
Violet is faster than Python (0.51 vs 0.56)
Violet XsProMax
- slower than Python (0.88 vs 0.56)
- slower than Violet (0.88 vs 0.51) <- plot twist!

Violet XsProMax being slower than Violet is a surprise because in most of the tests above XsProMax was noticeably faster (5-10%). Though, Violet has an ace up its sleeve: it is really fast for small integers and this test performs 104_086 multiplications where rhs falls into Int32. This was not visible in the tests above because they concentrated on big numbers.

As for the swift_numerics vs attaswift: result is dominated by */. Other operations (namely: +-) are so few/fast that they are not even noticeable.

TLDR; Conclusions

Operation	Possible improvement	Note
`init(String)`	190x for radix 10 345x when radix is a power of 2	Discussed in initial post
`toString`	21.5x for radix 10 >2500x when radix is a power of 2 (up to 3210x)	Radix 16 is slower than radix 10?
Equality	-
Compare	-
Unary `+`	-	Do not write by hand, it is 1000x slower
Unary `-`	4x	Storing `sign` inline and `magnitude` on the heap is much faster (obviously, but now we have numbers)
Unary `~`	-
Binary `+`	3x
Binary `-`	4.4x	Implemented as `a-b = a+(-b)`, writing it by hand may be faster (`Violet` does this)
Binary `*`	1.45x
Binary `/%`	2.4x
Binary `&\\|^`	2x
Left shift	17x
Right shift	20x
pi	1.6x 2.6x if we store small values inline	This test uses a lot of small integers. `*/` dominate `+-`.

LiarPrincess commented 1 year ago

Update 21.2.2023:

added equatable, comparable and π tests
improved Violet performance
updated the posts above with new results

Anyway, from those tests we can see that Violet is the fastest. Under the hood Violet uses ManagedBufferPointer. If swift_numerics decides to go with this route:

be careful when calling [ManagedBufferPointer.isUniqueReference()](https://developer.apple.com/documentation/swift/managedbufferpointer/isuniquereference()) for COW.

Doing this check on every write is expensive and it will NOT be optimized by the compiler. This matters in tight loops (multiplication etc.).

In Violet I went with a token based system where calling BigIntStorage.guaranteeUniqueBufferReference() gives you UniqueBufferToken. All of the mutating operations require this token. This way we do the COW check only once. Bonus points for being checked at the compile time (can't call mutating operation without a token).

go into UnsafeBufferPointer for final operations.

This ensures max performance. This is mostly visible in mul and div. Sometimes allocating a separate buffer for result and then copying it into BigInt may be faster.

For example the inner logic of division has following signature:

private static func inner(
lhsCount: Int,
rhsCount: Int,
lhsWhichBecomesRemainder lhs: UnsafeMutableBufferPointer<Word>,
rhs: UnsafeBufferPointer<Word>,
result: UnsafeMutableBufferPointer<Word>
) {
dot dot dot
}

It also means that you will be using withXXX constructs very often:

lhs.withMutableWordsBuffer(lhsToken) { lhs in
remainder.withMutableWordsBuffer(remainderToken) { remainder in
  Self.inner(
    lhsCount: lhsCount,
    rhsCount: rhsCount,
    lhsWhichBecomesRemainder: remainder,
    rhs: shiftedRhs,
    result: lhs
  )
}
}

Obviously you can quite easily write something that will be faster than Violet. Violet is heavily geared towards small integers (Int32) which closes certain optimization opportunities.

LiarPrincess commented 1 year ago

Update 20.3.2023:

Added Violet XsProMax to the post above
Added detailed pi test input distribution

apple / swift-numerics

[BigInt tests][No merge] 🐰 How NOT to write performance tests #256

🐰 Discussion

String parsing

Mac

Linux

string

equal/compare

unary

add, sub

mul, div, mod

and, or, xor

shift

pi

TLDR; Conclusions