Hi! I may be doing something wrong, but in my test on an M1, an atomic increment is noticeably slower than a simple lock (about 19.7 s versus 16.5 s in the run below). Perhaps ARM doesn't have an atomic increment intrinsic, and it's falling back to a good old lock plus some extra overhead?
import Foundation
import Atomics  // module provided by the swift-atomics package

// `iteration`, `inside`, and `startTime` are assumed to be declared earlier (not shown).

print("Multithreaded simple CPU loop")
print("-----------------------------")
inside = 0
let lock = NSRecursiveLock()
startTime = CFAbsoluteTimeGetCurrent()
// Split the work into 1000 chunks that run concurrently.
DispatchQueue.concurrentPerform(iterations: 1000) { (_) in
    for _ in 1...iteration/1000 {
        let rand_x : Float = Float(drand48())
        let rand_y : Float = Float(drand48())
        let origin_dist : Float = rand_x * rand_x + rand_y * rand_y
        if (origin_dist <= 1) {
            // The point falls inside the unit circle: bump the shared counter under the lock.
            lock.lock()
            inside += 1
            lock.unlock()
        }
    }
}
print("CPU loop PI : \(Double(4 * inside) / Double(iteration))")
print("Time : \(CFAbsoluteTimeGetCurrent() - startTime)\n\n")
//--------------------------
print("Multithreaded atomic CPU loop")
print("-----------------------------")
let atomicinside = ManagedAtomic<Int>(0)
startTime = CFAbsoluteTimeGetCurrent()
DispatchQueue.concurrentPerform(iterations: 1000) { (_) in
    for _ in 1...iteration/1000 {
        let rand_x : Float = Float(drand48())
        let rand_y : Float = Float(drand48())
        let origin_dist : Float = rand_x * rand_x + rand_y * rand_y
        if (origin_dist <= 1) {
            // Same test, but the shared counter is incremented atomically instead of under a lock.
            atomicinside.wrappingIncrement(ordering: .relaxed)
        }
    }
}
print("CPU loop PI : \(Double(4 * atomicinside.load(ordering: .relaxed)) / Double(iteration))")
print("Time : \(CFAbsoluteTimeGetCurrent() - startTime)\n\n")
//--------------------------
Multithreaded simple CPU loop
CPU loop PI : 3.14173064
Time : 16.48445498943329
Multithreaded atomic CPU loop
CPU loop PI : 3.14095848
Time : 19.735566020011902
I know that multithreading something this simple doesn't make much sense (in fact, the single-threaded simple loop is faster than the multithreaded one because of the threading overhead), but I didn't expect the atomic increment to be slower than a simple lock.
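For reference, the single-threaded baseline mentioned above can be sketched roughly as follows. This is an illustrative sketch rather than the exact code behind the timings: the function name `estimatePiSingleThreaded`, the sample count, and the use of `Date` for timing are assumptions made for the example.

```swift
import Foundation

// Hypothetical helper (not from the original test): estimates pi with a
// single-threaded Monte Carlo loop, so the counter needs no lock or atomic.
func estimatePiSingleThreaded(iterations: Int) -> Double {
    var inside = 0
    for _ in 0..<iterations {
        let x = Float(drand48())
        let y = Float(drand48())
        // Count the points that fall inside the unit quarter circle.
        if x * x + y * y <= 1 {
            inside += 1
        }
    }
    return Double(4 * inside) / Double(iterations)
}

let sampleCount = 1_000_000  // assumed value, chosen only for illustration
let start = Date()
let estimate = estimatePiSingleThreaded(iterations: sampleCount)
print("CPU loop PI : \(estimate)")
print("Time : \(Date().timeIntervalSince(start))")
```

Since the counter is a plain local variable here, comparing this timing against the two multithreaded versions gives a rough idea of how much of their cost comes from the shared state rather than from the arithmetic.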
Am I doing something wrong? Is it the ARM architecture? Or is the atomic actually slower even on Intel (I don't have an Intel-based Mac to test on)?
PS: I'm using swift-atomics 0.0.2.