Atomic operations on array elements

chengchingwen commented 5 years ago

The atomic methods for Atomic call the pointer_from_objref and call the function on with llvmcall. If we expose the pointer api, we can do atomic operation on array more easily. Are there any reason to dispatch the atomic operation only on Atomic type?

JeffBezanson commented 4 years ago

Ah, I see, this differs from #29943. That is asking for atomic operations on pointers, while this is about doing atomic operations via pointers to other structures like arrays. I will try to clarify the title.

oschulz commented 4 years ago

This recently came up here: https://discourse.julialang.org/t/parallelizing-a-histogram/41260/9

Would be great to have atomic_add!(A::Array, x, idxs), etc.

tkf commented 4 years ago

Quoting my comment in the discourse thread:

OK, so I uploaded the code I used for trying out atomic increments on array elements: https://github.com/tkf/ParallelIncrements.jl

(Comment added: I just realized that the name of the package is completely misleading since it was about single thread performance. Initially, I was thinking about trying it with multiple threads.)

This package contains a function ref = atomicref(::Array, indices...) which can be used to for atomic operations like Threads.atomic_add!(ref, x).

As you can see in the README, the overhead is 10x in my laptop. It can go up to 100x and possibly more depending on the kind of optimization the compiler can do. It might be possible to get a lower overhead using other memory orderings but I haven't tried it yet.

I'm not a big fan in this direction but I'm interested in whatever beneficial for HPC in Julia. So, I'm just sharing it just in case some people want to try it out (and maybe find some performance bugs I made in the sample code or something).

oschulz commented 4 years ago

Came across this - that's on a GPU, though: https://devblogs.nvidia.com/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/

oschulz commented 4 years ago

A slowdown of 10x may actually be quite acceptable in some cases, compared to the alternative.

Obviously, when you deal with things like small histograms (histograms are just a nice example use case for this), it's most efficient for every thread to histogram separately in 1st level cache, and then merge after. But if the histogram is large (multi-dimensional, many bins), atomics with a 10x overhead (compared to non-atomic writes may still be faster) than maintaining one histogram per thread, especially if the number of threads is large. Just a hunch, of course. I guess it'll come down to memory bandwidths, latencies and caching behavior in the individual use case ... hard to predict.

@tfk, to you think the atomicref comes with some additional overhead, compared to having an atomic_add that directly operates on array and index, or will the compiler optimize that away?

tkf commented 4 years ago

I agree there might be some situations that this can be beneficial. I uploaded the code because I thought it'd be great if people can explore the use of it.

But, let me also note that my benchmark is about single-thread performance. If you are using multiple threads, atomic can hurt even more because it has to invalidate the cache of all CPUs. So, I'd guess 10x is more like a lower bound of the overhead.

atomicref comes with some additional overhead, compared to having an atomic_add that directly operates on array and index

I didn't see anything suspicious about it when I quickly look at LLVM IR.

(BTW, you might want to use @tkf to ping me, to avoid pinging another user.)

oschulz commented 4 years ago

If you are using multiple threads, atomic can hurt even more because it has to invalidate the cache of all CPUs.

Yes, I've been thinking about that - however, if the access pattern is scattered access to a fairly large amount of memory, the threads would, for the most part, not share cache lines, right?

tkf commented 4 years ago

Ah, yes, you might be right.

vtjnash commented 4 years ago

For a histogram-type operation, I'd expect a (histogram-shaped distribution) of cache conflicts, resulting in a large slowdown for each additional thread you add—starting at 10x slower just for going from single-threaded to 1 thread-but-threadsafe-with atomics. Unlike @tkf, I have no numbers though, and I'm just somewhat parroting his comment.

This is actually relevant, because I'm currently thinking about not (yet) having atomics for individual array elements (just too likely to be slow and/or incorrect), so that you should perhaps instead just use a lock over the whole array. https://docs.google.com/document/d/e/2PACX-1vT7Ibthj9WyM8s5bcQbiKsVK6MtvzqmnPFMy-bcjZLlbqv55a0_sTJ99AkbvPIZk3t7MbhZ57NzaIzC/pub

vtjnash commented 4 years ago

(note: the idealized cost of lock is on the order of a couple atomic operations, since that's about all it takes to implement one: it's important to mentally realize that the cost of a lock and an atomic are therefore somewhat nearly the same.)

oschulz commented 4 years ago

Well, you kinda dash my hopes for efficient scattered atomic adds (maybe I can try to do some parallel benchmarking to get us numbers, though). :-)

But I'm very excited about the ideas in your manifest!

tkf commented 4 years ago

I wonder what's the best strategy for implementing histogram-like function with a very large output array using threads. Maybe (1) create a batch of local index list (for the entries to be incremented) in each thread grouped by sub-regions of the output array and then (2) send it to the "writer" tasks that write it to the sub-region of the output array? This way, I guess we can distribute the contention a bit by dividing it by the number of sub-regions (the contention in this strategy would be coming from the queue for communicating with each writer task).

oschulz commented 4 years ago

EDIT: I messed something up here, see below for corrections. Using a global lock is not fast, for this use case.

@vtjnash, you were right, of course: Using a global lock is indeed fastest. Here's a benchmark I ran (using Julia v1.5.0-beta1.0):

https://gist.github.com/oschulz/dce9d2f2104deb2ff42edaa814bb3790

I'm filling a histogram with 201 x 201 x 201 bins with 10^6 (randn(rng), randn(rng), randn(rng)) variates, using a separate Random123.Philox4x RNG for each thread.

I defined

Base.@propagate_inbounds function atomic_addindex!(A::Array{Int64}, v::U, I...) where {U}
    T = Int64
    i = LinearIndices(size(A))[I...]
    v_conv = convert(T, v)
    ptr = pointer(A, i)
    Base.Threads.llvmcall("%ptr = inttoptr i64 %0 to i64*\n%rv = atomicrmw add i64* %ptr, i64 %1 acq_rel\nret i64 %rv\n", T, Tuple{Ptr{T},T}, ptr, v_conv)::T
    A
end

I hope that's correct.

With JULIA_NUM_THREADS=64 (Epyc-2 CPU, 64 physical cores, single socket), I get:

No multi-threading: 205.771 ms
Using atomic_addindex!: 519.649 ms
Using a global SpinLock(): 4.784 ms

So with the global lock, I get a 43x speedup on 64 threads - I wouldn't have expected it to scale that well.

The random number generation itself is about 20 ms using a thread and 630 μs using all 64 threads, so it's only a relatively small fraction of the total time.

When I run with JULIA_NUM_THREADS=1:

No multi-threading: 203.243 ms ms
Using atomic_addindex!: 224.736 ms
Using a global SpinLock(): 195.301 ms

So if there's a bit of work to do (computing the random numbers and hist-bin index), the overhead of atomic_addindex! has little impact, single threaded. But with many threads, even on a large array, atomic access seems to cause way too much cache invalidation/contention.

Yay for global locks! :-)

EDIT: I messed something up here, see below for corrections. Using a global lock is not fast, for this use case.

biochem-fan commented 3 years ago

@oschulz I looked at your code on Gist.

Your fill_randn_mt_atomic! actually uses a global SpinLock, while fill_randn_mt_lock! uses atomic_push!. Is this intentional?

Where is your atomic_addindex! used? Shouldn't your atomic_push! call it?

biochem-fan commented 3 years ago

@oschulz I confirmed that without using atomic_addindex!, the result is not correct.

I added sanity checks https://gist.github.com/biochem-fan/19072418e90dd783227a347efe892860/revisions and repaired function names.

The results with 24 threads on Xeon E5-2643 v2 are:

No multi-threading: 255.538 ms
Atomic add: 38.816 ms
Global spin lock: 578.194 ms

So, atomic add is the fastest.

oschulz commented 3 years ago

Hi @biochem-fan , thanks for catching this. No idea what I did back there, looks like I definitely mixed the function names up and my atomic_addindex! isn't even used in my old version my Gist. Maybe I accidentally used and uploaded a debugging state of the code or so - must have been late ... sorry!

In any case, I just updated my Gist (https://gist.github.com/oschulz/dce9d2f2104deb2ff42edaa814bb3790) to use atomic_addindex! (similar to your change) and to use locking only during the actual memory manipulation (for locked_push!).

Here's some benchmarks on a 128-core machine (2x AMD EPYC 7662 64-core). The locking approach clearly doesn't scale, while atomic_addindex! does scale, though not linearly.

(We also have to take reduced CPU clock boost under heavy multi-threading into account, CPU ran at 3.3 GHz during fill_randn_mt_atomic! with 1 thread, but only at 2.4 GHz average with 64 threads).

Using 1 threads (mean time):

gen_randn_st: 12.353 ms
gen_randn_mt: 12.271 ms
fill_randn_st!: 122.237 ms
fill_randn_mt_atomic!: 124.936 ms
fill_randn_mt_lock!: 147.139 ms

Using 2 threads (mean time):

gen_randn_st: 12.383 ms
gen_randn_mt: 6.404 ms
fill_randn_st!: 123.549 ms
fill_randn_mt_atomic!: 67.433 ms
fill_randn_mt_lock!: 145.086 ms

Using 4 threads (mean time):

gen_randn_st: 12.395 ms
gen_randn_mt: 4.330 ms
fill_randn_st!: 122.585 ms
fill_randn_mt_atomic!: 35.574 ms
fill_randn_mt_lock!: 161.797 ms

Using 8 threads (mean time):

gen_randn_st: 12.326 ms
gen_randn_mt: 3.102 ms
fill_randn_st!: 122.782 ms
fill_randn_mt_atomic!: 23.418 ms
fill_randn_mt_lock!: 300.381 ms

Using 16 threads (mean time):

gen_randn_st: 12.154 ms
gen_randn_mt: 1.648 ms
fill_randn_st!: 123.109 ms
fill_randn_mt_atomic!: 15.220 ms
fill_randn_mt_lock!: 424.703 ms

Using 32 threads (mean time):

gen_randn_st: 12.397 ms
gen_randn_mt: 911.524 μs
fill_randn_st!: 122.468 ms
fill_randn_mt_atomic!: 8.906 ms
fill_randn_mt_lock!: 455.146 ms

Using 64 threads (mean time):

gen_randn_st: 12.153 ms
gen_randn_mt: 647.767 μs
fill_randn_st!: 121.362 ms
fill_randn_mt_atomic!: 5.512 ms
fill_randn_mt_lock!: 503.848 ms

Using 128 threads (mean time):

gen_randn_st: 12.120 ms
gen_randn_mt: 639.225 μs
fill_randn_st!: 149.159 ms
fill_randn_mt_atomic!: 5.153 ms
fill_randn_mt_lock!: 844.348 ms

I don't understand the results for 128 threads, though - even though things are spread over two NUMA domains now, it should be faster than with 64 threads, at least the pure RNG generation should scale well. I did use thread pinning, so each thread should have a separate physical core (no hyper-threading). Can Julia somehow only schedule 64 threads even if started with 128?

biochem-fan commented 3 years ago

@oschulz Thank you very much for confirmation and more test results.

Indeed your result for 128 threads is interesting. Unfortunately I don't have access to machines with so many cores and I cannot try it.

This new results demonstrate that atomic_addindex! is useful at least in some cases. I am new to Julia developer community. How should we move forward? PR #37683 seems stale.

Until atomic_addindex! gets incorporated to Julia, may I use your implementation in my projects?

oschulz commented 3 years ago

Until atomic_addindex! gets incorporated to Julia, may I use your implementation in my projects?

Of course!

I agree that having something like atomic_addindex! in Julia would be beneficial. I think the Julia dev team has some long-term plans regarding parallel access and memory mutability in general, but I don't know any details. But having something simple like atomic_addindex! IMHO couldn't hurt.

Alternatively, this could go into a little package. I'm not sure how fragile this would be, with the LLVM intrinsics, regarding newer Julia (and therefore newer LLVM versions). The advantage would be that other packages could use it without depending on a newer Julia version.

oschulz commented 3 years ago

Maybe a little package (e.g. "AtomicArrays.jl") would be best for now.

jpsamaroo commented 3 years ago

Does #37847 actually add atomic array accesses? I don't see any tests for this functionality, although I could just be overlooking them.

vtjnash commented 3 years ago

No, I eventually noticed that there aren't often algorithms that would benefit from it. Though internally, we use structures that benefit from it: the access pattern used by IdDict (and others), internally already do benefit from something similar. However, I'm already proposing making that particular usage (concurrent-reader-with-writer, single-writer-with-lock) access pattern safe by default, such that it wouldn't require anything different. But since that can now be simulated with an atomic-release fence, it isn't strictly necessary to add either.

tkf commented 3 years ago

I don't think this issue is fully addressed. Let me reopen this issue since I believe it's reasonable to keep tracking it (but, of course, I'm open to be convinced otherwise).

I do agree that supporting atomic operations on array element can be misleading for people who are not familiar with parallel programming. The best approach is to create per-task mutable states and merge them wisely. I'm very serious about this to the degree that I build a GitHub org JuliaFolds and a series of posts (https://juliafolds.github.io/data-parallelism/) around this principle.

Having said that, I believe that atomic operations on arrays are an important ingredient for trickier and interesting parallel and concurrent programs. I think addressing this issue would be important for Julia to keep staying relevant in high-performance technical computing.

there aren't often algorithms that would benefit from it

I don't believe this is the case. There are plenty of practical nonblocking algorithms that require, e.g., CAS on an entry of an array. For example:

Linked Concurrent Ring Queue (LCRQ): Morrison, Adam, and Yehuda Afek. 2013. “Fast Concurrent Queues for X86 Processors.” https://doi.org/10.1145/2442516.2442527.
Open addressing hash table: e.g., Maier, Tobias, Peter Sanders, and Roman Dementiev. 2019. “Concurrent Hash Tables: Fast and General(?)!” https://doi.org/10.1145/3309206.
Ctrie: Prokopec, Aleksandar. 2018. “Cache-Tries: Concurrent Lock-Free Hash Tries with Constant-Time Operations.” https://doi.org/10.1145/3178487.3178498. See also Ctrie - Wikipedia

These data structures would be relevant in high-performance programming access pattern is sparse (= very little contention) and/or resulting object is relatively large compared to available memory.

An extreme example is GPU where there are just so many threads and so moderate sized object (in CPU standard) cannot be allocated per-thread basis. This necessitates the use of atomics when computing, e.g., histogram. Indeed, GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell | NVIDIA Developer Blog does not even have "non-atomic" example and explains why adding atomics to a more level of memory was beneficial.

Edit: I forgot to mention that when it comes to asynchronous updates on a large data structure, Chaotic Relaxation (Chazan and Miranker, 1969) is also an interesting example. There are also relatively recent examples of similar concepts in stochastic gradient descent (Recht et al., 2011) and parallel graph analytics (Lenharth, Nguyen, and Pingali (2016)).

tkf commented 2 years ago

FWIW, just as a stopgap measure until we get this in julia proper, there's https://github.com/JuliaConcurrent/UnsafeAtomics.jl. There's also a wrapper https://github.com/JuliaConcurrent/AtomicArrays.jl but it's not yet registered.

JuliaLang / julia

Atomic operations on array elements #32455