JuliaMolSim / Molly.jl

Molecular simulation in Julia

Parallel v2: Change parallelization approach (`parallel` to `n_threads`) #76

Closed JaydevSR closed 1 year ago

JaydevSR commented 2 years ago

The following changes are made in this PR:

  1. The kwarg parallel is replaced by n_threads throughout the library, which can be used to specify the number of threads (see the usage sketch below).
  2. Use of @floop instead of @threads so that n_threads can be respected.
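
For example, where the old API took a Bool, the new API takes a thread count (a usage sketch; the exact call sites in the library may differ):

# Before this PR: parallelism was switched on or off with a Bool kwarg
simulate!(sys, simulator, 1_000; parallel=true)

# After this PR: the number of threads is given explicitly
simulate!(sys, simulator, 1_000; n_threads=Threads.nthreads())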

Changes yet to be done:

JaydevSR commented 2 years ago

I am almost done with the find_neighbors reduction. It looks something like this:

function find_neighbors_floop(s::System,
                        nf::DistanceNeighborFinder,
                        current_neighbors=nothing,
                        step_n::Integer=0;
                        n_threads::Integer=Threads.nthreads())
    !iszero(step_n % nf.n_steps) && return current_neighbors

    if isnothing(current_neighbors)
        neighbors = NeighborList()
    else
        neighbors = current_neighbors
    end
    empty!(neighbors)

    sqdist_cutoff = nf.dist_cutoff ^ 2

    @floop ThreadedEx(basesize = length(s) ÷ n_threads) for i in 1:length(s)
        ci = s.coords[i]
        nbi = @view nf.nb_matrix[:, i]
        w14i = @view nf.matrix_14[:, i]
        for j in 1:(i - 1)
            r2 = sum(abs2, vector(ci, s.coords[j], s.boundary))
            if r2 <= sqdist_cutoff && nbi[j]
                nn = (i, j, w14i[j])
                @reduce(neighbors_list = append!(Tuple{Int, Int, Bool}[], (nn,)))
            end
        end
    end

    return append!(neighbors, neighbors_list)
end

The only thing I wanted suggestions on is whether I should change the function to:

function find_neighbors_floop(s::System,
                        nf::DistanceNeighborFinder,
                        current_neighbors=nothing,
                        step_n::Integer=0;
                        n_threads::Integer=Threads.nthreads())
    !iszero(step_n % nf.n_steps) && return current_neighbors
    sqdist_cutoff = nf.dist_cutoff ^ 2

    @floop ThreadedEx(basesize = length(s) ÷ n_threads) for i in 1:length(s)
        ci = s.coords[i]
        nbi = @view nf.nb_matrix[:, i]
        w14i = @view nf.matrix_14[:, i]
        for j in 1:(i - 1)
            r2 = sum(abs2, vector(ci, s.coords[j], s.boundary))
            if r2 <= sqdist_cutoff && nbi[j]
                nn = (i, j, w14i[j])
                @reduce(neighbors_list = append!(Tuple{Int, Int, Bool}[], (nn,)))
            end
        end
    end

    return NeighborList(length(neighbors_list), neighbors_list)
end

This does exactly the same job as the previous version, as the allocations are already made by @reduce, and using append! on the current neighbors saves no time.
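
For reference, the @reduce append! pattern used above behaves like this in isolation (a minimal standalone sketch, not Molly code):

using FLoops

# Each basecase builds its own accumulator via the init expression `Int[]`,
# and FLoops concatenates the per-basecase results into `evens`.
@floop ThreadedEx(basesize=250) for i in 1:1000
    if iseven(i)
        @reduce(evens = append!(Int[], (i,)))
    end
end
evens  # the 500 even numbers in 1:1000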

codecov[bot] commented 2 years ago

Codecov Report

Merging #76 (0e0129c) into master (63cb6d7) will decrease coverage by 0.36%. The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master      #76      +/-   ##
==========================================
- Coverage   86.24%   85.87%   -0.37%     
==========================================
  Files          30       30              
  Lines        3127     3116      -11     
==========================================
- Hits         2697     2676      -21     
- Misses        430      440      +10     
| Impacted Files | Coverage Δ |
|----------------|------------|
| src/Molly.jl | 100.00% <ø> (ø) |
| src/types.jl | 88.17% <0.00%> (-8.61%) ↓ |
| src/force.jl | 83.05% <38.46%> (-2.92%) ↓ |
| src/loggers.jl | 95.68% <86.66%> (ø) |
| src/neighbors.jl | 91.89% <100.00%> (+0.48%) ↑ |
| src/simulators.jl | 99.39% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 63cb6d7...0e0129c.

jgreener64 commented 2 years ago

Looks great. Yes that change to find_neighbors is fine. I would prioritise the forces reduction over find_neighbors and TimeCorrelationLogger since that sees the most use.

It would be useful to see if the benchmark results change with these changes. This would involve running benchmark/benchmarks.jl (instructions at the top of the file) and benchmark/protein.jl (which is just a script) on a machine with multiple cores with and without this PR. As long as performance doesn't drop then it's fine. Don't worry about the GPU benchmarks.
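
(For anyone following along, one way to run them is sketched below, assuming benchmark/benchmarks.jl defines a BenchmarkTools SUITE as is conventional, and that Julia is started with multiple threads, e.g. julia -t 8.)

using BenchmarkTools

# Run from the repo root of the branch being benchmarked
include("benchmark/benchmarks.jl")   # assumed to define `SUITE`
results = run(SUITE; verbose=true)

# benchmark/protein.jl is just a script, so it can simply be included
include("benchmark/protein.jl")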

JaydevSR commented 2 years ago

I tried using reductions for the forces function, but it was turning out to be too complicated. So I ended up with this routine in the end:

function forces(s::System{D, false}, neighbors=nothing; n_threads::Integer=Threads.nthreads()) where D
    n_atoms = length(s)

    # allocate only `missing` for unused threads instead of a large array
    T = typeof(ustrip_vec.(s.coords))
    fs_threads = Union{Missing, T}[missing for i in 1:Threads.nthreads()]
    thread_used = falses(Threads.nthreads())

    for inter in values(s.pairwise_inters)
        if inter.nl_only
            if isnothing(neighbors)
                error("An interaction uses the neighbor list but neighbors is nothing")
            end
            @floop ThreadedEx(basesize = neighbors.n ÷ n_threads) for ni in 1:neighbors.n
                FLoops.@init id = Threads.threadid()
                if !thread_used[id]
                    fs_threads[id] = ustrip_vec.(zero(s.coords))
                    thread_used[id] = true
                end
                i, j, w = neighbors.list[ni]
                Molly.force!(fs_threads[id], inter, s, i, j, s.force_units, w)
            end
        else
            @floop ThreadedEx(basesize = n_atoms ÷ n_threads) for i in 1:n_atoms
                FLoops.@init id = Threads.threadid()
                if !thread_used[id]
                    fs_threads[id] = ustrip_vec.(zero(s.coords))
                    thread_used[id] = true
                end
                for j in 1:(i - 1)
                    Molly.force!(fs_threads[id], inter, s, i, j, s.force_units)
                end
            end
        end
    end

    fs = ustrip_vec.(zero(s.coords))
    for id in 1:Threads.nthreads()
        if thread_used[id]
            fs .+= fs_threads[id]
        end
    end

    for inter_list in values(s.specific_inter_lists)
        sparse_vec = specific_force(inter_list, s.coords, s.boundary, s.force_units, n_atoms)
        fs .+= ustrip_vec.(Array(sparse_vec))
    end

    for inter in values(s.general_inters)
        # Force type not checked
        fs .+= ustrip_vec.(forces(inter, s, neighbors))
    end

    return fs * s.force_units
end

I was thinking of using fs = sum(skipmissing(fs_threads)) before, but it was not working, as in some cases sum requires the zero of the type it is summing over, which is not defined in our case. Due to this I had to add the thread_used vector. This saves on memory and time when a smaller number of threads is used.
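
The zero issue can be seen in isolation; presumably it bites when no thread did any work, so skipmissing yields an empty iterator (a hedged illustration, not Molly code):

using StaticArrays

# Hypothetical all-missing case: summing an empty iterator needs zero(eltype),
# which is not defined for the array-of-SVectors element type
fs_threads = Union{Missing, Vector{SVector{3, Float64}}}[missing, missing]
sum(skipmissing(fs_threads))  # errors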

The only other problem that I am seeing right now is that the test https://github.com/JuliaMolSim/Molly.jl/blob/3b5d2b1f4808fab8570f2c9b307dd11dc63c3c33/test/simulation.jl#L365 is failing. After some discussion on the Julia Slack, I think this is because @floop schedules the tasks dynamically, so the order of tasks on threads is not the same for each run. Since floating point arithmetic is not associative, this leads to differences of the order of eps(), which is 2.220446049250313e-16 for me (the value that sum(accel_diff) returns is Quantity{Float64, 𝐋 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm^-1, mol^-1, u^-1), 𝐋 𝐍^-1 𝐓^-2, nothing}}[-3.1712913545201005e-16 kJ nm^-1 mol^-1 u^-1, 2.207435623180487e-16 kJ nm^-1 mol^-1 u^-1, -2.2724877535296173e-16 kJ nm^-1 mol^-1 u^-1], where all elements are of order eps()). If this is not a big issue from a simulation perspective, maybe the test can be rewritten to check elementwise that the values are approximately equal. Something like this:

a1 = accelerations(s, neighbors)
a2 = accelerations(s_nounits, neighbors_nounits)u"kJ * mol^-1 * nm^-1 * u^-1"
@test all(all(a1[i] .≈ a2[i]) for i in eachindex(a1)) == true

@jgreener64 If I can go ahead with this, then I will commit these changes and report the benchmarks.

jgreener64 commented 2 years ago

Yes give it a go, I'd be interested in seeing the benchmarks. Changing that test is fine.

JaydevSR commented 1 year ago

Sorry for the delays, here are the benchmarks:

Without changes


Results

Below is a table of this job's results, obtained by running the benchmarks. The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to index into the BaseBenchmarks suite to retrieve the corresponding benchmarks. The percentages accompanying time and memory values in the below table are noise tolerances. The "true" time/memory value for a given benchmark is expected to fall within this percentage of the reported value. An empty cell means that the value was zero.

| ID | time | GC time | memory | allocations |
|----|------|---------|--------|-------------|
| ["interactions", "Coulomb energy"] | 3.200 ns (5%) | | | |
| ["interactions", "Coulomb force"] | 4.100 ns (5%) | | | |
| ["interactions", "HarmonicBond energy"] | 4.100 ns (5%) | | | |
| ["interactions", "HarmonicBond force"] | 7.100 ns (5%) | | | |
| ["interactions", "LennardJones energy"] | 6.500 ns (5%) | | | |
| ["interactions", "LennardJones force"] | 7.800 ns (5%) | | | |
| ["protein", "in-place NL parallel"] | 1.630 s (5%) | 16.238 ms | 778.94 MiB (1%) | 11413 |
| ["simulation", "in-place NL parallel"] | 68.282 ms (5%) | | 92.34 MiB (1%) | 25602 |
| ["simulation", "in-place NL"] | 63.716 ms (5%) | | 45.62 MiB (1%) | 10907 |
| ["simulation", "in-place f32"] | 389.161 ms (5%) | | 27.46 MiB (1%) | 10879 |
| ["simulation", "in-place parallel"] | 173.878 ms (5%) | | 85.40 MiB (1%) | 24523 |
| ["simulation", "in-place"] | 381.249 ms (5%) | | 45.22 MiB (1%) | 10880 |
| ["simulation", "out-of-place NL"] | 141.180 ms (5%) | | 221.44 MiB (1%) | 17834 |
| ["simulation", "out-of-place f32"] | 1.094 s (5%) | 58.440 ms | 957.59 MiB (1%) | 16344 |
| ["simulation", "out-of-place"] | 1.173 s (5%) | 143.314 ms | 1.85 GiB (1%) | 16345 |
| ["spatial", "vector"] | 4.100 ns (5%) | | | |
| ["spatial", "vector1D"] | 2.300 ns (5%) | | | |

Julia versioninfo

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
      Microsoft Windows [Version 10.0.22622.290]
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz: 
              speed         user         nice          sys         idle          irq
       #1  1800 MHz    8661406            0      5136593    132044078       486000  ticks
       #2  1800 MHz    6440359            0      2141406    137259640        37015  ticks
       #3  1800 MHz   11921468            0      4730437    129189515       124781  ticks
       #4  1800 MHz    8069671            0      2425656    135346078        36500  ticks
       #5  1800 MHz    9257140            0      3903109    132681156        87687  ticks
       #6  1800 MHz    6544859            0      2768109    136528437        45093  ticks
       #7  1800 MHz    8648484            0      3522203    133670718        79859  ticks
       #8  1800 MHz    7197796            0      2910765    135732843        51000  ticks

  Memory: 7.911350250244141 GB (2353.859375 MB free)
  Uptime: 515542.0 sec
  Load Avg:  0.0  0.0  0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

With FLoops.jl


Results

| ID | time | GC time | memory | allocations |
|----|------|---------|--------|-------------|
| ["interactions", "Coulomb energy"] | 3.200 ns (5%) | | | |
| ["interactions", "Coulomb force"] | 4.100 ns (5%) | | | |
| ["interactions", "HarmonicBond energy"] | 4.100 ns (5%) | | | |
| ["interactions", "HarmonicBond force"] | 9.710 ns (5%) | | | |
| ["interactions", "LennardJones energy"] | 6.507 ns (5%) | | | |
| ["interactions", "LennardJones force"] | 7.808 ns (5%) | | | |
| ["protein", "in-place NL parallel"] | 4.455 s (5%) | 36.157 ms | 665.59 MiB (1%) | 13560 |
| ["simulation", "in-place NL parallel"] | 96.624 ms (5%) | | 93.23 MiB (1%) | 42533 |
| ["simulation", "in-place NL"] | 87.767 ms (5%) | | 54.43 MiB (1%) | 16962 |
| ["simulation", "in-place f32"] | 395.863 ms (5%) | | 28.56 MiB (1%) | 16085 |
| ["simulation", "in-place parallel"] | 228.786 ms (5%) | | 74.06 MiB (1%) | 34116 |
| ["simulation", "in-place"] | 406.107 ms (5%) | | 47.25 MiB (1%) | 16086 |
| ["simulation", "out-of-place NL"] | 157.556 ms (5%) | 11.495 ms | 226.76 MiB (1%) | 17831 |
| ["simulation", "out-of-place f32"] | 1.001 s (5%) | 52.517 ms | 957.59 MiB (1%) | 16341 |
| ["simulation", "out-of-place"] | 1.254 s (5%) | 151.738 ms | 1.85 GiB (1%) | 16342 |
| ["spatial", "vector"] | 3.500 ns (5%) | | | |
| ["spatial", "vector_1D"] | 2.600 ns (5%) | | | |

Julia versioninfo

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
      Microsoft Windows [Version 10.0.22622.290]
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz: 
              speed         user         nice          sys         idle          irq
       #1  1800 MHz    8981265            0      5262000    136231359       495390  ticks
       #2  1800 MHz    6658812            0      2184703    141630437        37515  ticks
       #3  1800 MHz   12445343            0      4917578    133111046       127578  ticks
       #4  1800 MHz    8474062            0      2482984    139516906        37093  ticks
       #5  1800 MHz    9686515            0      4026343    136761093        89359  ticks
       #6  1800 MHz    6781875            0      2873421    140818656        46203  ticks
       #7  1800 MHz    9083109            0      3624734    137766093        81187  ticks
       #8  1800 MHz    7482703            0      2994078    139997171        51875  ticks

  Memory: 7.911350250244141 GB (2537.6015625 MB free)
  Uptime: 520174.0 sec
  Load Avg:  0.0  0.0  0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

With an implementation using pure Julia

(I realized later that an implementation using n_threads without FLoops.jl was straightforward. I also benchmarked it, as I didn't find using FLoops.jl to be any better than this.)


Results

| ID | time | GC time | memory | allocations |
|----|------|---------|--------|-------------|
| ["interactions", "Coulomb energy"] | 3.200 ns (5%) | | | |
| ["interactions", "Coulomb force"] | 4.000 ns (5%) | | | |
| ["interactions", "HarmonicBond energy"] | 4.100 ns (5%) | | | |
| ["interactions", "HarmonicBond force"] | 7.200 ns (5%) | | | |
| ["interactions", "LennardJones energy"] | 6.500 ns (5%) | | | |
| ["interactions", "LennardJones force"] | 7.800 ns (5%) | | | |
| ["protein", "in-place NL parallel"] | 1.548 s (5%) | 8.766 ms | 703.06 MiB (1%) | 14551 |
| ["simulation", "in-place NL parallel"] | 74.642 ms (5%) | | 91.15 MiB (1%) | 41563 |
| ["simulation", "in-place NL"] | 113.148 ms (5%) | | 52.41 MiB (1%) | 27856 |
| ["simulation", "in-place f32"] | 461.711 ms (5%) | | 26.74 MiB (1%) | 25532 |
| ["simulation", "in-place parallel"] | 176.175 ms (5%) | | 81.97 MiB (1%) | 32074 |
| ["simulation", "in-place"] | 436.558 ms (5%) | | 42.68 MiB (1%) | 25533 |
| ["simulation", "out-of-place NL"] | 149.989 ms (5%) | | 220.68 MiB (1%) | 22236 |
| ["simulation", "out-of-place f32"] | 1.109 s (5%) | 53.302 ms | 957.71 MiB (1%) | 20545 |
| ["simulation", "out-of-place"] | 1.186 s (5%) | 135.357 ms | 1.85 GiB (1%) | 20546 |
| ["spatial", "vector"] | 4.200 ns (5%) | | | |
| ["spatial", "vector_1D"] | 2.000 ns (5%) | | | |

Julia versioninfo

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
      Microsoft Windows [Version 10.0.22622.290]
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz: 
              speed         user         nice          sys         idle          irq
       #1  1800 MHz   10084656            0      5730796    146417234       525437  ticks
       #2  1800 MHz    7418031            0      2391609    152422375        39875  ticks
       #3  1800 MHz   13796375            0      5415328    143020328       135875  ticks
       #4  1800 MHz    9811546            0      2773312    149647156        39750  ticks
       #5  1800 MHz   10899484            0      4502187    146830343        96078  ticks
       #6  1800 MHz    7570281            0      3204093    151457640        49656  ticks
       #7  1800 MHz   10411234            0      4017750    147803031        85328  ticks
       #8  1800 MHz    8357531            0      3356828    150517656        55265  ticks

  Memory: 7.911350250244141 GB (3223.921875 MB free)
  Uptime: 531932.0 sec
  Load Avg:  0.0  0.0  0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

protein.jl benchmark

Without changes

124.347514 seconds (187.72 k allocations: 8.983 GiB, 1.00% gc time)
 53.451386 seconds (261.16 k allocations: 12.740 GiB, 2.51% gc time)
System with 15954 atoms, box size Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}  [5.676 nm, 5.6627 nm, 6.2963000000000005 nm]

With FLoops.jl

214.031827 seconds (210.31 k allocations: 8.091 GiB, 0.69% gc time, 0.00% compilation time)
 83.212934 seconds (330.99 k allocations: 10.605 GiB, 1.85% gc time)
System with 15954 atoms, boundary CubicBoundary{Quantity{Float64, 𝐋 , Unitful.FreeUnits{(nm,), 𝐋 , nothing}}}(Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}  [5.676 nm, 5.6627 nm, 6.2963000000000005 nm])

With pure Julia

136.278100 seconds (249.48 k allocations: 7.559 GiB, 3.17% gc time)
 53.304665 seconds (320.60 k allocations: 11.312 GiB, 2.14% gc time)
System with 15954 atoms, boundary CubicBoundary{Quantity{Float64, 𝐋 , Unitful.FreeUnits{(nm,), 𝐋 , nothing}}}(Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}[5.676 nm, 5.6627 nm, 6.2963000000000005 nm])

JaydevSR commented 1 year ago

The pure Julia implementation benchmarks seem to be close to the no-changes version for n_threads=Threads.nthreads(), i.e. parallel=true. For the non-parallel case I think it can be improved by using a separate code path for n_threads=1, which I think is currently slower due to the overhead of calling @threads (a sketch of that idea follows the code below). For now it is something like:

function forces(s::System{D, false}, neighbors=nothing; n_threads::Integer=Threads.nthreads()) where D
    n_atoms = length(s)
    fs_chunks = [ustrip_vec.(zero(s.coords)) for i=1:n_threads]

    for inter in values(s.pairwise_inters)
        if inter.nl_only
            if isnothing(neighbors)
                error("An interaction uses the neighbor list but neighbors is nothing")
            end
            basesize = max(1, Int(ceil(neighbors.n / n_threads)))
            chunks = [i:min(i + basesize - 1, neighbors.n) for i in 1:basesize:neighbors.n]
            # @threads needs an indexable collection, hence the collect
            Threads.@threads for (id, rng) in collect(enumerate(chunks))
                for ni in rng
                    i, j, w = neighbors.list[ni]
                    Molly.force!(fs_chunks[id], inter, s, i, j, s.force_units, w)
                end
            end
        else
            basesize = max(1, Int(ceil(n_atoms / n_threads)))
            chunks = [i:min(i + basesize - 1, n_atoms) for i in 1:basesize:n_atoms]
            Threads.@threads for (id, rng) in collect(enumerate(chunks))
                for i in rng
                    for j in 1:(i - 1)
                        Molly.force!(fs_chunks[id], inter, s, i, j, s.force_units)
                    end
                end
            end
        end
    end

    fs = sum(fs_chunks)

    for inter_list in values(s.specific_inter_lists)
        sparse_vec = specific_force(inter_list, s.coords, s.boundary, s.force_units, n_atoms)
        fs .+= ustrip_vec.(Array(sparse_vec))
    end

    for inter in values(s.general_inters)
        # Force type not checked
        fs .+= ustrip_vec.(forces(inter, s, neighbors))
    end

    return fs * s.force_units
end
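
The n_threads == 1 fast path mentioned above could look something like this (a hypothetical sketch, not part of the PR), placed before the chunked loops:

# Hypothetical serial fast path: skip chunking and @threads when one thread is requested
if n_threads == 1
    fs = ustrip_vec.(zero(s.coords))
    for inter in values(s.pairwise_inters)
        if inter.nl_only
            isnothing(neighbors) && error("An interaction uses the neighbor list but neighbors is nothing")
            for ni in 1:neighbors.n
                i, j, w = neighbors.list[ni]
                Molly.force!(fs, inter, s, i, j, s.force_units, w)
            end
        else
            for i in 1:n_atoms, j in 1:(i - 1)
                Molly.force!(fs, inter, s, i, j, s.force_units)
            end
        end
    end
    # specific and general interactions would then be added as in the threaded path
end
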
jgreener64 commented 1 year ago

Thanks for the benchmarks. If the implementation without changes is performing better than or on par with the others, then it might just be best to merge this PR as-is and leave optimising the force summation to another time? Then you can get on with the REMD stuff using whatever approach is appropriate there, FLoops.jl or not.

JaydevSR commented 1 year ago

The find_neighbors reduction still seems to perform much better than both the current version in dev Molly and the implementation without @floop, according to some benchmarks of mine (possibly because it is a natural reduction that FLoops.jl is able to optimize). So maybe it is better to keep it that way?

Benchmarks:

Using a system with 1000 atoms and DistanceNeighborFinder

  1. With FLoops:
     a. on 1 thread: 14.921 ms (17 allocations: 4.06 MiB)
     b. on 8 threads: 6.473 ms (147 allocations: 16.22 MiB)
  2. Without FLoops:
     a. on 1 thread: 15.677 ms (60 allocations: 7.12 MiB)
     b. on 8 threads: 11.547 ms (122 allocations: 12.16 MiB)
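
These timings could be gathered with something like the following (a sketch, assuming a 1000-atom System sys and its DistanceNeighborFinder nf are already set up, run once with julia -t 1 and once with julia -t 8):

using BenchmarkTools

# Benchmark the FLoops-based version shown earlier in this thread
@btime find_neighbors_floop($sys, $nf; n_threads=Threads.nthreads())
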
jgreener64 commented 1 year ago

Yes use FLoops.jl for find_neighbors then, that's a good speedup. But leave the force summation as it was before this PR, given the improvement was unclear?

jgreener64 commented 1 year ago

Let me know if you are making any further changes, otherwise I will merge.

JaydevSR commented 1 year ago

No, I am not making any more changes, you can merge this. Maybe I will try to revisit the forces function in the future.

jgreener64 commented 1 year ago

Okay I just understood why you were pushing for the forces.jl change - because currently any case of n_threads greater than 1 will use all threads. I was under the impression that currently it could use n_threads and the change was just changing the batching, but that's clearly not the case. That change is hence essential to control n_threads for REMD.

Could you add the forces.jl change back and then I'll merge, thanks.

JaydevSR commented 1 year ago

Done. I also removed using Base.Threads, as I was explicitly using the Threads namespace everywhere, so there is no need for it.

jgreener64 commented 1 year ago

🎉