I am almost done with the `find_neighbors` reduction. It looks something like this:
```julia
function find_neighbors_floop(s::System,
                              nf::DistanceNeighborFinder,
                              current_neighbors=nothing,
                              step_n::Integer=0;
                              n_threads::Integer=Threads.nthreads())
    !iszero(step_n % nf.n_steps) && return current_neighbors

    if isnothing(current_neighbors)
        neighbors = NeighborList()
    else
        neighbors = current_neighbors
    end
    empty!(neighbors)

    sqdist_cutoff = nf.dist_cutoff ^ 2

    @floop ThreadedEx(basesize = length(s) ÷ n_threads) for i in 1:length(s)
        ci = s.coords[i]
        nbi = @view nf.nb_matrix[:, i]
        w14i = @view nf.matrix_14[:, i]
        for j in 1:(i - 1)
            r2 = sum(abs2, vector(ci, s.coords[j], s.boundary))
            if r2 <= sqdist_cutoff && nbi[j]
                nn = (i, j, w14i[j])
                # Each task appends to its own list; FLoops combines the partial lists
                @reduce(neighbors_list = append!(Tuple{Int, Int, Bool}[], (nn,)))
            end
        end
    end

    return append!(neighbors, neighbors_list)
end
```
The only thing that I wanted suggestions on is whether I should change the function to:
```julia
function find_neighbors_floop(s::System,
                              nf::DistanceNeighborFinder,
                              current_neighbors=nothing,
                              step_n::Integer=0;
                              n_threads::Integer=Threads.nthreads())
    !iszero(step_n % nf.n_steps) && return current_neighbors

    sqdist_cutoff = nf.dist_cutoff ^ 2

    @floop ThreadedEx(basesize = length(s) ÷ n_threads) for i in 1:length(s)
        ci = s.coords[i]
        nbi = @view nf.nb_matrix[:, i]
        w14i = @view nf.matrix_14[:, i]
        for j in 1:(i - 1)
            r2 = sum(abs2, vector(ci, s.coords[j], s.boundary))
            if r2 <= sqdist_cutoff && nbi[j]
                nn = (i, j, w14i[j])
                @reduce(neighbors_list = append!(Tuple{Int, Int, Bool}[], (nn,)))
            end
        end
    end

    # Wrap the reduced list directly instead of reusing current_neighbors
    return NeighborList(length(neighbors_list), neighbors_list)
end
```
This does exactly the same job as the previous version, since the allocations are already made by `@reduce`, and using `append!` on the current neighbor list saves no time.
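For context, here is a minimal standalone sketch of the `@reduce`/`append!` pattern used above (the names are illustrative; it assumes only that FLoops.jl is loaded):

```julia
using FLoops

# Collect even numbers in parallel; each basecase appends to its own
# Int[] and FLoops combines the partial results with append!
@floop ThreadedEx() for i in 1:10
    if iseven(i)
        @reduce(evens = append!(Int[], (i,)))
    end
end
evens  # [2, 4, 6, 8, 10]
```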
Merging #76 (0e0129c) into master (63cb6d7) will decrease coverage by 0.36%. The diff coverage is 85.71%.
```diff
@@            Coverage Diff             @@
##           master      #76      +/-   ##
==========================================
- Coverage   86.24%   85.87%   -0.37%
==========================================
  Files          30       30
  Lines        3127     3116      -11
==========================================
- Hits         2697     2676      -21
- Misses        430      440      +10
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/Molly.jl | 100.00% <ø> (ø) | |
| src/types.jl | 88.17% <0.00%> (-8.61%) | :arrow_down: |
| src/force.jl | 83.05% <38.46%> (-2.92%) | :arrow_down: |
| src/loggers.jl | 95.68% <86.66%> (ø) | |
| src/neighbors.jl | 91.89% <100.00%> (+0.48%) | :arrow_up: |
| src/simulators.jl | 99.39% <100.00%> (ø) | |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Last update 63cb6d7...0e0129c.
Looks great. Yes that change to `find_neighbors` is fine. I would prioritise the `forces` reduction over `find_neighbors` and `TimeCorrelationLogger` since that sees the most use.
It would be useful to see if the benchmark results change with these changes. This would involve running `benchmark/benchmarks.jl` (instructions at the top of the file) and `benchmark/protein.jl` (which is just a script) on a machine with multiple cores, with and without this PR. As long as performance doesn't drop then it's fine. Don't worry about the GPU benchmarks.
I tried using reductions for the `forces` function, but it was turning out to be too complicated. So I ended up with this routine in the end:
```julia
function forces(s::System{D, false}, neighbors=nothing;
                n_threads::Integer=Threads.nthreads()) where D
    n_atoms = length(s)

    # Allocate only `missing` for unused threads instead of a large array
    T = typeof(ustrip_vec.(s.coords))
    fs_threads = Union{Missing, T}[missing for i in 1:Threads.nthreads()]
    thread_used = falses(Threads.nthreads())

    for inter in values(s.pairwise_inters)
        if inter.nl_only
            if isnothing(neighbors)
                error("An interaction uses the neighbor list but neighbors is nothing")
            end
            @floop ThreadedEx(basesize = neighbors.n ÷ n_threads) for ni in 1:neighbors.n
                FLoops.@init id = Threads.threadid()
                # Lazily allocate this thread's force buffer on first use
                if !thread_used[id]
                    fs_threads[id] = ustrip_vec.(zero(s.coords))
                    thread_used[id] = true
                end
                i, j, w = neighbors.list[ni]
                Molly.force!(fs_threads[id], inter, s, i, j, s.force_units, w)
            end
        else
            @floop ThreadedEx(basesize = n_atoms ÷ n_threads) for i in 1:n_atoms
                FLoops.@init id = Threads.threadid()
                if !thread_used[id]
                    fs_threads[id] = ustrip_vec.(zero(s.coords))
                    thread_used[id] = true
                end
                for j in 1:(i - 1)
                    Molly.force!(fs_threads[id], inter, s, i, j, s.force_units)
                end
            end
        end
    end

    # Sum only the per-thread buffers that were actually allocated
    fs = ustrip_vec.(zero(s.coords))
    for id in 1:Threads.nthreads()
        if thread_used[id]
            fs .+= fs_threads[id]
        end
    end

    for inter_list in values(s.specific_inter_lists)
        sparse_vec = specific_force(inter_list, s.coords, s.boundary, s.force_units, n_atoms)
        fs .+= ustrip_vec.(Array(sparse_vec))
    end

    for inter in values(s.general_inters)
        # Force type not checked
        fs .+= ustrip_vec.(forces(inter, s, neighbors))
    end

    return fs * s.force_units
end
```
I was thinking of using `fs = sum(skipmissing(fs_threads))` before, but it was not working: in some cases `sum` requires the `zero` of the type it is summing over, which is not defined in our case. Due to this I had to add the `thread_used` vector. This saves memory and time when a smaller number of threads is used.
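As an aside, a minimal sketch of that failure mode (`buffers` is an illustrative stand-in for `fs_threads`): when every element is `missing`, `sum` needs `zero` of the element type to produce an empty result, and no such method exists for the vector-of-vectors force type.

```julia
# Stand-in for per-thread force buffers where no thread was used
buffers = Union{Missing, Vector{Float64}}[missing, missing]

sum(skipmissing(buffers))
# ERROR: MethodError: no method matching zero(::Type{Vector{Float64}})
```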
The only other problem that I am seeing right now is that the test https://github.com/JuliaMolSim/Molly.jl/blob/3b5d2b1f4808fab8570f2c9b307dd11dc63c3c33/test/simulation.jl#L365 is failing. After some discussion on the Julia Slack, I think this is because `@floop` dynamically schedules the tasks, so the order of tasks on threads is not the same for each run. Since floating point arithmetic is not associative, this leads to differences of the order of `eps()`, which is `2.220446049250313e-16` for me (the value that `sum(accel_diff)` returns is `Quantity{Float64, 𝐋 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm^-1, mol^-1, u^-1), 𝐋 𝐍^-1 𝐓^-2, nothing}}[-3.1712913545201005e-16 kJ nm^-1 mol^-1 u^-1, 2.207435623180487e-16 kJ nm^-1 mol^-1 u^-1, -2.2724877535296173e-16 kJ nm^-1 mol^-1 u^-1]`, where all elements are of order `eps()`). Maybe the test can be rewritten to check elementwise that the values are approximately equal, if this is not a very big issue from a simulation perspective. Something like this:
```julia
a1 = accelerations(s, neighbors)
a2 = accelerations(s_nounits, neighbors_nounits)u"kJ * mol^-1 * nm^-1 * u^-1"
@test all(all(a1[i] .≈ a2[i]) for i in eachindex(a1))
```
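For reference, a minimal illustration of the non-associativity that causes these run-to-run differences:

```julia
a, b, c = 0.1, 0.2, 0.3

(a + b) + c  # 0.6000000000000001
a + (b + c)  # 0.6

# The two orderings differ by one ulp, on the order of eps()
(a + b) + c == a + (b + c)  # false
```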
@jgreener64 If I can go ahead with this then I will commit these changes and report the benchmarks.
Yes give it a go, I'd be interested in seeing the benchmarks. Changing that test is fine.
Sorry for the delays, here are the benchmarks:
JULIA_NUM_THREADS => 8
Below is a table of this job's results, obtained by running the benchmarks. The values listed in the `ID` column have the structure `[parent_group, child_group, ..., key]`, and can be used to index into the BaseBenchmarks suite to retrieve the corresponding benchmarks. The percentages accompanying time and memory values in the table below are noise tolerances. The "true" time/memory value for a given benchmark is expected to fall within this percentage of the reported value. An empty cell means that the value was zero.
| ID | time | GC time | memory | allocations |
|---|---|---|---|---|
| `["interactions", "Coulomb energy"]` | 3.200 ns (5%) | | | |
| `["interactions", "Coulomb force"]` | 4.100 ns (5%) | | | |
| `["interactions", "HarmonicBond energy"]` | 4.100 ns (5%) | | | |
| `["interactions", "HarmonicBond force"]` | 7.100 ns (5%) | | | |
| `["interactions", "LennardJones energy"]` | 6.500 ns (5%) | | | |
| `["interactions", "LennardJones force"]` | 7.800 ns (5%) | | | |
| `["protein", "in-place NL parallel"]` | 1.630 s (5%) | 16.238 ms | 778.94 MiB (1%) | 11413 |
| `["simulation", "in-place NL parallel"]` | 68.282 ms (5%) | | 92.34 MiB (1%) | 25602 |
| `["simulation", "in-place NL"]` | 63.716 ms (5%) | | 45.62 MiB (1%) | 10907 |
| `["simulation", "in-place f32"]` | 389.161 ms (5%) | | 27.46 MiB (1%) | 10879 |
| `["simulation", "in-place parallel"]` | 173.878 ms (5%) | | 85.40 MiB (1%) | 24523 |
| `["simulation", "in-place"]` | 381.249 ms (5%) | | 45.22 MiB (1%) | 10880 |
| `["simulation", "out-of-place NL"]` | 141.180 ms (5%) | | 221.44 MiB (1%) | 17834 |
| `["simulation", "out-of-place f32"]` | 1.094 s (5%) | 58.440 ms | 957.59 MiB (1%) | 16344 |
| `["simulation", "out-of-place"]` | 1.173 s (5%) | 143.314 ms | 1.85 GiB (1%) | 16345 |
| `["spatial", "vector"]` | 4.100 ns (5%) | | | |
| `["spatial", "vector1D"]` | 2.300 ns (5%) | | | |
Here's a list of all the benchmark groups executed by this job:
["interactions"]
["protein"]
["simulation"]
["spatial"]
```
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  Microsoft Windows [Version 10.0.22622.290]
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz:
         speed       user   nice      sys       idle     irq
  #1  1800 MHz    8661406      0  5136593  132044078  486000  ticks
  #2  1800 MHz    6440359      0  2141406  137259640   37015  ticks
  #3  1800 MHz   11921468      0  4730437  129189515  124781  ticks
  #4  1800 MHz    8069671      0  2425656  135346078   36500  ticks
  #5  1800 MHz    9257140      0  3903109  132681156   87687  ticks
  #6  1800 MHz    6544859      0  2768109  136528437   45093  ticks
  #7  1800 MHz    8648484      0  3522203  133670718   79859  ticks
  #8  1800 MHz    7197796      0  2910765  135732843   51000  ticks
  Memory: 7.911350250244141 GB (2353.859375 MB free)
  Uptime: 515542.0 sec
  Load Avg: 0.0 0.0 0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
```
JULIA_NUM_THREADS => 8
Below is a table of this job's results, obtained by running the benchmarks. The values listed in the `ID` column have the structure `[parent_group, child_group, ..., key]`, and can be used to index into the BaseBenchmarks suite to retrieve the corresponding benchmarks. The percentages accompanying time and memory values in the table below are noise tolerances. The "true" time/memory value for a given benchmark is expected to fall within this percentage of the reported value. An empty cell means that the value was zero.
| ID | time | GC time | memory | allocations |
|---|---|---|---|---|
| `["interactions", "Coulomb energy"]` | 3.200 ns (5%) | | | |
| `["interactions", "Coulomb force"]` | 4.100 ns (5%) | | | |
| `["interactions", "HarmonicBond energy"]` | 4.100 ns (5%) | | | |
| `["interactions", "HarmonicBond force"]` | 9.710 ns (5%) | | | |
| `["interactions", "LennardJones energy"]` | 6.507 ns (5%) | | | |
| `["interactions", "LennardJones force"]` | 7.808 ns (5%) | | | |
| `["protein", "in-place NL parallel"]` | 4.455 s (5%) | 36.157 ms | 665.59 MiB (1%) | 13560 |
| `["simulation", "in-place NL parallel"]` | 96.624 ms (5%) | | 93.23 MiB (1%) | 42533 |
| `["simulation", "in-place NL"]` | 87.767 ms (5%) | | 54.43 MiB (1%) | 16962 |
| `["simulation", "in-place f32"]` | 395.863 ms (5%) | | 28.56 MiB (1%) | 16085 |
| `["simulation", "in-place parallel"]` | 228.786 ms (5%) | | 74.06 MiB (1%) | 34116 |
| `["simulation", "in-place"]` | 406.107 ms (5%) | | 47.25 MiB (1%) | 16086 |
| `["simulation", "out-of-place NL"]` | 157.556 ms (5%) | 11.495 ms | 226.76 MiB (1%) | 17831 |
| `["simulation", "out-of-place f32"]` | 1.001 s (5%) | 52.517 ms | 957.59 MiB (1%) | 16341 |
| `["simulation", "out-of-place"]` | 1.254 s (5%) | 151.738 ms | 1.85 GiB (1%) | 16342 |
| `["spatial", "vector"]` | 3.500 ns (5%) | | | |
| `["spatial", "vector_1D"]` | 2.600 ns (5%) | | | |
Here's a list of all the benchmark groups executed by this job:
["interactions"]
["protein"]
["simulation"]
["spatial"]
```
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  Microsoft Windows [Version 10.0.22622.290]
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz:
         speed       user   nice      sys       idle     irq
  #1  1800 MHz    8981265      0  5262000  136231359  495390  ticks
  #2  1800 MHz    6658812      0  2184703  141630437   37515  ticks
  #3  1800 MHz   12445343      0  4917578  133111046  127578  ticks
  #4  1800 MHz    8474062      0  2482984  139516906   37093  ticks
  #5  1800 MHz    9686515      0  4026343  136761093   89359  ticks
  #6  1800 MHz    6781875      0  2873421  140818656   46203  ticks
  #7  1800 MHz    9083109      0  3624734  137766093   81187  ticks
  #8  1800 MHz    7482703      0  2994078  139997171   51875  ticks
  Memory: 7.911350250244141 GB (2537.6015625 MB free)
  Uptime: 520174.0 sec
  Load Avg: 0.0 0.0 0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
```
(I realized later that an implementation using `n_threads` without FLoops.jl was straightforward; I benchmarked it as well, as I didn't find using FLoops.jl to be any better than this.)
JULIA_NUM_THREADS => 8
Below is a table of this job's results, obtained by running the benchmarks. The values listed in the `ID` column have the structure `[parent_group, child_group, ..., key]`, and can be used to index into the BaseBenchmarks suite to retrieve the corresponding benchmarks. The percentages accompanying time and memory values in the table below are noise tolerances. The "true" time/memory value for a given benchmark is expected to fall within this percentage of the reported value. An empty cell means that the value was zero.
| ID | time | GC time | memory | allocations |
|---|---|---|---|---|
| `["interactions", "Coulomb energy"]` | 3.200 ns (5%) | | | |
| `["interactions", "Coulomb force"]` | 4.000 ns (5%) | | | |
| `["interactions", "HarmonicBond energy"]` | 4.100 ns (5%) | | | |
| `["interactions", "HarmonicBond force"]` | 7.200 ns (5%) | | | |
| `["interactions", "LennardJones energy"]` | 6.500 ns (5%) | | | |
| `["interactions", "LennardJones force"]` | 7.800 ns (5%) | | | |
| `["protein", "in-place NL parallel"]` | 1.548 s (5%) | 8.766 ms | 703.06 MiB (1%) | 14551 |
| `["simulation", "in-place NL parallel"]` | 74.642 ms (5%) | | 91.15 MiB (1%) | 41563 |
| `["simulation", "in-place NL"]` | 113.148 ms (5%) | | 52.41 MiB (1%) | 27856 |
| `["simulation", "in-place f32"]` | 461.711 ms (5%) | | 26.74 MiB (1%) | 25532 |
| `["simulation", "in-place parallel"]` | 176.175 ms (5%) | | 81.97 MiB (1%) | 32074 |
| `["simulation", "in-place"]` | 436.558 ms (5%) | | 42.68 MiB (1%) | 25533 |
| `["simulation", "out-of-place NL"]` | 149.989 ms (5%) | | 220.68 MiB (1%) | 22236 |
| `["simulation", "out-of-place f32"]` | 1.109 s (5%) | 53.302 ms | 957.71 MiB (1%) | 20545 |
| `["simulation", "out-of-place"]` | 1.186 s (5%) | 135.357 ms | 1.85 GiB (1%) | 20546 |
| `["spatial", "vector"]` | 4.200 ns (5%) | | | |
| `["spatial", "vector_1D"]` | 2.000 ns (5%) | | | |
Here's a list of all the benchmark groups executed by this job:
["interactions"]
["protein"]
["simulation"]
["spatial"]
```
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  Microsoft Windows [Version 10.0.22622.290]
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz:
         speed       user   nice      sys       idle     irq
  #1  1800 MHz   10084656      0  5730796  146417234  525437  ticks
  #2  1800 MHz    7418031      0  2391609  152422375   39875  ticks
  #3  1800 MHz   13796375      0  5415328  143020328  135875  ticks
  #4  1800 MHz    9811546      0  2773312  149647156   39750  ticks
  #5  1800 MHz   10899484      0  4502187  146830343   96078  ticks
  #6  1800 MHz    7570281      0  3204093  151457640   49656  ticks
  #7  1800 MHz   10411234      0  4017750  147803031   85328  ticks
  #8  1800 MHz    8357531      0  3356828  150517656   55265  ticks
  Memory: 7.911350250244141 GB (3223.921875 MB free)
  Uptime: 531932.0 sec
  Load Avg: 0.0 0.0 0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
```
```
124.347514 seconds (187.72 k allocations: 8.983 GiB, 1.00% gc time)
 53.451386 seconds (261.16 k allocations: 12.740 GiB, 2.51% gc time)
System with 15954 atoms, box size Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}[5.676 nm, 5.6627 nm, 6.2963000000000005 nm]
214.031827 seconds (210.31 k allocations: 8.091 GiB, 0.69% gc time, 0.00% compilation time)
 83.212934 seconds (330.99 k allocations: 10.605 GiB, 1.85% gc time)
System with 15954 atoms, boundary CubicBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}(Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}[5.676 nm, 5.6627 nm, 6.2963000000000005 nm])
136.278100 seconds (249.48 k allocations: 7.559 GiB, 3.17% gc time)
 53.304665 seconds (320.60 k allocations: 11.312 GiB, 2.14% gc time)
System with 15954 atoms, boundary CubicBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}(Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}[5.676 nm, 5.6627 nm, 6.2963000000000005 nm])
```
The pure Julia implementation benchmarks seem to show close to no change for `n_threads=Threads.nthreads()`, i.e. `parallel=true`. For the non-parallel case I think it can be improved by adding a separate branch for `n_threads=1` (which I think is currently slower due to the overhead of calling `@threads`). For now it is something like:
```julia
function forces(s::System{D, false}, neighbors=nothing;
                n_threads::Integer=Threads.nthreads()) where D
    n_atoms = length(s)
    fs_chunks = [ustrip_vec.(zero(s.coords)) for i in 1:n_threads]

    for inter in values(s.pairwise_inters)
        if inter.nl_only
            if isnothing(neighbors)
                error("An interaction uses the neighbor list but neighbors is nothing")
            end
            basesize = max(1, Int(ceil(neighbors.n / n_threads)))
            chunks = [i:min(i + basesize - 1, neighbors.n) for i in 1:basesize:neighbors.n]
            # @threads needs collect to iterate over the enumerated chunks
            Threads.@threads for (id, rng) in collect(enumerate(chunks))
                for ni in rng
                    i, j, w = neighbors.list[ni]
                    Molly.force!(fs_chunks[id], inter, s, i, j, s.force_units, w)
                end
            end
        else
            basesize = max(1, Int(ceil(n_atoms / n_threads)))
            chunks = [i:min(i + basesize - 1, n_atoms) for i in 1:basesize:n_atoms]
            Threads.@threads for (id, rng) in collect(enumerate(chunks))
                for i in rng
                    for j in 1:(i - 1)
                        Molly.force!(fs_chunks[id], inter, s, i, j, s.force_units)
                    end
                end
            end
        end
    end

    fs = sum(fs_chunks)

    for inter_list in values(s.specific_inter_lists)
        sparse_vec = specific_force(inter_list, s.coords, s.boundary, s.force_units, n_atoms)
        fs .+= ustrip_vec.(Array(sparse_vec))
    end

    for inter in values(s.general_inters)
        # Force type not checked
        fs .+= ustrip_vec.(forces(inter, s, neighbors))
    end

    return fs * s.force_units
end
```
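To make the chunking scheme above concrete, here is how the ranges come out for some small, made-up numbers:

```julia
# Hypothetical sizes, just to show the partitioning
n, n_threads = 10, 4
basesize = max(1, Int(ceil(n / n_threads)))                 # 3
chunks = [i:min(i + basesize - 1, n) for i in 1:basesize:n]
# 4-element Vector{UnitRange{Int64}}: [1:3, 4:6, 7:9, 10:10]
```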
Thanks for the benchmarks. If the implementation without the changes is performing better than or on par with the others, then it might just be best to merge this PR as-is and leave optimising the force summation to another time? Then you can get on with the REMD stuff using whatever approach is appropriate there, FLoops.jl or not.
The `find_neighbors` reduction still seems to perform much better than both the current version in dev Molly and the implementation without `@floop`, according to some benchmarks of mine (possibly because it is a natural reduction that FLoops.jl is able to optimise). So maybe it is better to keep it that way?
Benchmarks, using a system with 1000 atoms and `DistanceNeighborFinder`:

- 14.921 ms (17 allocations: 4.06 MiB); on 8 threads: 6.473 ms (147 allocations: 16.22 MiB)
- 15.677 ms (60 allocations: 7.12 MiB); on 8 threads: 11.547 ms (122 allocations: 12.16 MiB)
Yes, use FLoops.jl for `find_neighbors` then, that's a good speedup. But leave the force summation as it was before this PR, given the improvement was unclear? Let me know if you are making any further changes, otherwise I will merge.
No, I am not making any more changes, you can merge this. Maybe I will try to revisit the `forces` function in the future.
Okay, I just understood why you were pushing for the forces.jl change: currently any case of `n_threads` greater than 1 will use all threads. I was under the impression that it could already use `n_threads` and that the change was just altering the batching, but that's clearly not the case. That change is hence essential to control `n_threads` for REMD. Could you add the forces.jl change back and then I'll merge, thanks.
Done. I also removed `using Base.Threads`, as I was explicitly using the `Threads` namespace everywhere, so there is no need for it.
:tada:
The following changes are made in the PR:

- `parallel` is replaced by `n_threads` throughout the library, which can be used to specify the number of threads.
- `@floop` is used instead of `@threads` to make use of `n_threads`.

Changes yet to be done:

- `forces` (still uses `Threads.@threads` for passing tests)
- `find_neighbors`
- `log_property!` method for `TimeCorrelationLogger`
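For illustration, a hypothetical pair of calls using the signatures shown earlier in this thread (`sys` stands in for an already-constructed `System` with a `DistanceNeighborFinder` attached):

```julia
# Both functions now take the number of threads as a keyword
neighbors = find_neighbors_floop(sys, sys.neighbor_finder; n_threads=4)
fs = forces(sys, neighbors; n_threads=4)
```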