leios closed this issue 1 year ago
It's not the clearest error, but the key line is `[3] OutOfGPUMemoryError`, indicating that there isn't enough GPU memory for the test.

The package currently makes poor use of GPU memory due to the requirements of Zygote, so the 4 GB of a GTX 970 isn't enough to run the 16k atoms in the protein test. Sorry about that; hopefully it will change in future.
Whoops, that's completely my bad. Sorry for the random issue then!
No it's fine, good to see people are using the software.
I was just testing this on a P100 with 16 GB of available RAM and ran into a related issue:
```
OpenMM protein comparison: Error During Test at /home/leios/projects/CESMIX/Molly.jl/test/protein.jl:57
Got exception outside of a @test
Out of GPU memory trying to allocate 1.896 GiB
Effective GPU memory usage: 99.90% (15.766 GiB/15.782 GiB)
Memory pool usage: 2.372 GiB (3.219 GiB reserved)
Stacktrace:
[1] macro expansion
@ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:320 [inlined]
[2] macro expansion
@ ./timing.jl:382 [inlined]
[3] #_alloc#170
@ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:313 [inlined]
[4] #alloc#169
@ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:299 [inlined]
[5] alloc
@ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:293 [inlined]
[6] CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64})
@ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:42
[7] similar
@ ~/.julia/packages/CUDA/DfvRa/src/array.jl:164 [inlined]
[8] permutedims(B::CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}, perm::Tuple{Int64, Int64})
@ Base ./multidimensional.jl:1560
[9] DistanceVecNeighborFinder(; nb_matrix::CuArray{Bool, 2, CUDA.Mem.DeviceBuffer}, matrix_14::CuArray{Bool, 2, CUDA.Mem.DeviceBuffer}, n_steps::Int64, dist_cutoff::Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}})
@ Molly ~/projects/CESMIX/Molly.jl/src/neighbors.jl:115
[10] System(coord_file::String, force_field::OpenMMForceField{Float64, Quantity{Float64, 𝐌, Unitful.FreeUnits{(u,), 𝐌, nothing}}, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}, Quantity{Float64, 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm^-2, mol^-1), 𝐌 𝐍^-1 𝐓^-2, nothing}}}; velocities::CuArray{SVector{3, Quantity{Float64, 𝐋 𝐓^-1, Unitful.FreeUnits{(nm, ps^-1), 𝐋 𝐓^-1, nothing}}}, 1, CUDA.Mem.DeviceBuffer}, boundary::Nothing, loggers::Tuple{}, units::Bool, gpu::Bool, gpu_diff_safe::Bool, dist_cutoff::Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, dist_neighbors::Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, implicit_solvent::Nothing, center_coords::Bool, rename_terminal_res::Bool, kappa::Quantity{Float64, 𝐋^-1, Unitful.FreeUnits{(nm^-1,), 𝐋^-1, nothing}})
@ Molly ~/projects/CESMIX/Molly.jl/src/setup.jl:773
[11] macro expansion
@ ~/projects/CESMIX/Molly.jl/test/protein.jl:164 [inlined]
[12] macro expansion
@ ~/builds/julia-1.8.1/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
[13] top-level scope
@ ~/projects/CESMIX/Molly.jl/test/protein.jl:58
[14] include(fname::String)
@ Base.MainInclude ./client.jl:476
[15] top-level scope
@ ~/projects/CESMIX/Molly.jl/test/runtests.jl:78
[16] include(fname::String)
@ Base.MainInclude ./client.jl:476
[17] top-level scope
@ none:6
[18] eval
@ ./boot.jl:368 [inlined]
[19] exec_options(opts::Base.JLOptions)
@ Base ./client.jl:276
[20] _start()
@ Base ./client.jl:522
Test Summary:             | Pass  Error  Total     Time
OpenMM protein comparison |   26      1     27  4m57.8s
ERROR: LoadError: Some tests did not pass: 26 passed, 0 failed, 1 errored, 0 broken.
in expression starting at /home/leios/projects/CESMIX/Molly.jl/test/protein.jl:57
in expression starting at /home/leios/projects/CESMIX/Molly.jl/test/runtests.jl:77
ERROR: Package Molly errored during testing
```
It seems to have flooded the available memory pool and cannot allocate more space. Could this be a garbage collection issue, where the tests run fine independently but fail when run together because memory from earlier tests has not been deallocated?
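If it is a pooling issue, one thing worth trying between test files is forcing a collection and handing cached pool memory back to the driver. This is a hedged sketch of a standard CUDA.jl idiom, not something the test suite is confirmed to be missing:

```julia
using CUDA

# CuArrays freed by Julia's GC can linger in CUDA.jl's memory pool, so a
# later test may hit OutOfGPUMemoryError even though every array from the
# previous test is unreachable. Running both steps between test files
# releases that memory.
GC.gc()           # collect unreachable CuArrays on the Julia side
CUDA.reclaim()    # return cached pool memory to the CUDA driver
```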
Re-opening this issue because it's another error on protein.jl.
How much memory are we asking users to have for this test? Maybe we should just check the available memory and skip the test on GPUs that don't have enough.
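A rough sketch of that check, assuming the dominant cost is a few n×n Int64 matrices like the one in the trace (the atom count, matrix count, and threshold here are illustrative guesses, not Molly's measured requirements):

```julia
using CUDA

n = 16_000                   # approximate atom count in the protein test
needed = 3 * n^2 * 8         # a few n×n Int64 matrices; one is n^2 * 8 B ≈ 1.9 GiB,
                             # matching the 1.896 GiB allocation in the trace
if CUDA.functional() && CUDA.available_memory() < needed
    @info "Skipping protein test: insufficient GPU memory"
else
    # include("protein.jl")  # run the test as normal
end
```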
GPU memory usage on master is very poor and I honestly don't know the range of hardware it will work on. I'm hoping to switch to the kernel setup within the next couple of months though, which should be much better. At that point it is probably worth doing a survey across different hardware and addressing any issues.
Yeah, that's fair. I'll just quietly comment out the test for now for #99
GPU memory usage should be much improved in v0.15.0.
Testing and benchmarking on different GPUs is on the todo list.
I could not get the tests to work on my GTX 970 GPU. Seems like there is an issue with `permutedims`, specifically when called in `test/protein.jl`. I think `permutedims` doesn't work on a CuArray, so I tried keeping it as an array, but eventually ran into an issue with turning `is` into an array for the `DistanceVecNeighborFinder`. I tried a bunch of different variations, so I'll just leave the unchanged error here.

This could be related to #16, but I felt it was different enough to warrant a separate issue.
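On the `permutedims` point: the later trace shows the call does run on a `CuArray`, but it eagerly allocates a full copy (the 1.896 GiB allocation). A lazy alternative is sketched below; whether the wrapper is accepted by the downstream GPU kernels is an assumption to verify, not something this trace shows:

```julia
using CUDA

A = CuArray(rand(Int64, 4, 4))    # small stand-in for the 16k×16k index matrix
B = permutedims(A, (2, 1))        # eager: allocates a second device array
C = PermutedDimsArray(A, (2, 1))  # lazy view from Base: no extra device allocation
```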