Closed dirypan closed 3 weeks ago
Do you have any limitation on the resources that can be allocated on the server GPU when running a program?
It looks like you cannot use that much memory, although a medium of 10×5×6 = 300 cells shouldn't be too big.
Can you print the memory usage after trying to load the code?
Maybe instead of evolving directly you can just upload the code to the platform and see how many resources it is taking.
#Code....
#...
# com = Community(....)
println(CUDA.memory_status()) #Check memory
loadToPlatform!(com)
println(CUDA.memory_status()) #Check memory
Thank you for the help.
Here are the results from the server when I print CUDA.memory_status():
Effective GPU memory usage: 0.53% (428.625 MiB/79.151 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)
nothing
Effective GPU memory usage: 0.67% (544.625 MiB/79.151 GiB)
Memory pool usage: 1.446 MiB (32.000 MiB reserved)
nothing
And it has exactly the same error if I try it with evolve.
On my own computer, it prints out
Effective GPU memory usage: 50.89% (11.444 GiB/22.488 GiB)
Memory pool usage: 2.803 GiB (10.188 GiB reserved)
nothing
Effective GPU memory usage: 50.89% (11.444 GiB/22.488 GiB)
Memory pool usage: 2.803 GiB (10.188 GiB reserved)
nothing
And it will run smoothly. All other CUDA info is the same as before. I guess it is not a memory constraint, since now on the server CUDA.versioninfo() prints:
1 device:
0: NVIDIA A100-SXM4-80GB (sm_80, 78.619 GiB / 80.000 GiB available)
I should have a total of 80 GB of memory to use on the server.
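This can be confirmed directly from Julia (a sketch; CUDA.total_memory and CUDA.available_memory report bytes for the active device):

```julia
using CUDA

# Query the active device's memory in GiB; on the A100 above this should
# report roughly 80 GiB total.
println("total:     ", CUDA.total_memory() / 2^30, " GiB")
println("available: ", CUDA.available_memory() / 2^30, " GiB")
```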
The only thing that comes to my mind now is that the server has a limitation on the memory that you can send to it.
To make sure that it is not anything related to the package, could you try to create CUDA arrays of different sizes and see if the server accepts them?
println(CUDA.memory_status()) #Check memory
CUDA.zeros(100,6,5)
println(CUDA.memory_status()) #Check memory
and check up to what declaration size it raises the error. I would advise running this on the server in the REPL, so the second print has time to show the actual info: as you can see above, the memory has not increased after uploading the Community object. That is weird; maybe the info simply had not been updated yet by the time the second memory-usage call executed.
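As a sketch of that test (the loop and the particular sizes are just an example, not part of the package):

```julia
using CUDA

# Sweep a few allocation sizes and print the pool status after each one,
# stopping at the first size the server rejects (sizes are arbitrary examples).
for n in (5, 50, 500, 5000)
    try
        a = CUDA.zeros(100, 6, n)   # Float32 by default
        println("n = $n: allocated ", sizeof(a), " bytes")
        CUDA.memory_status()        # prints usage directly to stdout
    catch err
        println("n = $n failed: ", sprint(showerror, err))
        break
    end
end
```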
The reason the memory usage seems not to have increased might just be that CUDA.zeros(100,6,5)
is too small. This is the printout if I consecutively run three different sizes. It seems the server accepts them, still way below the memory restriction. I think it might be related to some low-level CUDA implementation of the step!
function.
println(CUDA.memory_status()) #Check memory
CUDA.zeros(100,6,5)
println(CUDA.memory_status()) #Check memory
Effective GPU memory usage: 1.23% (996.625 MiB/79.151 GiB)
Memory pool usage: 11.719 KiB (32.000 MiB reserved)
nothing
Effective GPU memory usage: 1.23% (996.625 MiB/79.151 GiB)
Memory pool usage: 23.438 KiB (32.000 MiB reserved)
nothing
println(CUDA.memory_status()) #Check memory
CUDA.zeros(100,60,50)
println(CUDA.memory_status()) #Check memory
Effective GPU memory usage: 1.23% (996.625 MiB/79.151 GiB)
Memory pool usage: 23.438 KiB (32.000 MiB reserved)
nothing
Effective GPU memory usage: 1.23% (996.625 MiB/79.151 GiB)
Memory pool usage: 1.167 MiB (32.000 MiB reserved)
nothing
println(CUDA.memory_status()) #Check memory
CUDA.zeros(100,600,500)
println(CUDA.memory_status()) #Check memory
Effective GPU memory usage: 1.23% (996.625 MiB/79.151 GiB)
Memory pool usage: 1.167 MiB (32.000 MiB reserved)
nothing
Effective GPU memory usage: 1.35% (1.067 GiB/79.151 GiB)
Memory pool usage: 115.608 MiB (128.000 MiB reserved)
nothing
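Those pool increments line up with CUDA.zeros defaulting to Float32 (4 bytes per element); a quick back-of-the-envelope check, runnable without a GPU:

```julia
# Expected allocation sizes for the three test arrays above, assuming the
# CUDA.zeros default element type of Float32 (4 bytes per element).
for dims in ((100, 6, 5), (100, 60, 50), (100, 600, 500))
    bytes = prod(dims) * 4
    println(dims, " -> ", round(bytes / 2^20, digits = 3), " MiB")
end
# (100, 6, 5)     -> ~0.011 MiB (≈ 11.7 KiB, the first pool jump above)
# (100, 60, 50)   -> ~1.144 MiB
# (100, 600, 500) -> ~114.4 MiB
```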
Mmmm,
can you try to change the integrator of the medium to something very basic like an Euler integrator? This will not solve the problem, since Euler is not a good integrator, but it may give clues about where the problem is.
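For reference, switching a plain DifferentialEquations.jl problem to a fixed-step explicit Euler solver looks like this (a toy ODE, independent of the package's own integrator selection, which is not shown here):

```julia
using DifferentialEquations

# Toy decay ODE just to show the solver swap; Euler() needs an explicit dt.
f(u, p, t) = -0.5 .* u
prob = ODEProblem(f, [1.0, 2.0], (0.0, 1.0))

sol = solve(prob, Euler(); dt = 0.01)   # basic fixed-step explicit Euler
println(sol.u[end])
```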
It seems to crash during mediumStepDE!(community::Community),
which for the default medium integrator calls the external DifferentialEquations package.
It could be that the solver is asking for a lot of resources in order to integrate appropriately. If that is the case, we may have to look carefully at the declaration.
I tried a few other solvers in DifferentialEquations.jl, including Euler, but they all raise the same error. I also found that the threshold for the error seems to be NMedium[1]*NMedium[2]*NMedium[3] > 256:
at least for NMedium = [3,3,28] (252 cells) it will run, but not for NMedium = [3,3,29] (261 cells).
What's more, I found that if I decrease the dimension to 2, there is no problem with much larger medium grid sizes (say 2000 by 2000), so the problem might be related to the third-dimension declaration.
Specifically, as long as I remove the diffusion term along any dimension (say @∂2(3,L))
from the mediumODE,
everything works perfectly. But if I have all three terms together, it will fail.
Okay, so the problem seems to be the @∂2(3,L)
operator. This operator is simply a wrapper for code that is substituted in to produce the diffusion term, so you can write your own discretization accessing the positions in the matrix yourself. That is, what the operator intrinsically does is insert the following code:
@∂2(1,L) = (L[i1_+1,i2_,i3_] - 2L[i1_,i2_,i3_] + L[i1_-1,i2_,i3_])/(dx^2)
So maybe there is a bug in this part. Can you try to write your own discretization and check if that solves the problem? If it does, there may be a problem in the operator declaration.
It is still weird that the problem appears above a certain system size rather than every time.
Also, have you tried with the CPU instead? Does it work?
So I replaced the wrapper with the following code
dt(L) = DL*( (L[i1_+1,i2_,i3_]-2*L[i1_,i2_,i3_]+L[i1_-1,i2_,i3_])/(dx^2) +
(L[i1_,i2_+1,i3_]-2*L[i1_,i2_,i3_]+L[i1_,i2_-1,i3_])/(dy^2)+
(L[i1_,i2_,i3_+1]-2*L[i1_,i2_,i3_]+L[i1_,i2_,i3_-1])/(dz^2)) - rL*L
It raises the same problem above the mentioned threshold, so the wrapper is not the problem.
It works well on the CPU; it is just too slow when I want to apply it to real simulations.
What happens when you remove the diffusion in the z axis?
And what happens if you use equal grid sizes in all axes?
If I remove diffusion on any axis, it will work.
The simBox sizes in x, y, z do not affect the problem; I changed them with no effect. If Nx = Ny = Nz = N, then it works for N = 6 but not for N = 7, I guess because 6^3 = 216 < 256 < 343 = 7^3.
But you keep having the problem if you keep the diffusion in the x and y axis but remove it in the z axis?
Or as far as you remove diffusion in a specific axis, it works?
As long as I keep only two axes, it does not matter which one I remove; it will work. It only has a problem if I keep all three.
Okay, this is very weird, because it is a bug that does not happen on every platform, nor on every GPU, nor in every situation.
I will need some days to try to figure it out. I will keep you posted.
I have found a workaround: I defined all parameters and variables as Float32 and set their values with Float32 literals. This eliminates the error. I have a few questions about this:
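For anyone hitting the same issue, the workaround amounts to something like the following (the parameter names DL and rL and the grid size here are hypothetical placeholders, not the actual model):

```julia
using CUDA

# Declare every scalar with Float32 literals (the 1.0f0 syntax) so no Float64
# value reaches the GPU kernels; DL and rL are hypothetical model parameters.
DL = 0.1f0    # diffusion coefficient
rL = 0.01f0   # degradation rate

L = CUDA.zeros(Float32, 10, 5, 6)   # medium grid allocated explicitly as Float32
println(eltype(L))                  # Float32
```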
Most GPUs work with Float32 only and usually give an error otherwise. Only Quadro and other high-performance GPUs work with Float64.
I guess this is the problem, since the A100 can do Float64 and my 3090 Ti can't.
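A quick way to check whether a given GPU actually runs Float64 kernels (a sketch, not a benchmark):

```julia
using CUDA

a = CUDA.ones(Float64, 1024)
b = 2.0 .* a .+ 1.0        # forces compilation of a Float64 broadcast kernel
println(eltype(b))         # Float64 if double-precision kernels compile
println(Array(b)[1])       # 3.0
```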
Hi, I am trying to simulate a model with a GPU on a server; here is the simulation code:
Here is the error:
If I switch
NMedium = [10,5,6]
to NMedium = [10,4,6],
this code runs without problem. (I didn't add an agent here because I found that changing the agent number has no effect on this error.) Here is the
CUDA.versioninfo():
Interestingly, the code works well on my own computer, where
CUDA.versioninfo() shows:
Can you help with this problem? What could be the problem with the server or the code?