SPECFEM / specfem3d

SPECFEM3D_Cartesian simulates acoustic (fluid), elastic (solid), coupled acoustic/elastic, poroelastic or seismic wave propagation in any type of conforming mesh of hexahedra (structured & unstructured).
https://specfem.org
GNU General Public License v3.0

CUDA memory error! #1344

Open payoubi opened 5 years ago

payoubi commented 5 years ago

Hi,

I've successfully compiled specfem3d with cuda. When I run small models, it works fine, but for bigger models (1 million elements for example), it returns CUDA error !!!!! <out of memory> !!!!! at CUDA call error code: # 1001

while the gpu information says

rank 0: GPU memory usage: used = 1086.875000 MB, free = 906.875000 MB, total = 1993.750000 MB

The gpu version is "Quadro K620".

Do you have any idea what might cause the problem?


danielpeter commented 5 years ago

hi,

could you attach the output_solver.txt file for this issue?

there might be multiple MPI processes running on the same card, or other visualization applications may be occupying memory if you also use your Quadro card for your display. you can check that with the nvidia-smi command.

best, daniel

payoubi commented 5 years ago

This is the current usage of the graphics card. Other processes are using it, but not heavily; their total is nowhere near the card's capacity:

Fri Aug 23 09:56:55 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K620         On   | 00000000:03:00.0  On |                  N/A |
| 37%   51C    P0     2W /  30W |    681MiB /  1993MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     16355      G   /usr/lib/xorg/Xorg                            20MiB |
|    0     16588      G   /usr/lib/xorg/Xorg                           230MiB |
|    0     17489      G   /usr/bin/gnome-shell                         299MiB |
|    0     17747      G   ...quest-channel-token=8468225398656622040   125MiB |
+-----------------------------------------------------------------------------+

I've also attached the solver output. output_solver.txt
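
To make the "not that much" claim concrete, the per-process figures can be added up. The following is a minimal sketch in Python, assuming the process rows keep the layout shown in the nvidia-smi output above; the sample text and the helper name `process_usage_mib` are illustrative and not part of SPECFEM3D or nvidia-smi.

```python
import re

# Process rows copied from the nvidia-smi output above (sample text, not live data).
SAMPLE = """\
|    0     16355      G   /usr/lib/xorg/Xorg                            20MiB |
|    0     16588      G   /usr/lib/xorg/Xorg                           230MiB |
|    0     17489      G   /usr/bin/gnome-shell                         299MiB |
|    0     17747      G   ...quest-channel-token=8468225398656622040   125MiB |
"""

# Fields: GPU index, PID, type (e.g. C or G), process name, usage in MiB.
ROW = re.compile(r"^\|\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)MiB\s+\|$")

def process_usage_mib(text):
    """Return a list of (pid, name, MiB) for each process row found in text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append((int(m.group(2)), m.group(4), int(m.group(5))))
    return rows

rows = process_usage_mib(SAMPLE)
total = sum(mib for _, _, mib in rows)
print(f"{len(rows)} processes hold {total} MiB")  # prints "4 processes hold 674 MiB"
```

The 674 MiB sum is close to the 681MiB that nvidia-smi reports as used, so display processes alone take roughly a third of the 1993MiB card before the solver allocates anything.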

payoubi commented 5 years ago

And the error message is: error_message_000000.txt

iamagoofymonkey commented 5 years ago

Hi payoubi,

It might be useful if you run the following command while you are running specfem3d: $> watch -d -n 0.1 nvidia-smi

This will run nvidia-smi every 0.1 seconds and highlight the changes. This will hopefully provide some insight into how much memory is free and how much is being allocated while specfem3d is running.

I think it is also possible that you may get a memory error if there is enough total free memory but not enough contiguous memory (memory fragmentation).

-Thomas
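
If watching the screen is impractical, the same polling can be scripted so that the peak is recorded. This is only a sketch: it assumes `nvidia-smi` is on the PATH and relies on its `--query-gpu` and `--format=csv,noheader,nounits` options; the function names `parse_sample` and `watch_peak` are made up for illustration, and because each nvidia-smi call takes time the real sampling interval will be longer than the nominal 0.1 s.

```python
import subprocess
import time

# Ask nvidia-smi for used and free memory only, as plain numbers in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.free",
         "--format=csv,noheader,nounits"]

def parse_sample(line):
    """Parse one 'used, free' CSV line (values in MiB) into two ints."""
    used, free = (int(field.strip()) for field in line.split(","))
    return used, free

def watch_peak(samples=50, interval=0.1):
    """Poll the first GPU and return the highest 'used' and lowest 'free' seen."""
    peak_used, min_free = 0, None
    for _ in range(samples):
        out = subprocess.run(QUERY, capture_output=True, text=True,
                             check=True).stdout
        used, free = parse_sample(out.splitlines()[0])  # first line = first GPU
        peak_used = max(peak_used, used)
        min_free = free if min_free is None else min(min_free, free)
        time.sleep(interval)
    return peak_used, min_free

# Usage (run in a second terminal while the solver is starting up):
# peak_used, min_free = watch_peak()
```

Note that this only reports totals; it cannot show whether a failed allocation was caused by fragmentation rather than by a plain shortage of free memory.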

jucguerraos commented 4 years ago

Hi,

I have the same problem. I am using an NVIDIA GTX 1060 with 6 GB.

danielpeter commented 4 years ago

there is not much one can do other than splitting up a bigger simulation onto multiple GPUs. check your output_solver.txt for the lines (like the one provided above):

 preparing fields and constants on GPU devices

   minimum memory requested     :    11886.109104156494      MB per process

this is an estimate of the memory needed on your GPU (per MPI process). the rest is up to you, i.e. how you set up your simulation: how many MPI processes you want on how many GPUs etc.

thus, the GPU memory pretty much determines the resolution limit of your simulations. but if you want to/must go with a specific resolution beyond that, you need to make the physics cheaper: turn off attenuation, use acoustic rather than elastic and you'll need less GPU memory...
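
The check described above can be sketched in a few lines of Python. It assumes the "minimum memory requested" line has the format quoted in this comment; the helper names `requested_mb` and `fits` are hypothetical, the free-memory figure is an example value for an 8 GB card rather than a measurement, and MB and MiB are treated as the same unit for this rough comparison.

```python
import re

# Line format as quoted from output_solver.txt in the comment above.
SAMPLE = "   minimum memory requested     :    11886.109104156494      MB per process"

PATTERN = re.compile(r"minimum memory requested\s*:\s*([0-9.]+)\s*MB per process")

def requested_mb(text):
    """Return the solver's per-process memory estimate in MB, or None if absent."""
    m = PATTERN.search(text)
    return float(m.group(1)) if m else None

def fits(requested, free_mb):
    """True if the per-process estimate fits into the free GPU memory."""
    return requested <= free_mb

need = requested_mb(SAMPLE)
free_mb = 8188 - 736  # example card: 8188 MiB total, 736 MiB already in use
print(fits(need, free_mb))  # prints False: about 11886 MB needed, 7452 available
```

In practice `requested_mb` would be given the full contents of output_solver.txt; when `fits` returns False, the options are the ones listed above: more MPI processes on more GPUs, or cheaper physics.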

jucguerraos commented 4 years ago

Yes, I only had to reduce the number of spectral elements (from 176 to 96), and it was possible to run it on the CPU, because the 6 GB of GPU memory was not enough.

HakunanMatatat commented 9 months ago

CUDA error !!!!! !!!!! at CUDA call error code: # 2402
Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
Abort(1) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
Abort(1) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
Abort(1) on node 6 (rank 6 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060        On  | 00000000:01:00.0  On |                  N/A |
|  0%   39C    P5              N/A / 115W |    736MiB /  8188MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

hi, I have the same problem.

danielpeter commented 9 months ago

you're running out of memory on your GPU card. that is, the simulation is too big to fit onto your GPU.