SPECFEM / specfem3d

SPECFEM3D_Cartesian simulates acoustic (fluid), elastic (solid), coupled acoustic/elastic, poroelastic or seismic wave propagation in any type of conforming mesh of hexahedra (structured & unstructured).
https://specfem.org
GNU General Public License v3.0

🐛 [BUG] - Inconsistent gradients computed using CPU and GPU #1755

Closed jlulh closed 1 week ago

jlulh commented 1 month ago

Description

I built the same model and ran the inversion on CPU and on GPU, and the computed kernels are very different. I tested the latest specfem3d version 4.1.1 as well as the older versions 4.1.0 and 4.0.0, and all three show the problem.

Affected SPECFEM3D version

4.1.1(a5bb135), 4.1.0(89d1601) and 4.0.0(c97d521)

Your software and hardware environment

Ubuntu 22.04.4 LTS; gcc version 11.4.0; MPICH version 4.0; CPU: AMD EPYC 9684X; GPU: RTX 4090

Reproduction steps

I used SeisFlows and specfem3d for the inversion test. The adjoint sources computed on CPU and GPU are the same, but the kernels output by xspecfem3D are very different.

Screenshots

No response

Logs

No response

OS

No response

danielpeter commented 1 month ago

interesting - is this an issue with SPECFEM or seisflows?

maybe you could provide a small SPECFEM example setup where you see different kernel values between CPU and GPU simulations. this would help to reproduce your issue.

jlulh commented 4 weeks ago

Hi Daniel,

I hope this message finds you well. I’m glad to hear from you and apologize for my delayed response.

I have uploaded the specfem3D package I’ve been using to GitHub: https://github.com/jlulh/Specfem3d_test/. This version is based on the devel branch (a5bb135), with a few minor modifications to the following files: compute_arrays_source.f90, write_output_SU.f90, compute_kernels.f90, and compute_kernels_hess_el_cudakernel.cu.

Additionally, I have included an example (model0050_test) that I used for testing. The mesh, as well as the true and initial model files, were all generated using xmeshfem3D. I tested the kernel for a single shot, located in the model0050_test/scratch/solver/000000/ folder. You can edit model0050_test/scratch/solver/000000/DATA/Par_file to set GPU_MODE to true or false, and then run the simulation with the command mpirun -np 1 ./bin/xspecfem3D. You will see that the *_kernel.bin files written to OUTPUT_FILES/DATABASES_MPI differ between the two runs.
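One way to quantify the mismatch between the two runs is to load the same kernel file from each and compute a relative difference. The following is a minimal sketch, not part of SPECFEM3D or SeisFlows. It assumes each *_kernel.bin file holds a single Fortran unformatted sequential record of single-precision values framed by 4-byte length markers; the function names and the example paths in the comments are hypothetical.

```python
import numpy as np


def read_fortran_record(path, dtype=np.float32):
    """Read the first record of a Fortran unformatted sequential file.

    Assumed layout: int32 byte count, payload, int32 byte count.
    """
    itemsize = np.dtype(dtype).itemsize
    with open(path, "rb") as f:
        head = int(np.fromfile(f, dtype=np.int32, count=1)[0])
        data = np.fromfile(f, dtype=dtype, count=head // itemsize)
        tail = int(np.fromfile(f, dtype=np.int32, count=1)[0])
    if head != tail:
        raise ValueError(f"record markers disagree in {path}: {head} vs {tail}")
    return data


def relative_l2_difference(a, b):
    """Return ||a - b|| / ||b||, or ||a - b|| if b is identically zero."""
    a64 = np.asarray(a, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    diff = np.linalg.norm(a64 - b64)
    ref = np.linalg.norm(b64)
    return float(diff / ref) if ref > 0.0 else float(diff)


# Hypothetical usage, with the two runs copied into separate directories:
#   cpu = read_fortran_record("cpu_run/proc000000_rho_kernel.bin")
#   gpu = read_fortran_record("gpu_run/proc000000_rho_kernel.bin")
#   print(relative_l2_difference(gpu, cpu))
```

Differences near single-precision round-off would be expected between CPU and GPU runs, so a relative difference many orders of magnitude larger than that points to a real discrepancy rather than floating-point noise.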

Please let me know if you have any questions or need further clarification.

Best regards

danielpeter commented 2 weeks ago

thanks for pointing out this inconsistency! there were indeed some differences between the CPU and GPU versions in how the sources were applied in your coupled-domain setup. PR #1759 should address and fix these.

I noted that you modified the SU adjoint source reading. in the PR, I incorporated a similar fix to be able to run the kernels with only the elastic adjoint source files (0_dx_SU.adj, ..) for this coupled acoustic/elastic domain setup.

also, you seem to have modified the Hessian kernel in file compute_kernels.f90. note that the current SPECFEM3D version implements an approximate source-receiver Hessian kernel (multiplying accel() * b_accel()), as compared to your source-source Hessian modification (b_accel() * b_accel()). you would have to re-do that modification when pulling and trying out the new devel version (and the same with your SU header modification).
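To make the distinction between the two Hessian approximations concrete, here is a minimal NumPy sketch. It is not the SPECFEM3D implementation: the function name, the array layout (time step, component, grid point) and the plain time-step weighting are assumptions for illustration. Only the products accel * b_accel and b_accel * b_accel come from the comment above.

```python
import numpy as np


def hessian_approximations(accel, b_accel, dt):
    """Accumulate two approximate Hessian kernels over time.

    accel, b_accel: arrays of shape (nt, 3, npoints), assumed layout.
    dt: time step used as the integration weight (assumption).
    """
    # Source-receiver form: product of the two different acceleration fields.
    source_receiver = dt * np.einsum("tcp,tcp->p", accel, b_accel)
    # Source-source form: the b_accel field multiplied with itself.
    source_source = dt * np.einsum("tcp,tcp->p", b_accel, b_accel)
    return source_receiver, source_source


# Tiny synthetic example: 2 time steps, 3 components, 4 grid points.
accel = np.ones((2, 3, 4))
b_accel = 2.0 * np.ones((2, 3, 4))
h_sr, h_ss = hessian_approximations(accel, b_accel, dt=0.5)
```

The source-source form is non-negative at every point by construction, while the source-receiver form can change sign, so kernels preconditioned with one are not directly comparable to kernels preconditioned with the other.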

jlulh commented 1 week ago

Thank you for the fix! I have tested the updated version, and the issue is resolved. I appreciate your help and efforts.