ESCOMP / CAM

Community Atmosphere Model

mem leak in micro_pumas_cam_tend #1136

Open jedwards4b opened 2 weeks ago

jedwards4b commented 2 weeks ago

What happened?

We have detected a memory leak in the micro_pumas_cam_tend subroutine. I am running case SMS_Ly1.ne30pg3_ne30pg3_mg17.FLTHIST.derecho_intel. I have also found the same issue with the gnu compiler, which I believe confirms that this is a code problem and not a compiler problem. In /glade/derecho/scratch/jedwards/SMS_Ly1.ne30pg3_ne30pg3_mg17.FLTHIST.derecho_intel.20240826_071726_zi82rc/SourceMods/src.cam/micro_pumas_cam.F90 I have added some code that prints the rss and other memory stats at various points in the subroutine. It shows an increase in rss between the following two checkpoints:

dec2051.hsn.de.hpc.ucar.edu 0: micro_pumas_cam_tend start sysmem size=2192.1 MB rss=896.8 MB share=113.7 MB text=53.5 MB datastack=0.0 MB
dec2051.hsn.de.hpc.ucar.edu 0: micro_pumas_cam_tend 1 sysmem size=2192.1 MB rss=897.0 MB share=113.7 MB text=53.5 MB datastack=0.0 MB

and also between:

dec2051.hsn.de.hpc.ucar.edu 0: micro_pumas_cam_tend 12 sysmem size=2204.1 MB rss=904.8 MB share=113.8 MB text=53.5 MB datastack=0.0 MB
dec2051.hsn.de.hpc.ucar.edu 0: micro_pumas_cam_tend end sysmem size=2200.1 MB rss=904.9 MB share=113.8 MB text=53.5 MB datastack=0.0 MB
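For anyone who wants to tabulate these numbers, here is a minimal Python sketch that pulls the rss value out of each checkpoint line and prints the difference between consecutive samples. The log text is copied from the lines quoted above; the helper names (`rss_by_checkpoint`, `rss_deltas`) are made up for illustration and are not part of CAM or CIME. Checkpoints 2 through 11 are not quoted in this issue, so the "1 -> 12" delta spans everything in between.

```python
import re

# Checkpoint lines copied from the log output quoted in this issue.
# Checkpoints 2..11 were not quoted, so they are absent here as well.
LOG = """\
dec2051.hsn.de.hpc.ucar.edu 0: micro_pumas_cam_tend start sysmem size=2192.1 MB rss=896.8 MB share=113.7 MB text=53.5 MB datastack=0.0 MB
dec2051.hsn.de.hpc.ucar.edu 0: micro_pumas_cam_tend 1 sysmem size=2192.1 MB rss=897.0 MB share=113.7 MB text=53.5 MB datastack=0.0 MB
dec2051.hsn.de.hpc.ucar.edu 0: micro_pumas_cam_tend 12 sysmem size=2204.1 MB rss=904.8 MB share=113.8 MB text=53.5 MB datastack=0.0 MB
dec2051.hsn.de.hpc.ucar.edu 0: micro_pumas_cam_tend end sysmem size=2200.1 MB rss=904.9 MB share=113.8 MB text=53.5 MB datastack=0.0 MB
"""

# Captures the checkpoint label and the rss value (in MB) from one line.
PATTERN = re.compile(r"micro_pumas_cam_tend (\S+) sysmem .*?rss=([\d.]+) MB")


def rss_by_checkpoint(text):
    """Return a list of (checkpoint_label, rss_mb) in log order."""
    return [(m.group(1), float(m.group(2))) for m in PATTERN.finditer(text)]


def rss_deltas(samples):
    """Return (from_label, to_label, delta_mb) for consecutive samples."""
    return [
        (a[0], b[0], round(b[1] - a[1], 1))
        for a, b in zip(samples, samples[1:])
    ]


samples = rss_by_checkpoint(LOG)
for src, dst, delta in rss_deltas(samples):
    print(f"{src:>5} -> {dst:<5} {delta:+.1f} MB")
```

Running the same parser over a full log with all checkpoints present would show which segment of the subroutine accounts for most of the growth.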

What are the steps to reproduce the bug?

Test SMS_Ly1.ne30pg3_ne30pg3_mg17.FLTHIST.derecho_intel shows memory usage growing over time in the med.log file.

Using the modified code in the case outlined above I have traced this memory growth to the micro_pumas_cam_tend subroutine.
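As an illustration of this kind of checkpoint instrumentation, here is a minimal Python sketch, not the Fortran code actually added to micro_pumas_cam.F90. It assumes a Linux system and assumes the numbers come from something like /proc/self/statm, whose fields resemble the size/rss/share/text/datastack labels in the log; that mapping is a guess, and the function names are hypothetical.

```python
import os

# Field order of /proc/<pid>/statm on Linux (values are page counts).
# The short names mirror the labels in the log output; that the log was
# produced from statm is an assumption, not something the issue states.
FIELDS = ("size", "rss", "share", "text", "lib", "datastack", "dirty")


def parse_statm(line, page_size):
    """Convert one statm line (page counts) to megabytes per field."""
    pages = [int(tok) for tok in line.split()]
    return {name: n * page_size / 2**20 for name, n in zip(FIELDS, pages)}


def report(label, path="/proc/self/statm"):
    """Print one checkpoint line; return the stats, or None if unavailable."""
    try:
        with open(path) as fh:
            line = fh.read()
    except OSError:
        return None  # not Linux, or /proc is not mounted
    stats = parse_statm(line, os.sysconf("SC_PAGE_SIZE"))
    shown = ("size", "rss", "share", "text", "datastack")
    print(f"{label} sysmem " + " ".join(f"{k}={stats[k]:.1f} MB" for k in shown))
    return stats


# Usage: call report() before and after the region under suspicion.
report("checkpoint start")
report("checkpoint end")
```

Calling `report` at numbered points inside a routine and comparing consecutive rss values narrows down where memory is being retained, which is the approach described above.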

Testing using the gnu compiler indicates that this issue may be unique to the intel/2023.2.1 compiler.

There is a newer compiler available on derecho, intel/2024.2.1; this would also be a good time to switch from ifort to the new ifx compiler. Testing with intel-oneapi/2024.2.1, I still see indications of a memory leak.

Here are the memory rss values for each compiler on model date 1979-01-10T00:00:00:
intel/2023.2.1: 896.75 MB
intel-oneapi/2024.2.1: 893.57 MB
gcc/12.2.0: 831.64 MB

The memusage at the end of 12 months:
intel/2023.2.1: 1506.4 MB
intel-oneapi/2024.2.1: 1231.16 MB
gcc/12.2.0: 1060.24 MB

Note that while both the intel compilers have a continuous memory increase, the gnu compiler stabilizes and does not continue to increase memory usage.
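To put the two sets of numbers side by side, here is a short Python sketch that computes the absolute and relative rss growth per compiler from the values reported above. The `growth` helper is made up for illustration. An end-minus-start difference cannot by itself distinguish a steady leak from growth that levels off, so the gnu figure should be read together with the note that its usage stabilizes.

```python
# RSS in MB, copied from the values reported above:
# (value on 1979-01-10T00:00:00, value at the end of 12 months)
rss_mb = {
    "intel/2023.2.1": (896.75, 1506.4),
    "intel-oneapi/2024.2.1": (893.57, 1231.16),
    "gcc/12.2.0": (831.64, 1060.24),
}


def growth(start, end):
    """Absolute (MB) and relative (%) growth between two rss readings."""
    delta = end - start
    return round(delta, 2), round(100.0 * delta / start, 1)


for compiler, (start, end) in rss_mb.items():
    delta, pct = growth(start, end)
    print(f"{compiler:<22} +{delta:.2f} MB ({pct:.1f}%)")
```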

What CAM tag were you using?

cam6_4_016

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

/glade/derecho/scratch/jedwards/SMS_Ly1.ne30pg3_ne30pg3_mg17.FLTHIST.derecho_intel.20240826_071726_zi82rc

Will you be addressing this bug yourself?

No

Extra info

No response

jedwards4b commented 2 weeks ago

I have tried reducing the optimization level for the file micro_pumas_cam.F90 to -O0; it did not appear to affect the memory leak.

adamrher commented 2 weeks ago

It might be worth running a FHIST compset to see if the leak is still there, as it compiles a different pumas driver than used in FLTHIST (src/physics/pumas-frozen/ instead of src/physics/pumas/).

jedwards4b commented 2 weeks ago

@adamrher You are correct; that does not show the memory leak.