NOAA-OWP / LGAR-C

Lumped Arid/Semi-arid Model (LASAM) simulates infiltration and surface runoff (two important components of the hydrologic cycle) based on Layered Green-Ampt with redistribution (LGAR) model
Other
1 stars 3 forks source link

Memory Leak Fixes #28

Closed hellkite500 closed 2 months ago

hellkite500 commented 3 months ago

Serious memory leaks exist in this implementation, as noted in #27. Running through the llvm address-sanitizer with leak detection, the total memory leaked by the bmi test using

lasam_standalone ./configs/config_lasam_Bushland.txt

is nearly 30 MB!

SUMMARY: AddressSanitizer: 26907277 byte(s) leaked in 512522 allocation(s).

This happens predominantly because the linked list data structure supporting the layered wetting fronts dynamically allocates list elements using malloc, but these are never freed/deleted. These lists are then copied using listCopy, between each time step to track previous state and current state, but once copied, the old memory is inaccessible and "leaks". This is extremely problematic when used in the NextGen model engine (ngen) since we potentially initialize over 800k models, and then run each for a period of time ranging from hours to decades! These memory leaks at that scale quickly exhaust system resources (as evident by the 10 day CONUS tests where #27 was first identified!).

Back of the envelop math implies this could leak up to 24 TB!!!! of memory in only a year long simulation (the length of time of the simulation used in the test above)!

This PR address the bulk of the memory leaks, especially those occurring in the BMI implementation. With these changes, the memory leaks are now minimal and seem to be related mostly to the use of fprintf and scanf and shouldn't be leaking memory for every simulation time step!

SUMMARY: AddressSanitizer: 949 byte(s) leaked in 11 allocation(s).

Care should be taken in future developments and work with this customized linked list to ensure that dynamic memory allocations are properly freed. This PR also shows where other model states and temporaries have dynamic memory allocation and had no corresponding cleanup code. Particularly with dynamic allocation in loops (such as in Update which had nested loops in a function intended to be called in a loop) dynamic memory use must be handled correctly.

Additions

Changes

Testing

  1. Ran the standalone code with address sanitizer to confirm fixed memory leaks

Notes

Todos

Checklist

Target Environment support

Merging this PR closes #27.

stcui007 commented 3 months ago

I just cloned the PR code and did a run using Bushland and Phillipsburg initial configs. Valgrind report for Bushland:

==106320== LEAK SUMMARY:
==106320==    definitely lost: 0 bytes in 0 blocks
==106320==    indirectly lost: 0 bytes in 0 blocks
==106320==      possibly lost: 0 bytes in 0 blocks
==106320==    still reachable: 1,824 bytes in 27 blocks
==106320==         suppressed: 0 bytes in 0 blocks

For Phillipsburg:

==107020== LEAK SUMMARY:
==107020==    definitely lost: 6,701,496 bytes in 279,229 blocks
==107020==    indirectly lost: 0 bytes in 0 blocks
==107020==      possibly lost: 48 bytes in 2 blocks
==107020==    still reachable: 2,424 bytes in 52 blocks
==107020==         suppressed: 0 bytes in 0 blocks

Don't understand why the initial config causes the difference.

stcui007 commented 3 months ago

I put the full report here for convenience:

 valgrind --leak-check=full ./build/lasam_standalone configs/config_lasam_Phillipsburg.txt
==110617== Memcheck, a memory error detector
==110617== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==110617== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==110617== Command: ./build/lasam_standalone configs/config_lasam_Phillipsburg.txt
==110617==
==110617== Conditional jump or move depends on uninitialised value(s)
==110617==    at 0x5E2E852: __ieee754_pow_sse2 (e_pow.c:75)
==110617==    by 0x5E403EB: pow (w_pow.c:27)
==110617==    by 0x4172E0: calc_theta_from_h(double, double, double, double, double, double) (soil_funcs.cxx:149)
==110617==    by 0x41530B: lgar_read_vG_param_file(char const*, int, double, soil_properties_*) (lgar.cxx:2299)
==110617==    by 0x40EC95: InitFromConfigFile(std::string, model_state*) (lgar.cxx:542)
==110617==    by 0x40C935: lgar_initialize(std::string, model_state*) (lgar.cxx:77)
==110617==    by 0x40748C: BmiLGAR::Initialize(std::string) (bmi_lgar.cxx:39)
==110617==    by 0x4038B6: main (bmi_main_lgar.cxx:50)
==110617==
==110617== Conditional jump or move depends on uninitialised value(s)
==110617==    at 0x5E2E88A: __ieee754_pow_sse2 (e_pow.c:86)
==110617==    by 0x5E403EB: pow (w_pow.c:27)
==110617==    by 0x4172E0: calc_theta_from_h(double, double, double, double, double, double) (soil_funcs.cxx:149)
==110617==    by 0x41530B: lgar_read_vG_param_file(char const*, int, double, soil_properties_*) (lgar.cxx:2299)
==110617==    by 0x40EC95: InitFromConfigFile(std::string, model_state*) (lgar.cxx:542)
==110617==    by 0x40C935: lgar_initialize(std::string, model_state*) (lgar.cxx:77)
==110617==    by 0x40748C: BmiLGAR::Initialize(std::string) (bmi_lgar.cxx:39)
==110617==    by 0x4038B6: main (bmi_main_lgar.cxx:50)
==110617==
==110617== Conditional jump or move depends on uninitialised value(s)
==110617==    at 0x5E2E8B0: __ieee754_pow_sse2 (e_pow.c:95)
==110617==    by 0x5E403EB: pow (w_pow.c:27)
==110617==    by 0x4172E0: calc_theta_from_h(double, double, double, double, double, double) (soil_funcs.cxx:149)
==110617==    by 0x41530B: lgar_read_vG_param_file(char const*, int, double, soil_properties_*) (lgar.cxx:2299)
==110617==    by 0x40EC95: InitFromConfigFile(std::string, model_state*) (lgar.cxx:542)
==110617==    by 0x40C935: lgar_initialize(std::string, model_state*) (lgar.cxx:77)
==110617==    by 0x40748C: BmiLGAR::Initialize(std::string) (bmi_lgar.cxx:39)
==110617==    by 0x4038B6: main (bmi_main_lgar.cxx:50)
==110617==
==110617== Conditional jump or move depends on uninitialised value(s)
==110617==    at 0x5E2E8C7: __ieee754_pow_sse2 (e_pow.c:95)
==110617==    by 0x5E403EB: pow (w_pow.c:27)
==110617==    by 0x4172E0: calc_theta_from_h(double, double, double, double, double, double) (soil_funcs.cxx:149)
==110617==    by 0x41530B: lgar_read_vG_param_file(char const*, int, double, soil_properties_*) (lgar.cxx:2299)
==110617==    by 0x40EC95: InitFromConfigFile(std::string, model_state*) (lgar.cxx:542)
==110617==    by 0x40C935: lgar_initialize(std::string, model_state*) (lgar.cxx:77)
==110617==    by 0x40748C: BmiLGAR::Initialize(std::string) (bmi_lgar.cxx:39)
==110617==    by 0x4038B6: main (bmi_main_lgar.cxx:50)
==110617==
==110617== Conditional jump or move depends on uninitialised value(s)
==110617==    at 0x5E2ED91: __ieee754_pow_sse2 (e_pow.c:96)
==110617==    by 0x5E403EB: pow (w_pow.c:27)
==110617==    by 0x4172E0: calc_theta_from_h(double, double, double, double, double, double) (soil_funcs.cxx:149)
==110617==    by 0x41530B: lgar_read_vG_param_file(char const*, int, double, soil_properties_*) (lgar.cxx:2299)
==110617==    by 0x40EC95: InitFromConfigFile(std::string, model_state*) (lgar.cxx:542)
==110617==    by 0x40C935: lgar_initialize(std::string, model_state*) (lgar.cxx:77)
==110617==    by 0x40748C: BmiLGAR::Initialize(std::string) (bmi_lgar.cxx:39)
==110617==    by 0x4038B6: main (bmi_main_lgar.cxx:50)
==110617==

*********************************************************
-------------------- Simulation Summary -----------------
------------------------ Mass balance -------------------
Initial water in soil     =  45.1158503556 cm
Total precipitation       =  99.0854000000 cm
Total infiltration        =  88.8044496039 cm
Final water in soil       =  45.6315405858 cm
Surface ponded water      =   0.0000000000 cm
Surface runoff            =  10.2809503961 cm
GIUH runoff               =  10.2809503961 cm
Total percolation         =   0.0000000000 cm
Total AET                 =  88.2887593432 cm
Total PET                 = 147.9532940266 cm
Total discharge (Q)       =  10.2809503961 cm
Vol change (calibration)  =   0.0000000000 cm
Global balance            =   3.057077e-08 cm

*********************************************************
-------------------- Simulation Summary -----------------
------------------------ Mass balance -------------------
Initial water in soil     =  45.1158503556 cm
Total precipitation       =  99.0854000000 cm
Total infiltration        =  88.8044496039 cm
Final water in soil       =  45.6315405858 cm
Surface ponded water      =   0.0000000000 cm
Surface runoff            =  10.2809503961 cm
GIUH runoff               =  10.2809503961 cm
Total percolation         =   0.0000000000 cm
Total AET                 =  88.2887593432 cm
Total PET                 = 147.9532940266 cm
Total discharge (Q)       =  10.2809503961 cm
Vol change (calibration)  =   0.0000000000 cm
Global balance            =   3.057077e-08 cm
Time                      =   80.11 sec
==110617==
==110617== HEAP SUMMARY:
==110617==     in use at exit: 6,703,968 bytes in 279,283 blocks
==110617==   total heap usage: 797,221 allocs, 517,938 frees, 39,024,780 bytes allocated
==110617==
==110617== 24 bytes in 1 blocks are possibly lost in loss record 12 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BC6: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1226)
==110617==    by 0x408932: BmiLGAR::Update() (bmi_lgar.cxx:357)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== 24 bytes in 1 blocks are possibly lost in loss record 13 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BE4: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1227)
==110617==    by 0x408932: BmiLGAR::Update() (bmi_lgar.cxx:357)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== 5,736 bytes in 239 blocks are definitely lost in loss record 32 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BC6: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1226)
==110617==    by 0x4083EA: BmiLGAR::Update() (bmi_lgar.cxx:268)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== 5,736 bytes in 239 blocks are definitely lost in loss record 33 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BE4: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1227)
==110617==    by 0x4083EA: BmiLGAR::Update() (bmi_lgar.cxx:268)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== 3,344,688 bytes in 139,362 blocks are definitely lost in loss record 34 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BC6: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1226)
==110617==    by 0x408932: BmiLGAR::Update() (bmi_lgar.cxx:357)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== 3,345,216 bytes in 139,384 blocks are definitely lost in loss record 35 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BE4: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1227)
==110617==    by 0x408932: BmiLGAR::Update() (bmi_lgar.cxx:357)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== LEAK SUMMARY:
==110617==    definitely lost: 6,701,376 bytes in 279,224 blocks
==110617==    indirectly lost: 0 bytes in 0 blocks
==110617==      possibly lost: 48 bytes in 2 blocks
==110617==    still reachable: 2,544 bytes in 57 blocks
==110617==         suppressed: 0 bytes in 0 blocks
==110617== Reachable blocks (those to which a pointer was found) are not shown.
==110617== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==110617==
==110617== For counts of detected and suppressed errors, rerun with: -v
==110617== Use --track-origins=yes to see where uninitialised values come from
==110617== ERROR SUMMARY: 96 errors from 11 contexts (suppressed: 0 from 0)
hellkite500 commented 3 months ago

@stcui007 That config triggered a slightly different code path which had some temporary allocations that weren't free'd correctly. I ran the config and was able to fix up that leak as well in e652bb2.

stcui007 commented 3 months ago

Good work!

On Thu, Jun 13, 2024 at 9:41 AM Nels @.***> wrote:

@stcui007 https://github.com/stcui007 That config triggered a slightly different code path which had some temporary allocations that weren't free'd correctly. I ran the config and was able to fix up that leak as well in e652bb2 https://github.com/NOAA-OWP/LGAR-C/pull/28/commits/e652bb2582e391adee7e4147a2de6c674c664fc3 .

— Reply to this email directly, view it on GitHub https://github.com/NOAA-OWP/LGAR-C/pull/28#issuecomment-2165876628, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACA4SRI5AAAMU5AVOXLW3IDZHGVRPAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRVHA3TMNRSHA . You are receiving this because you were mentioned.Message ID: @.***>

stcui007 commented 2 months ago

This reminds me a similar thing. It seems that in lgar.cxx, it prints our "Mass balance" info every time step. For multi-processing jobs it does this for every process. This becomes an issue for large scale processing, on CONUS, for example, not only slows down the computation but also write out a big file if you use nohup. Just wonder if this part can be optionally made inactive when running in the framework.

On Wed, Jul 3, 2024 at 2:13 PM PeterLaFollette-NOAA < @.***> wrote:

@.**** commented on this pull request.

Awesome work everyone! Great that the memory leak issues have been so thoroughly identified / addressed. In future developments, we'll make sure we free dynamically allocated linked lists.

I've tested the changed code in standalone and in ngen, running multiple catchments from the same realization, and for the most part things seem to be good. The one change I would suggest is removing global_mass_balance from bmi_main_lgar.cxx because it is also called in Finalize (see my comment), so the simulation results are currently printed twice. Please also note that there has been an accepted PR since this one, so I think rebasing might be necessary. Aside from these minor things, this looks great.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-OWP/LGAR-C/pull/28#pullrequestreview-2157179941, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACA4SRPWZBRQZBMCYO53UW3ZKRENNAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCNJXGE3TSOJUGE . You are receiving this because you were mentioned.Message ID: @.***>

peterlafollette commented 2 months ago

I'm wondering, what is the verbosity set to in the lgar configs you're using? The options are:

verbosity=none verbosity=low verbosity=high

and if verbosity is set to none then there should only be printing when the model completes (and not for every time step). The verbosity is set to none in the Phillipsburg and Bushland examples in the configs folder.

On Wed, Jul 3, 2024 at 1:01 PM Shengting Cui @.***> wrote:

This reminds me a similar thing. It seems that in lgar.cxx, it prints our "Mass balance" info every time step. For multi-processing jobs it does this for every process. This becomes an issue for large scale processing, on CONUS, for example, not only slows down the computation but also write out a big file if you use nohup. Just wonder if this part can be optionally made inactive when running in the framework.

On Wed, Jul 3, 2024 at 2:13 PM PeterLaFollette-NOAA < @.***> wrote:

@.**** commented on this pull request.

Awesome work everyone! Great that the memory leak issues have been so thoroughly identified / addressed. In future developments, we'll make sure we free dynamically allocated linked lists.

I've tested the changed code in standalone and in ngen, running multiple catchments from the same realization, and for the most part things seem to be good. The one change I would suggest is removing global_mass_balance from bmi_main_lgar.cxx because it is also called in Finalize (see my comment), so the simulation results are currently printed twice. Please also note that there has been an accepted PR since this one, so I think rebasing might be necessary. Aside from these minor things, this looks great.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-OWP/LGAR-C/pull/28#pullrequestreview-2157179941,

or unsubscribe < https://github.com/notifications/unsubscribe-auth/ACA4SRPWZBRQZBMCYO53UW3ZKRENNAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCNJXGE3TSOJUGE>

. You are receiving this because you were mentioned.Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/NOAA-OWP/LGAR-C/pull/28#issuecomment-2207128954, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOI6MQHNXH25XF7YLFAZ5HDZKRKCZAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBXGEZDQOJVGQ . You are receiving this because your review was requested.Message ID: @.***>

stcui007 commented 2 months ago

The verbosity is set to none. The config was generated for me by Ahmad. One example is as follows.

verbosity=none soil_params_file=inputsNL/lasam/vG_default_params.dat layer_thickness=200.0[cm] initial_psi=2000.0[cm] timestep=300[sec] endtime=1000[hr] forcing_resolution=3600[sec] ponded_depth_max=0[cm] use_closed_form_G=false layer_soil_type=4 wilting_point_psi=15495.0[cm] field_capacity_psi=340.9[cm]

If what you said is true, I should take another look before saying too much. But I am pretty sure when I commented out these print statements, the output file becomes much smaller.

On Wed, Jul 3, 2024 at 3:23 PM PeterLaFollette-NOAA < @.***> wrote:

I'm wondering, what is the verbosity set to in the lgar configs you're using? The options are:

verbosity=none verbosity=low verbosity=high

and if verbosity is set to none then there should only be printing when the model completes (and not for every time step). The verbosity is set to none in the Phillipsburg and Bushland examples in the configs folder.

On Wed, Jul 3, 2024 at 1:01 PM Shengting Cui @.***> wrote:

This reminds me a similar thing. It seems that in lgar.cxx, it prints our "Mass balance" info every time step. For multi-processing jobs it does this for every process. This becomes an issue for large scale processing, on CONUS, for example, not only slows down the computation but also write out a big file if you use nohup. Just wonder if this part can be optionally made inactive when running in the framework.

On Wed, Jul 3, 2024 at 2:13 PM PeterLaFollette-NOAA < @.***> wrote:

@.**** commented on this pull request.

Awesome work everyone! Great that the memory leak issues have been so thoroughly identified / addressed. In future developments, we'll make sure we free dynamically allocated linked lists.

I've tested the changed code in standalone and in ngen, running multiple catchments from the same realization, and for the most part things seem to be good. The one change I would suggest is removing global_mass_balance from bmi_main_lgar.cxx because it is also called in Finalize (see my comment), so the simulation results are currently printed twice. Please also note that there has been an accepted PR since this one, so I think rebasing might be necessary. Aside from these minor things, this looks great.

— Reply to this email directly, view it on GitHub < https://github.com/NOAA-OWP/LGAR-C/pull/28#pullrequestreview-2157179941>,

or unsubscribe <

https://github.com/notifications/unsubscribe-auth/ACA4SRPWZBRQZBMCYO53UW3ZKRENNAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCNJXGE3TSOJUGE>

. You are receiving this because you were mentioned.Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/NOAA-OWP/LGAR-C/pull/28#issuecomment-2207128954, or unsubscribe < https://github.com/notifications/unsubscribe-auth/AOI6MQHNXH25XF7YLFAZ5HDZKRKCZAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBXGEZDQOJVGQ>

. You are receiving this because your review was requested.Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/NOAA-OWP/LGAR-C/pull/28#issuecomment-2207182928, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACA4SRODOD67WRC72TPUZKLZKRMSVAVCNFSM6AAAAABJHN7A4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBXGE4DEOJSHA . You are receiving this because you were mentioned.Message ID: @.***>

hellkite500 commented 2 months ago

This reminds me a similar thing. It seems that in lgar.cxx, it prints our "Mass balance" info every time step. For multi-processing jobs it does this for every process. This becomes an issue for large scale processing, on CONUS, for example, not only slows down the computation but also write out a big file if you use nohup. Just wonder if this part can be optionally made inactive when running in the framework.

This should be tracked in a separate issue.