Closed obin1 closed 1 year ago
Hello Obin,
Thanks for finding the reason for the segmentation fault problem!
Static memory allocation has always been a problem in KPP: The code crashes if not enough memory has been allocated, and the code also crashes when trying to allocate more memory than available. A while ago, I opened this github issue: https://github.com/KineticPreProcessor/KPP/issues/6
So far, I haven't been brave enough to implement dynamic memory allocation. My knowledge about C is limited, and I'm afraid of creating unwanted side effects.
I'm completely in favor of introducing dynamic memory allocation. However, I think we will have to carefully check that implementation before adopting it into the official distribution:
Are all results identical?
Are all array indices within the allowed range?
Are there any memory leaks?
I'm happy to help with these tasks; however, as mentioned above, my knowledge of C is limited :-(
Hi Rolf! I'm no C expert either, but as far as memory leaks go, I think these variable-length arrays are safe and will be deallocated when leaving the scope of GenerateJacReactantProd(): https://gcc.gnu.org/onlinedocs/gcc/Variable-Length.html
So far, I have verified that this produces identical f90 code for small_strato on a Linux cluster (I don't have a good way to verify this on my Mac, due to the nature of the stack overflow for this setting). I think Bob @yantosca is going to check this potential fix in the KPP C-I tests.
@obin1 You are correct that 65520 KB is a hard limit for MacOS. Probably because of some BSD-specific thing.
I ran the C-I tests on my MacBook Air and Linux laptop. With the fixes in this PR, the mechanism no longer encounters the segfault, but there are differences, e.g. in the C_rk test, where the Mac test has all NaN values while the Linux test does not.
Output of C-I tests on Linux:
Output of C-I tests on MacOS X Ventura:
Differences: Linux (<) vs Mac (>)
Let me know what you think.
The error message says that ‘ErrOld’ and ‘Hacc’ are uninitialized. Is it possible that the variables are initialized with zero on Linux and with a random number on the Mac?
@RolfSander that could be. I'll take a look.
@RolfSander: I tried setting those to ZERO, but we still get the NaN values. The issue is in the runge_kutta.c integrator. Not sure how much time we want to spend on this, given that not too many people are using the C interface.
I agree. Fixing the C integrator is not top priority. It would be more interesting to check if KPP with the dynamic memory allocation still generates identical Fortran files.
On the Linux environment I'm using (Discover), this generates identical f90 code for small_strato with STOICMAT. The two things I tried that are zero-diff are: 1) small_strato with limit stacksize 486208, and 2) small_strato with the changes in this PR.
It's great to see that you get identical results for the small_strato mechanism with STOICMAT :-)
I will try it with my own (pretty big) chemical mechanism.
I'd like to come back to the question of illegal array indices. Since we make some arrays smaller, it would be good to check that our array indices are still within the proper bounds. AFAIK, there aren't any gcc built-in array bounds checks at runtime. Although "-fbounds-check" sounds promising, my gcc man page says that it is only implemented for Fortran and Java :-(
On github, I found a tool called libcrunch:
https://github.com/stephenrkell/libcrunch/
Would that be suitable for us?
Even better: https://cppcheck.sourceforge.io/. I've just downloaded this from the AUR onto my Linux laptop and will try to run it on the code.
So I was able to run cppcheck pretty easily. You run it on each file. I just did a couple of for loops and directed the output to log files for each C file:
for f in $KPP_HOME/src/*.c; do cppcheck $f > ./$(basename $f).log 2>&1; done
for f in $KPP_HOME/int/*.c; do cppcheck $f > ./$(basename $f).log 2>&1; done
for f in $KPP_HOME/util/*.c; do cppcheck $f > ./$(basename $f).log 2>&1; done
The output for runge_kutta.c was:
Checking /home/bob/work/KPP3/int/runge_kutta.c ...
/home/bob/work/KPP3/int/runge_kutta.c:1362:4: error: Array index out of bounds; 'ISTATUS' buffer size is 0 and it is accessed at offset 24. [ctuArrayIndex]
ISTATUS[Nsol]++;
^
/home/bob/work/KPP3/int/runge_kutta.c:786:12: note: Calling function RK_Solve, 11th argument is uninitialized
RK_Solve(N,H,E1,IP1,E2R,E2I,IP2,DZ1,DZ2,DZ3,ISTATUS);
^
/home/bob/work/KPP3/int/runge_kutta.c:1362:4: note: Using argument ISTATUS
ISTATUS[Nsol]++;
^
but also some of the core files have out-of-bounds reported as well, such as in this block in code.c:
int DefineVariable( char * name, int t, int bt, int maxi, int maxj, char * comment, int attr )
{
  int i;
  VARIABLE * var;

  for( i = 0; i < MAX_VAR; i++ )
    if( varTable[ i ] == 0 ) break;
  if( varTable[ i ] != 0 ) {
    printf("\nVariable Table overflow");
    return -1;
  }
So, because the for statement does not have a bracket, only the next statement (the if with the break) is in the loop. Then, when the loop runs to completion without breaking, i is MAX_VAR, so varTable[i] is out of bounds. I might open a separate PR for this.
Oops, in the prior code, I think varTable[i] is out of bounds on purpose.
Hmm, I'm not sure if I understand that piece of code...
The question here seems to be whether the for loop was ended by the break statement, or whether it continued to the end. Right?
Couldn't that be tested with a much simpler:
if( i == MAX_VAR ) {...
???
I think you are right @RolfSander. I can try to make a fix for that.
While you're working on this: could you mention MAX_VAR in the error message and also show its value?
I might have a potential solution for this issue: https://github.com/KineticPreProcessor/KPP/issues/64
The segfault doesn't occur in the generation of the sparse stoichiometric matrix, but in the call to GenerateJacReactantProd(). Switching a few static arrays to smaller, dynamic arrays (allocated on the heap) fixed the stack overflow problem on my Mac.
Setting ulimit -s unlimited on my Mac sets the stack size to 65520 KB, which seems to be a hard limit for my system. This isn't enough: on a Linux system I tried, small_strato needed a stack size of 486208 KB to run with #STOICMAT ON. If this is a valid fix, it might be relevant for Linux users as well as Mac users: maybe the STOICMAT option doesn't need a huge stack size.