ORNL-Fusion / aorsa

All ORders Spectral Algorithm (The Original)
MIT License

Memory creep of AORSA on Cori #47

Closed: AreWeDreaming closed this issue 1 year ago

AreWeDreaming commented 2 years ago

Dear AORSA community,

Over the last 2 years I have noticed an enormous increase in AORSA's memory requirements. My first calculations used 128 Cori nodes to handle calculation grids with a resolution of up to 384x384 with nzeta_wdote=51 and nzeta_wdoti=0. While the source code has not changed during this time, such calculations will not run today; instead they terminate with a bus error during the computation of the electron QL operator. Last week I tried to run a calculation with the same wdot settings and a 320x256 grid on 512 nodes. For large mode numbers (89-91) this worked, but anything below that crashes again at the same point. Increasing the number of nodes any further has diminishing returns, and experimenting with it is extremely costly because AORSA only crashes towards the end of the calculation.

Is there anything that can be done to figure out what has happened to AORSA over these 2 years? Is this something people at NERSC could help us with? We are getting towards the helicon experiments, and AORSA is a vital arrow in our quiver. In its current state it is prohibitively expensive exclusively due to its memory requirements, especially for toroidally resolved calculations, which will be needed for many physics validation studies. Please help.

Attachments: batchscript.txt, aorsa2d.txt

ntsujii commented 2 years ago

I have also encountered a similar problem with the present version in the post-processing, which I have not had time to look into (I am actually using an old version right now). I am not sure the problem is the same as yours, but the bug may have existed for a while and only surfaced recently due to compiler updates on the NERSC platform.

I think 128 nodes are enough to handle 320x256. Since the program is already in the post-processing part, I doubt that increasing the number of nodes would help. Can you reproduce the problem with a smaller problem size? That would make debugging easier.

If you are in a hurry, you could turn off the wdot calculation altogether, assuming what you want most is the synthetic PCI.

AreWeDreaming commented 2 years ago

On second thought, I am not even sure this is truly a memory issue. AORSA dies with a bus error, which can be indicative of a memory problem, but the MaxRSS is not too large:

Failed job, m = 88
JobID            MaxRSS 
------------ ---------- 
60355861                
60355861.ba+      0.02G 
60355861.ex+      0.00G 
60355861.0        2.17G 

Successful job, m = 89

JobID            MaxRSS 
------------ ---------- 
60355842                
60355842.ba+      0.02G 
60355842.ex+      0.00G 
60355842.0        2.43G 

This MaxRSS is about half of the 4 GB theoretically available for each job. I also created a 170x170 debug job that I ran on just 4 nodes:

JobID            MaxRSS 
------------ ---------- 
60599755                
60599755.ba+      0.02G 
60599755.ex+      0.00G 
60599755.0        2.69G

This has a higher RSS than any of the jobs above and works just fine. I then put in a stop statement at the point where my large job crashes to check the RSS up to that point.

JobID            MaxRSS 
------------ ---------- 
60601180                
60601180.ba+      0.02G 
60601180.ex+      0.00G 
60601180.0        2.53G  

So the steps beyond the crashing point add just a measly 0.04 GB, which matches my expectation from the allocation calls in lines 7869-7891 of aorsa2dMain.F. The code that crashes is somewhere between lines 7774 and 8817 of aorsa2dMain.F. I found an MKL call to blacs_barrier and a ScaLAPACK call to dgsum2d in ql_myra.f. Besides that, the only other potential problem points are besjc and z_approx in wdot_sum.f90, which I could not immediately track down in the source code. I'll add some more print statements to this code section (a sketch of what I have in mind is below) and then rerun the failing run. That should help narrow down where it dies.
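A minimal, self-contained sketch of such a rank-tagged checkpoint (not actual AORSA code; the program name, subroutine, and labels are made up). It also reads VmRSS from /proc/self/status, which is Linux-specific, so each checkpoint records per-rank memory at the same time:

```fortran
! Sketch only: rank-tagged checkpoints that also report resident set size.
! Build with e.g.  ftn checkpoint_demo.f90  (Cray wrapper)  or  mpif90 checkpoint_demo.f90
program checkpoint_demo
   use mpi
   use iso_fortran_env, only: output_unit
   implicit none
   integer :: rank, ierr

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   call checkpoint(rank, 'before suspect block')
   ! ... suspect code (e.g. the dgsum2d / blacs_barrier region) would go here ...
   call checkpoint(rank, 'after suspect block')

   call MPI_Finalize(ierr)

contains

   ! Print a label plus this process's VmRSS line from /proc/self/status.
   subroutine checkpoint(rank, label)
      integer, intent(in) :: rank
      character(*), intent(in) :: label
      character(len=256) :: line
      integer :: lun, ios

      open(newunit=lun, file='/proc/self/status', action='read', &
           status='old', iostat=ios)
      if (ios == 0) then
         do
            read(lun, '(a)', iostat=ios) line
            if (ios /= 0) exit
            if (line(1:6) == 'VmRSS:') then
               write(output_unit, '("rank ",i0,": ",a,", ",a)') &
                    rank, trim(label), trim(line)
               exit
            end if
         end do
         close(lun)
      end if
      flush(output_unit)   ! so the message survives if the rank dies right after
   end subroutine checkpoint

end program checkpoint_demo
```

The flush after each write matters: buffered output from a rank that later dies can otherwise be lost, making the crash look earlier or later than it really is.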

AreWeDreaming commented 2 years ago

It seems that AORSA dies inside ql_myra.f, probably somewhere after line 251. Please take this with a large grain of salt, because this is where the main MPI task dies, not necessarily where the offending MPI task dies. I don't have much experience with MPI (really none), so I don't know how to debug this further. My attempts to produce a smaller failing run have been unsuccessful, and without a better strategy to diagnose the issue, I cannot in good conscience spend the node hours it takes to get to the point of the crash. Calling on @dlg0 and @jcwright77 for help.
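One way to see which rank actually dies (a minimal standalone sketch, not AORSA code; file names and markers are made up): have every rank write progress markers to its own small log file and flush after each write. After a crash, the ranks whose logs are missing the final marker are the ones that died inside the region:

```fortran
! Sketch only: per-rank progress logs to identify which rank dies, and where.
program per_rank_log_demo
   use mpi
   implicit none
   integer :: rank, ierr, lun
   character(len=32) :: fname

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   ! One log file per rank, e.g. rank_00042.log
   write(fname, '("rank_",i5.5,".log")') rank
   open(newunit=lun, file=fname, action='write', status='replace')

   write(lun, *) 'entering suspect region'
   flush(lun)
   ! ... suspect code, e.g. the part of ql_myra.f after line 251 ...
   write(lun, *) 'left suspect region'
   flush(lun)

   close(lun)
   call MPI_Finalize(ierr)
end program per_rank_log_demo
```

With many thousands of ranks this produces a lot of small files, but for a handful of debug runs that is usually tolerable.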

P.S. I tried compiling AORSA on Perlmutter, since NERSC is offering free CPU hours at the moment, but the number of build errors is crushing. As you can see from the new branch Permutter_compile, I attempted to work through them, but I am now stuck. If there is any interest in this, I'll open a separate issue for it.

dlg0 commented 2 years ago

My suggestion would be to run the smallest offending case in Arm DDT (a useful debugger) to try to track down the actual location of the error. Documentation can be found here: https://docs.nersc.gov/tools/debug/ddt/

dlg0 commented 2 years ago

Additionally, the NERSC support staff really are very good at helping to track down issues at scale. I'd also suggest creating a NERSC ticket that explains exactly how to reproduce the issue and asking that they look into it.

Certainly, I've always wanted to support building and running with the various bounds checks enabled. Perhaps you can enable them for a single file (the one you suspect).

There are also the newer -fsanitize flags for GNU compilers, which I think work with Fortran; you can try various specific flags if out-of-bounds accesses or other memory issues are suspected.
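For illustration, a minimal standalone sketch (not AORSA code; the file name is made up) of the kind of off-by-one error these runtime checks catch, with plausible GNU and Intel build lines noted in the comments:

```fortran
! Sketch only. Possible build lines:
!   gfortran -g -fcheck=bounds oob_demo.f90      (GNU runtime bounds checking)
!   gfortran -g -fsanitize=address oob_demo.f90  (GNU AddressSanitizer)
!   ifort -g -check bounds oob_demo.f90          (Intel bounds checking)
! An unchecked build may appear to run fine, silently corrupt memory, or die
! much later with a bus or segmentation error far from the actual bug.
program oob_demo
   implicit none
   real :: a(10)
   integer :: i

   a = 0.0
   do i = 1, 11        ! off-by-one: the i = 11 iteration writes past a(10)
      a(i) = real(i)
   end do
   print *, sum(a)
end program oob_demo
```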

AreWeDreaming commented 2 years ago

I finally managed to finish my calculations. I avoided this issue by significantly reducing the number of nodes allocated to AORSA: the problem with a 320x256 grid can easily be run on just 64 nodes. Notably, the scaling of AORSA is also very poor in this case. The cases that ran successfully on 512 nodes took 18 minutes, and the cases running on 64 nodes needed just 37 minutes (roughly 9,200 node-minutes versus about 2,400, so the 64-node runs cost about a quarter of the node hours). This might look obvious now, but I was under the impression that I could never run AORSA on so few nodes due to memory limitations.

Addendum: I could not debug AORSA because there is a routine in orbit.f, appropriately called ERROR, that triggers runtime-checking exceptions when I try to debug. In my attempt to fix this in the branch intel_debug I introduced a bug that caused AORSA to crash right on startup. I might have fixed that with 4c20bbfc51080ede0e23e0595e90115a820fb980, but I never got to test it.

AreWeDreaming commented 1 year ago

Closing because Cori reached end of life.