firemodels / fds

Fire Dynamics Simulator
https://pages.nist.gov/fds-smv/

Case with ULMAT crashes due to memory issues #11403

Closed FireResearch-BK closed 1 year ago

FireResearch-BK commented 1 year ago

We have a case in which we are trying to simulate the burning of vehicle components.

With FDS 6.7.9-933 (and the standard 6.7.9 release), the ULMAT solver and a single mesh of approx. 1.1 million cells, the case crashes after 800 seconds with a ULMAT_H_MATRIX_LUDCMP PARDISO Num Factor: The following ERROR was detected: -2 error message.

According to the Intel PARDISO user guide (page 12), error code "-2" means not enough memory.

We have restarted the case from a restart file before the crash and noticed that RAM usage for FDS was around 6.4 GB. Our workstation has 64 GB of RAM and a 64 bit OS, so there is more than enough available memory for FDS.

Further investigation determined that Intel's MKL library, of which PARDISO is a part, has two available modes of operation, namely in-core (IC) and out-of-core (OOC), for the construction and handling of the pressure solver matrix. FDS uses the default mode, which is IC and, as far as we can tell, limited to a maximum of 2000 MB per core. We assume that the increasing number of non-zero values in the matrix due to burn-away pushes the memory usage of the case beyond this limit, leading to a crash.

The PARDISO configuration within pres.f90 (line 3818 ff) of the FDS source code sets all flags to zero (except those specifically altered). This also affects the corresponding IC/OOC control flag (iparm(60) according to the MKL reference manual), which is therefore locked to IC, limiting memory usage. With IPARM(60) = 1, as we understand it, PARDISO would switch between the (faster) IC and (slower) OOC mode depending on the values of the MKL_PARDISO_OOC_MAX_CORE_SIZE and MKL_PARDISO_OOC_MAX_SWAP_SIZE environment variables. This switching may solve the memory issue with large cases and ULMAT.

Is there a chance to set IPARM(60) = 1 by default in future FDS releases and/or increase the default maximum memory for IC mode (if possible)?
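The mode selection described above can be sketched as a toy model, for illustration only. This is not MKL code: `pardiso_mode` and its arguments are hypothetical names, the three flag values reflect one reading of the MKL reference manual entry for iparm(60), and 2000 MB is taken as the default of MKL_PARDISO_OOC_MAX_CORE_SIZE.

```python
# Toy model of PARDISO's in-core / out-of-core choice (assumption, not MKL code).
DEFAULT_MAX_CORE_MB = 2000  # assumed default of MKL_PARDISO_OOC_MAX_CORE_SIZE


def pardiso_mode(iparm60, factor_mb, max_core_mb=DEFAULT_MAX_CORE_MB):
    """Return "IC" or "OOC" for a factorization needing factor_mb megabytes.

    iparm60     -- value of the IC/OOC control flag (0, 1 or 2)
    factor_mb   -- hypothetical peak memory needed for the matrix factors
    max_core_mb -- value of MKL_PARDISO_OOC_MAX_CORE_SIZE in megabytes
    """
    if iparm60 == 0:
        return "IC"  # always in-core; the environment variable is not consulted
    if iparm60 == 1:
        # in-core while the factors fit under the limit, out-of-core otherwise
        return "IC" if factor_mb < max_core_mb else "OOC"
    if iparm60 == 2:
        return "OOC"  # always out-of-core
    raise ValueError(f"unsupported iparm(60) value: {iparm60}")


for flag, need, limit in [(0, 5000, 2000), (1, 1500, 2000), (1, 5000, 2000), (1, 5000, 6000)]:
    print(f"iparm(60)={flag}, need={need} MB, limit={limit} MB -> {pardiso_mode(flag, need, limit)}")
```

In this model the environment variables only matter once IPARM(60) is nonzero, which is why the request is about the flag and not only about the memory limit.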

rmcdermo commented 1 year ago

Is it possible to give us a case to test? Even if you can generate the single mesh case that fails, it would help make sure we are on the same page. Thanks.

marcosvanella commented 1 year ago

@FireResearch-BK this is an interesting suggestion. We can try this IPARM[60] as a way to alleviate the maximum memory constraint. Yet, ULMAT is recommended for meshes of up to 50^3 cells (125K pressure unknowns) in the User Guide. Fitting a million cells in a single mesh will make the calculation too slow. You can try splitting your domain into 8 meshes.
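The arithmetic behind the mesh recommendation can be checked with a short sketch. It assumes the roughly 1.1 million cells reported for this case, an even split across meshes, and that cells per mesh is a fair stand-in for pressure unknowns; the function names are hypothetical.

```python
import math

# Guideline quoted above: ULMAT is recommended for meshes of up to 50^3 cells.
RECOMMENDED_MAX_CELLS = 50 ** 3  # 125,000


def cells_per_mesh(total_cells, n_meshes):
    """Cells per mesh for an even split (assumption: all meshes equal in size)."""
    return total_cells / n_meshes


def min_meshes(total_cells, max_cells=RECOMMENDED_MAX_CELLS):
    """Smallest number of equal meshes that keeps each one at or below max_cells."""
    return math.ceil(total_cells / max_cells)


TOTAL_CELLS = 1_100_000  # approximate cell count reported for this case

print("cells per mesh with 8 meshes:", cells_per_mesh(TOTAL_CELLS, 8))
print("meshes needed to stay within the guideline:", min_meshes(TOTAL_CELLS))
```

An even 8-way split gives about 137,500 cells per mesh, close to the 50^3 guideline and far below the single-mesh count.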

IFAB-AS commented 1 year ago

We used ULMAT and a single mesh for this case to get a better understanding of why the simulations crash at some point. It is a model with many PRESSURE ZONES that crashes every time with other settings (the pressure increases in the zones); for this reason ULMAT was used. We also tested other suggestions:

The plan for these tests is to obtain a working solution (60 minutes of simulated time with many PRESSURE ZONES) for this test case by using only the less problematic parameters such as THICKEN, many pressure iterations, 1 mesh etc., and then to change the parameters for comparison with the ULMAT case (e.g. BACKING=EXPOSED, VOID or using thin obstacles etc.). With this procedure, we hope to get a better understanding of what is actually crashing the simulations and what needs to be changed before starting the model. Or, even better, how to improve the FDS code.

As the model is under NDA, I think we cannot share this specific case. I restarted the same model after running set MKL_PARDISO_OOC_MAX_CORE_SIZE = 6000 before starting the simulation, but I think that this will not have an effect, as IPARM(60)=0 is currently used and PARDISO might still use the 2000 MB limit. The run takes 3-4 days to reach the crash, so we will probably give you an update next week. We can also try to build a model that we can share.
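One way to pass the variable to a run without relying on the shell is to build the environment explicitly. This is a minimal sketch: `fds_environment` is a hypothetical helper, and the executable name and input file in the commented launch line are placeholders. Note that in cmd.exe the spaces around `=` in `set NAME = value` become part of the variable name and value, so the name is written here without spaces.

```python
import os

OOC_MAX_CORE = "MKL_PARDISO_OOC_MAX_CORE_SIZE"  # variable name as used in this thread


def fds_environment(max_core_mb, base_env=None):
    """Return a copy of the environment with the OOC core-size limit set (in MB)."""
    env = dict(os.environ if base_env is None else base_env)
    env[OOC_MAX_CORE] = str(int(max_core_mb))  # environment values must be strings
    return env


env = fds_environment(6000)
print(OOC_MAX_CORE, "=", env[OOC_MAX_CORE])

# Hypothetical launch; "fds" and "case.fds" are placeholders for the real
# executable and input file:
# import subprocess
# subprocess.run(["fds", "case.fds"], env=env, check=True)
```

The value is only expected to matter if the solver is built with IPARM(60) set to a nonzero value.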

marcosvanella commented 1 year ago

A model similar to what you have would be very handy to test what is going on. You can send it to us directly (nist emails) too. I'm also interested in the ULMAT not starting with multiple meshes for this case. Are you using 6.7.9 release?


rmcdermo commented 1 year ago

Kevin recently added a new feature, MINIMUM_ZONE_VOLUME, on the MISC line. This fills void spaces below a certain volume. Together with the new treatment of normal velocity components at wall cells, we hope this helps avoid zone-related instabilities without resorting to ULMAT. Hopefully, you can just set your VELOCITY_TOLERANCE to something reasonable.

This feature is in the latest test bundle available here. We are testing on simple cases, but it would be good if you could try this on a real-world example to see if we are indeed improving things. Thanks

IFAB-AS commented 1 year ago

It is currently running with FDS 6.7.9-933, as we had issues with rising pressure in the PRESSURE ZONES, but we also tested the case with FDS 6.7.9. I also managed to start the case with 8 meshes this morning. Whether it starts might depend on where the mesh borders are (they should avoid cutting through the pressure zones), but usually this cannot be arranged due to the number of ZONES. For this reason we are trying to make it work with only one mesh and ULMAT, which would make using the imported models much easier.

MINIMUM_ZONE_VOLUME sounds interesting and we will test it in the next few days. I will also send Marcos the test model with some additional explanation.

FireResearch-BK commented 1 year ago

As a sidenote, I've managed to set up the build chain for FDS (the instructions really help once one manages to find them), so we can also patch and compile FDS ourselves to test the initial suggestion (IPARM(60) = 1). Getting results may, however, take a while.

P.S: Is there some other way than Google Drive to get a hold of nightly builds of the FDS documentation? The link posted in Readme.md requires a login and I do not have a Google account, but I'd like to read up on MINIMUM_ZONE_VOLUME.

mcgratta commented 1 year ago

Does it require a login? I thought the link was completely open?

FireResearch-BK commented 1 year ago

> Does it require a login? I thought the link was completely open?

It should be open, yes. But clicking on the link sends you to a login window. (Using Firefox on both Windows and Linux.) This also happens for the Drive hosting the nightly FDS builds, by the way.

IIRC, Google changed the formatting for links and some security settings a few years ago and this may have something to do with it. So it may be worth checking permissions for the Drives and/or generating new share links.

mcgratta commented 1 year ago

https://drive.google.com/drive/folders/0B_wB1pJL2bFQcURod1UyZTJUaEE?resourcekey=0-CW_02fKmi4jxSqW_n9_G7w&usp=sharing

Does this work?

FireResearch-BK commented 1 year ago

> Does this work?

Yes, that link works just fine.

mcgratta commented 1 year ago

Yes, but I think it is old. Let me think about this some more.

mcgratta commented 1 year ago

https://drive.google.com/drive/folders/1X-gRYGPGtcewgnNiNBuho3U8zDFVqFsC?usp=sharing

Does this work?

IFAB-AS commented 1 year ago

@mcgratta Yes, this link works without an account.

@marcosvanella and @rmcdermo I sent Marcos a test case that uses ULMAT with multiple meshes and crashes during initialization with the error ULMAT_H_MATRIX_LUDCMP PARDISO Sym Factor: The following ERROR was detected: -1 and forrtl: severe (157) Program Exception - access violation. We saw this error several times last year for various models, but I cannot share those on this platform. According to the PARDISO error table, -1 indicates inconsistent input, but the same case starts when the standard solver is used. We tested FDS 6.7.9, 6.7.9-933 and the latest 6.7.9-1550, and the -1 error at startup occurred in all versions for the test case. Other cases run with multiple meshes, so we do not know where the error is.

The other error ULMAT_H_MATRIX_LUDCMP PARDISO Num Factor: The following ERROR was detected: -2 does not occur at the start, but after several days when using ULMAT with 1 Mesh. Marcos has also received our test case for this. This error seems to be related to either burning away of materials and updating the pressure matrix or some internal memory leakage in PARDISO that leads to exceeding the 2000MB limit. Our tests with different FDS versions are still running and will take some time:

As the simulations are all competing for CPU cycles, we will probably give you an update next week. We will test MINIMUM_ZONE_VOLUME at a later stage, as the problem does not seem to be related to pressure zones in the specific test case that we sent to Marcos.
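For sorting log files by failure type, the two messages quoted above can be decoded with a small helper. This is a sketch under assumptions: `explain_pardiso_error` is a hypothetical name, only the two codes discussed in this thread are mapped, and reading "Sym" and "Num" as the symbolic and numerical factorization phases is an interpretation of the message text.

```python
import re

# Meanings of the two codes discussed in this thread (from the PARDISO error table).
PARDISO_ERROR_MEANING = {-1: "input inconsistent", -2: "not enough memory"}

# Assumed reading of the phase abbreviations in the FDS message.
PHASES = {"Sym": "symbolic factorization", "Num": "numerical factorization"}

PATTERN = re.compile(
    r"PARDISO (Sym|Num) Factor: The following ERROR was detected:\s*(-?\d+)"
)


def explain_pardiso_error(line):
    """Return (phase, code, meaning) for an FDS ULMAT error line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(2))
    meaning = PARDISO_ERROR_MEANING.get(code, "see the PARDISO error table")
    return PHASES[match.group(1)], code, meaning


for msg in (
    "ULMAT_H_MATRIX_LUDCMP PARDISO Sym Factor: The following ERROR was detected: -1",
    "ULMAT_H_MATRIX_LUDCMP PARDISO Num Factor: The following ERROR was detected: -2",
):
    print(explain_pardiso_error(msg))
```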

IFAB-AS commented 1 year ago

All 4 simulations have crashed:

marcosvanella commented 1 year ago

Thank you for the report. The one mesh case is still running here, up to 360s. I have not noted an increase in memory usage (it's about 25% of the node mem, 12GB), but it is possible pressure zone changes have not started yet due to burn away. We'll also test a case with IPARM[60]=1, to see if it goes further. I reproduced the multiple mesh issue, will look further in the next couple of days. About the compile flag, what OS and MPI are you using for compilation? Seems the MKL lib location is not being picked up by the compiler.

IFAB-AS commented 1 year ago

In the meantime I managed to get the compile flag working with the IPARM(60)=1 code and ULMAT by using the Intel oneAPI base/HPC toolkits. We previously used the Intel Fortran compiler and the Intel MPI library, not the complete toolkits, so this was the problem. I restarted the case with 1 mesh and with 8 meshes, using ULMAT and IPARM(60)=1. Our OS is Windows 10.

mcgratta commented 1 year ago

We install the base toolkit first, then the HPC toolkit. That seems to be all that is needed to compile all our codes.

marcosvanella commented 1 year ago

@IFAB-AS do a repo update and try the cases you sent me. Please run the single mesh case in which PARDISO was running out of memory, without changing IPARM[60]. The case is still running here, but our cluster is very slow. I added code to deallocate the PARDISO arrays in FDS (we didn't have that).

FireResearch-BK commented 1 year ago

@mcgratta The oneAPI base toolkit and HPC toolkit work just fine for compiling FDS on Windows. It would help to make this clearer in the FDS compilation instructions, because at the moment they imply that one should pick the separately available Fortran and MPI packages from Intel.

@marcosvanella We are going to try your fix in the next few days, as we're still testing other attempts at a solution.

mcgratta commented 1 year ago

Done, I edited the instruction page.

IFAB-AS commented 1 year ago

@marcosvanella A few tests have finished, others are still running. All started with set MKL_PARDISO_OOC_MAX_CORE_SIZE = 6000 in command line before starting the FDS file

We will keep you updated.

marcosvanella commented 1 year ago

Thank you @IFAB-AS. I'm not surprised about the out-of-memory crashes, since the memory of the PARDISO arrays was not being released. Did these runs with the -2 error get any further with MKL_PARDISO_OOC_MAX_CORE_SIZE = 6000 than before? Not sure what the other errors are about; they might have to do with permissions to write to the directory the OOC arrays are being dumped to.

FireResearch-BK commented 1 year ago

Update:

@mcgratta : Thanks for the edit. Item 2 of the "Preliminaries" section still mentions "Intel oneAPI "classic" Fortran compiler" and "Intel MPI libraries" though.

marcosvanella commented 1 year ago

The one mesh case without the PARDISO arrays deallocation (6.7.9-1564) has gone up to 621.79 sec on our cluster. In two solver re-initializations due to burn away, memory in the node jumped from 25% to 37.5%. The first burn away event, at 618.99 sec, changed the largest pressure zone (840K unknowns) by one cell. Last week's commit was intended to fix this.
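The memory figures above allow a rough extrapolation. The sketch assumes that every re-initialization without deallocation adds the same amount of memory, which the two observed events do not prove; the function name is hypothetical.

```python
import math


def extrapolate_reinits(start_pct, current_pct, reinits_so_far, limit_pct=100.0):
    """Linear extrapolation of node memory use (assumption: constant growth).

    Returns (growth per re-initialization in percentage points,
             number of further re-initializations that still fit).
    """
    growth = (current_pct - start_pct) / reinits_so_far
    remaining = math.floor((limit_pct - current_pct) / growth)
    return growth, remaining


growth, remaining = extrapolate_reinits(25.0, 37.5, 2)
print(f"growth per re-initialization: {growth} percentage points")
print(f"further re-initializations before the node is full: {remaining}")
```

Under that assumption only about ten more burn-away events would fit in memory, which would be consistent with a crash once burn away becomes frequent.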

> FDS 6.7.9-1591 with 1 mesh: Crashed after 92 seconds with BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = RANK 0 PID 1700 RUNNING AT [Hostname] = EXIT STATUS: -1073741819 (c0000005)

I found online that c0000005 in Windows might be due to pointer issues. There are no ULMAT restarts due to burn away up to 92 sec, so it is not clear what happened. Alex, please rerun this case on your system to confirm it's happening all the time. I'm going to run this case here to see what I find.

FireResearch-BK commented 1 year ago

> The one mesh case without the Pardiso arrays deallocation (6.7.9-1564) has gone up to 621.79 sec in our cluster. In two solver re-initializations due to burn away, memory in the node jumped from 25% to 37.5%. First burn away event at 618.99 sec, changed the largest pressure zone, 840K unknowns, by one cell. Last week's commit was intended to fix this.

As "FDS 6.7.9-1591 with 8 meshes" is still running and past peak HRRPUV, the fix may just have worked. Fingers crossed.

> FDS 6.7.9-1591 with 1 mesh: Crashed after 92 seconds with BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = RANK 0 PID 1700 RUNNING AT [Hostname] = EXIT STATUS: -1073741819 (c0000005)

> I found online that c0000005 in Windows might be due to pointer issues. There are no ULMAT restarts due to burn away up to 92 sec, not clear what happened. Alex, please rerun this case in your system to confirm it's happening all the time. I'm going to run this case here to see what I find.

We've already restarted this case as well as the "FDS 6.7.9-1550 compiled with IPARM(60)=1 and 1 mesh" case from a restart point. So far, both continue to run normally.

IFAB-AS commented 1 year ago

We stopped the simulations with FDS 6.7.9-1550 compiled with IPARM(60)=1 and compiled with IPARM(2)=0 (both had 1 mesh) because they were too slow. The run with FDS 6.7.9-1591 with 8 meshes reached 1091 s, and then Windows decided it was time to restart ;-). So far, this case seems to work, but we were not able to restart it (forrtl: severe (157): Program Exception - access violation). We also tested another case with 4 meshes that was crashing at the start (Marcos received this by e-mail), and this case now also starts with 6.7.9-1591.

marcosvanella commented 1 year ago

Hi Alex,

> The run with FDS 6.7.9-1591 with 8 meshes reached 1091 s and Windows decided it is time to restart ;-). So far, this case seems to work, but we were not able to restart it (forrtl: severe (157): Program Exception - access violation).

Please try two things for this restart.

  1. Comment out the &PRES line and try restarting. This is to check whether we see the exception regardless of the pressure solver.
  2. Compile the impi_intel_win_dv target and run the restart case with ULMAT. This is to see if we get more stderr information on this access violation.

The single mesh case with latest source has made it to 365sec here.

A point to keep in mind is that every time the domain changes due to a door opening or cells burning away, the ULMAT solver needs to be restarted and the Poisson matrix factorizations for the new zones need to be performed. This can cost from 10 to 100 times the solve cost. If this is required every time step (say one cell burns away per step), it can make the code very slow.
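The effect of repeated refactorization can be put into numbers with a simple cost model. This is a sketch under assumptions: costs are counted in units of one pressure solve, only the pressure solver is considered (not the rest of the time step), and the function names are hypothetical.

```python
def pressure_cost(n_steps, n_refactorizations, factor_to_solve_ratio):
    """Total pressure-solver cost in units of one solve.

    Each time step costs one solve; each refactorization costs
    factor_to_solve_ratio solves (10 to 100 according to the comment above).
    """
    return n_steps + n_refactorizations * factor_to_solve_ratio


def slowdown(n_steps, n_refactorizations, factor_to_solve_ratio):
    """Cost relative to a run in which the geometry never changes."""
    return pressure_cost(n_steps, n_refactorizations, factor_to_solve_ratio) / n_steps


# One refactorization per step, e.g. one cell burning away every time step.
for ratio in (10, 100):
    print(f"ratio {ratio}: pressure solve becomes {slowdown(1000, 1000, ratio):.0f}x more expensive")

# Occasional changes, e.g. 10 door openings in 1000 steps.
print(f"10 events at ratio 100: {slowdown(1000, 10, 100):.1f}x")
```

In this model a refactorization on every step makes the pressure solve 11 to 101 times more expensive, while a handful of events over the whole run at most doubles it.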

IFAB-AS commented 1 year ago

I tried your 2 suggestions:

  1. Restart with &PRES commented out: the simulation continues normally.
  2. Compile the impi_intel_win_dv target: [image]

I also tested letting the case restart with the standard solver until the next restart point was written, then stopped the simulation and continued with ULMAT. This produces a numerical instability after the restart.

marcosvanella commented 1 year ago

Ok, this tells me there is something not right with RESTART + ULMAT. The state of PRESSURE_ZONEs when setting up the solver initially (different pressure zones due to compartments and pockets used to compute Poisson matrices) has to be the actual state at the RESTART time point. I'll look to see if that is what is happening, we might be able to reproduce this with a simple verification case.

IFAB-AS commented 1 year ago

The crashing of ULMAT after recalculating the pressure zones seems to be solved now; the only remaining problem is the restart of the ULMAT solver (see also https://github.com/firemodels/fds/issues/11449).

Summary of our tests: The PARDISO error -2 was not caused by the RAM management in the Intel MKL library as we thought (OOC and IC mode and iparm parameters), but was related to the garbage collection in FDS, which is fixed in FDS 6.7.9-1591. The forrtl error occurred when other processes were eating up too much memory, causing an FDS process to be killed. We also tested the same case with 1 and 8 meshes (the 1-mesh run is still running), and we got basically the same HRR for this example. This means that cutting through the fire was not an issue in this case.

[image]

FireResearch-BK commented 1 year ago

The core issue is solved as of 6.7.9-1591. Note that restarts with ULMAT are still broken; see #11449.