E3SM-Project / ACME-ECP

E3SM MMF for DoE ECP project

Mrnorman/crm/make everything allocated #32

Closed mrnorman closed 6 years ago

mrnorman commented 6 years ago

I've gone through all of the major data and made everything allocated and deallocated every time the CRM is called, rather than leaving the data essentially static as automatic Fortran arrays. For PGI with the OpenACC port, this resolved some wrong-answer bugs I was seeing that were very hard to track down. Hopefully this will help the robustness of the CRM with PGI for master as well.

I also changed nvcols to ncrms and vc to icrm so that the names are clearer.

I don't have time to check with the full ACME-ECP code, but this works successfully with the standalone code. @whannah1, please check with 1mom and 2mom micro if you get the chance.

mrnorman commented 6 years ago

I will mention that there are still some automatic Fortran arrays scattered throughout, and they aren't exactly small. However, this takes care of the vast majority of the data the CRM uses.

mrnorman commented 6 years ago

Cool :). Do you want me to integrate today?

whannah1 commented 6 years ago

Matt, Chris is already working on the integration.

mrnorman commented 6 years ago

Word

crjones-amath commented 6 years ago

@mrnorman @whannah1 My integration attempt failed. Both FSP1V1-TEST and FSP2V1-TEST seg-faulted at the first timestep. There were also NLCOMP diffs for all tests, and they all failed the "BASELINE master" test.

There were merge conflicts. I've updated the attempted merge to branch crjones/crm/allocate-merge in case you want to see if I made any clear mistakes in resolving conflicts.

mrnorman commented 6 years ago

@crjones-amath, did it crash for gnu, pgi, or intel (or all of them)? I'm working on this now on my laptop, hoping to track it down quickly.

crjones-amath commented 6 years ago

@mrnorman For me it crashed with intel on edison. @whannah1 mentioned to me that his debug run on titan crashed as well, presumably with pgi.

mrnorman commented 6 years ago

@whannah1, @crjones-amath I'm getting the following error: "kurant() - the number of cycles exceeded 4." This usually means it's just a wrong answer. My gut tells me this has to do with data persistence / initialization. The data inherently persists between calls in the original code, and it doesn't when we allocate every time. So I'm going to start everything out at zero and see if that fixes the problem (since I do that in the GPU code anyway). Hopefully I don't have to track down any issues with data that the model assumed kept its previous value (which would inherently be a bug, and we already fixed a few of those).
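
The persistence concern in the comment above can be illustrated with a small, language-neutral sketch. This is not the CRM's Fortran and it names no real CRM variables: `crm_step`, `work`, `run_persistent`, and `run_allocated` are hypothetical, and the NaN fill is only a stand-in for the undefined contents of freshly allocated memory. It assumes a routine that accumulates into a work buffer and therefore silently depends on the buffer's previous value.

```python
import math

NAN = float("nan")


def crm_step(work, forcing):
    """Hypothetical CRM call: adds forcing into work and returns the total.

    The += silently assumes work already holds a meaningful value.
    """
    for i, f in enumerate(forcing):
        work[i] += f
    return sum(work)


def run_persistent(forcings):
    # Mimics static/persistent storage: one buffer is reused by every call.
    work = [0.0] * len(forcings[0])
    return [crm_step(work, f) for f in forcings]


def run_allocated(forcings, fill):
    # Mimics allocate/deallocate on every call. `fill` stands in for whatever
    # the fresh memory contains: NaN for garbage, 0.0 for explicit zeroing.
    results = []
    for f in forcings:
        work = [fill] * len(f)
        results.append(crm_step(work, f))
    return results


forcings = [[1.0, 2.0], [3.0, 4.0]]
persistent = run_persistent(forcings)    # second call sees the first call's data
zeroed = run_allocated(forcings, 0.0)    # defined, but carried state is lost
garbage = run_allocated(forcings, NAN)   # uninitialized data poisons the result
```

In this toy setup, zeroing makes the per-call version well defined, but its second result still differs from the persistent version because the carried-over state is gone. Under these assumptions, zero initialization alone would not hide a routine that relies on values kept from the previous call.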

mrnorman commented 6 years ago

Also, this might explain why my standalone model works fine but full ACME-MMF crashes

whannah1 commented 6 years ago

My recent run finally gave an FPE error and a core file that points to line 154 in src/physics/crm/diagnose.F90. I don't see an obvious problem with any of those variables yet, though.

mrnorman commented 6 years ago

Core files usually mean segfault. And I thought FPEs only kill a simulation if you turn on FP traps, right?

mrnorman commented 6 years ago

Ugh, no dice. Build times are < 1min on my laptop, so I'll just re-do the work while testing full ACME the whole time.

whannah1 commented 6 years ago

My run was in debug mode, so it was set up to catch FPEs.

mrnorman commented 6 years ago

Killing this PR and creating a new one based off the branch I just created.