JulesKouatchou opened this issue 4 years ago
@mathomp4 There was a build error: "No configuration was found in your project. Please refer to https://circleci.com/docs/2.0/ to get started with your configuration." Is the CI stuff working properly for CTM? Thanks.
@JulesKouatchou Do I understand correctly that you have changed the run-time environment variables to be appropriate for using MPT? How do we compile with MPT? I thought the default was to compile with Intel MPI.
@mmanyin Until #23 is merged in, there is no way for CircleCI to find a configuration since it only exists on a branch, not master. I had set up CircleCI to follow GEOSctm thinking the config file would get it. Since it might be a while, would you like me to turn off CircleCI following GEOSctm?
Actually I will go ahead with #23 . Sorry for the confusion!
Well, that was unexpected.
@JulesKouatchou When you have a chance can you do a fresh clone of GEOSctm and then a fresh checkout of your branch, and then try running with MPT?
I just did a "resolve conflict" for your branch (so it could merge in) and, weirdly, Git now seems to say that the ctm_setup isn't "new". I mean, it seems to have all the right bits for MAPL 2 on MPT, but... weird.
On the plus side, @mmanyin, it looks like that "resolve conflict" is letting CircleCI run!
@mathomp4 I will and let you know.
@mathomp4 When I do:
git clone git@github.com:GEOS-ESM/GEOSctm.git
cd GEOSctm/
git checkout -b jkGEOSctm_on_SLESS12
checkout_externals
source @env/g5_modules
Intel MPI gets loaded. I need MPT.
Jules,
You'll need to:
cp /gpfsm/dhome/mathomp4/GitG5Modules/SLES12/6.0.4/g5_modules.intel1805.mpt217 @env/g5_modules
to get MPT as an MPI stack
@mathomp4 Here are my steps:
git clone git@github.com:GEOS-ESM/GEOSctm.git
cd GEOSctm
git checkout jkGEOSctm_on_SLESS12
checkout_externals
cp /gpfsm/dhome/mathomp4/GitG5Modules/SLES12/6.0.4/g5_modules.intel1805.mpt217 @env/g5_modules
source @env/g5_modules
Things appear to be fine. I am currently doing a long run to make sure that the code does not crash.
Thanks.
Sounds good! If all works, you can set the appropriate "required label". I'm guessing 0-diff is good enough since your changes can't change results, right?
@mathomp4 This is the first step: I want the code to be able to compile and run. Ideally, I want the same code to compile and run on the SLES11 nodes too (though they will disappear soon). I will then be able to do the comparison.
@mathomp4 My long run did not have any issues. You asked me to copy the file g5_modules.intel1805.mpt217. Is it possible to make it part of the repository? I want the MPT module to be the default for the CTM.
Jules,
We can do that for sure, but then when the hundreds of Skylake nodes come online for general users, they will not be able to use them with the CTM, since MPT is not available there. Intel MPI allows users to run on every node at NCCS.
Before we do that, ctm_run.j should be altered so that if anyone ever tries to run on the Skylakes at NCCS with MPT, the CTM immediately errors out with a non-zero status code, and maybe a note saying what's happening so that the user doesn't try to contact NCCS or the SI Team. I mean, the job will crash anyway, but I think it would be an obscure-looking loader error.
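A guard like that could look roughly like the sketch below. This is a POSIX sh illustration only (ctm_run.j itself is csh), and the function name, the "sky"/"hasw"/"mpt"/"intelmpi" labels, and the argument convention are all hypothetical, not actual ctm_run.j variables:

```shell
#!/bin/sh
# Hypothetical early-exit guard for the run script: refuse to start when
# MPT is paired with Skylake nodes, where MPT is not available, instead
# of letting the job die later with an obscure loader error.
guard_mpt_skylake() {
    proc_type="$1"    # assumed label: "sky" for Skylake, "hasw" for Haswell
    mpi_stack="$2"    # assumed label: "mpt" or "intelmpi"
    if [ "$proc_type" = "sky" ] && [ "$mpi_stack" = "mpt" ]; then
        echo "ERROR: MPT is not available on the NCCS Skylake nodes." >&2
        echo "Rebuild with Intel MPI or request Haswell nodes." >&2
        return 1
    fi
    return 0
}
```

Called near the top of the script, this would stop the job with a clear message before any MPI processes launch.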
@mathomp4 Sorry that I am coming back to this only now. I am wondering if there could be (for now) a flag that sets MPT as the first option and Intel MPI as the second. I am willing to modify the ctm_run.j file if I know what options are available in g5_modules.
@JulesKouatchou I don't think so, not as long as GEOS uses g5_modules. The issue is that it is both a script that is run and a file that is sourced. This severely limits its flexibility because you can break it very easily (for example, you cannot do source g5_modules -option).
If you require MPT, I can create a special branch/tag of ESMA_env for you.
You should also contact NCCS and let them know that Intel MPI does not work for your code. They will be interested in this and would probably want to contact Intel regarding the fault.
I have seen Intel MPI crash during Finalize, when running the GCM under SLES12. @JulesKouatchou please CC me when you contact NCCS about this problem; I will open a case as well, and CC you.
@mathomp4 @mmanyin I have tried to build the simplest test case possible (using Intel MPI on SLES12 nodes) where the code does not exit gracefully. So far I have not duplicated the problem with either a pure MPI program or an ESMF program. I now want to try a code that uses MAPL.
@JulesKouatchou We might have a workaround for the MPI_Finalize issue. I found an MPI command which essentially "turns off error output" and @bena-nasa seemed to be able to show it helped.
We are looking at adding it into MAPL with some good protections so we don't turn off all MPI errors.
@mathomp4 Great! Let me know when the workaround is ready so that I can test it.
Jules, try out MAPL v2.0.6 (aka git checkout v2.0.6 in MAPL)
Note, you're behind on a lot of things in the CTM (in its mepo/externals bits), but v2.0.0 and v2.0.6 are still similar.
@mathomp4 Here is a summary of what happened when I used MAPL v2.0.6.
It seems that MPT might be the option (for now) for the CTM.
@JulesKouatchou Well that's annoying. Can you point me to the output so I can look at the errors?
Also, if you can, can you try one more test? It would be interesting to see if MAPL 2.1 helps at all. Plus you can be the first to try the CTM with it.
For that, you'll want to clone a new CTM somewhere rather than re-use the current one. Then after cloning and doing the mepo/checkout_externals update:
@env to v3.0.0
@MAPL to v2.1.0
@cmake to v2.1.0
@mathomp4 Will do and let you know.
@mathomp4 I could not checkout @env v3.0.0:
error: pathspec 'v3.0.0' did not match any file(s) known to git
I am currently in v2.0.2.
Sigh. I’m an idiot. @env is v2.1.0 and @cmake is v3.0.0.
Sorry about that. MAPL is, of course, v2.1.0
@mathomp4 Here is another issue:
-- Found MKL: /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_intel_lp64.so;/usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_sequential.so;/usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_core.so;-pthread
-- Found Python: /usr/bin/python3.4 (found version "3.4.6") found components: Interpreter
-- [GEOSctm] (1.0) [f278c74]
-- [MAPL] (2.1.0) [e23f20a]
-- Found Perl: /usr/bin/perl (found version "5.18.2")
CMake Error at src/Shared/@MAPL/GMAO_pFIO/tests/CMakeLists.txt:68 (string):
  string sub-command REPLACE requires at least four arguments.
-- Found PythonInterp: /usr/local/other/python/GEOSpyD/2019.10_py2.7/2020-01-15/bin/python (found version "2.7.16")
-- Configuring incomplete, errors occurred!
See also "/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/GEOSctm/build/CMakeFiles/CMakeOutput.log".
See also "/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/GEOSctm/build/CMakeFiles/CMakeError.log".
@JulesKouatchou I think your @cmake is at v2.1.0. That's the one that needs to be at v3.0.0.
@mathomp4 I used the following:
@env v2.1.0
@cmake v3.0.0
@MAPL v2.1.0
and got the same error message after about 15 days of integration.
My code is at: /gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/GEOSctm
and my experiment directory at: /gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/IdealPT
@JulesKouatchou Well that is baffling. You don't get any kind of error at all?! I mean, you get the "we are crashing now" error, but nothing seems to come from the actual code.
I might need to get @bena-nasa or @atrayano on this. You have found an actual bug we need to fix.
Can I have you try one more thing? Can you try editing your ctm_run.j to add these environment variables at around line 894, just before the run command? These are all the variables we've found so far that seem to help with the issues Bill has seen running on the system. My guess is it won't help, but we can try.
setenv PSM2_MEMORY large
setenv I_MPI_ADJUST_GATHERV 3
setenv I_MPI_ADJUST_ALLREDUCE 12
setenv I_MPI_EXTRA_FILESYSTEM 1
setenv I_MPI_EXTRA_FILESYSTEM_FORCE gpfs
setenv ROMIO_FSTYPE_FORCE "gpfs:"
@JulesKouatchou Actually, I forgot you were splitting errors. The real error was in the .e file.
I might have a different thing for you to try. You seem to have hit an error others sometimes see on the Haswells. Intel provided some other advice:
Please try to tune maximal virtual size of “shm-heap” by I_MPI_SHM_HEAP_VSIZE ( https://software.intel.com/en-us/mpi-developer-reference-linux-other-environment-variables )
For example, try setting I_MPI_SHM_HEAP_VSIZE=4096 (it sets 4096 MB per rank for the virtual size of the "shm-heap"). If that works fine, please try decreasing the size, for example to I_MPI_SHM_HEAP_VSIZE=2048, and so on (1024, 512, 256, ...).
Please find and tell us the minimum size of I_MPI_SHM_HEAP_VSIZE at which the program works fine. We can increase the default value of I_MPI_SHM_HEAP_VSIZE in a future Intel MPI release.
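Intel's suggestion amounts to a downward sweep over halving values. A minimal sketch of that procedure is below; run_ctm_segment is a hypothetical stand-in for submitting one CTM job segment, stubbed out here (with an arbitrary success threshold) purely so the loop is self-contained and runnable:

```shell
#!/bin/sh
# Stub standing in for one real CTM job segment; replace with the actual
# run/submit command. For illustration only, it "succeeds" at 512 MB and up.
run_ctm_segment() { [ "${I_MPI_SHM_HEAP_VSIZE}" -ge 512 ]; }

# Halve I_MPI_SHM_HEAP_VSIZE (MB of shm-heap per rank) until a run fails;
# the last value that worked is the minimum to report back to Intel.
sweep_vsize() {
    for vsize in 4096 2048 1024 512 256 128; do
        export I_MPI_SHM_HEAP_VSIZE=$vsize
        if run_ctm_segment; then
            echo "VSIZE=$vsize OK"
        else
            echo "VSIZE=$vsize failed; smallest working value was $((vsize * 2)) MB"
            return 0
        fi
    done
}

sweep_vsize
```

In practice each "segment" is a multi-day integration, so the sweep is run one job at a time rather than in a tight loop.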
Note: if you don't have time to run these tests, let me know and I can work with Ben or someone else and we can quickly try them all out.
@mathomp4 I will run the tests and let you know.
Thanks. Note I found a bug with MAPL and MPT today so even moving to MPT might take a fix. Go me!
@mathomp4 Conducting one 4-month run (I_MPI_SHM_HEAP_VSIZE=4096). So far it is at the end of the first month and still going. That is great news, as I was not able to get past 15 days of integration before.
Good to hear!
As Intel said, can you try lowering that in halves? The larger it is, the more memory Intel MPI reserves per process, so we want the smallest value that works for you.
@mathomp4 So far, settings of I_MPI_SHM_HEAP_VSIZE at 4096, 2048, 1024, 512 and 256 are working. I will soon start testing with 128.
@JulesKouatchou Thanks for doing this. Now my fear is that it'll work with I_MPI_SHM_HEAP_VSIZE=1, which would mean something a bit fundamental.
But you've already lowered it a lot, which is nice.
@mathomp4 Unfortunately, the lowest setting might be I_MPI_SHM_HEAP_VSIZE=512. The run with 256 crashed (same error message as before) after 2 months and 27 days of integration.
Still, that is good to know. I'll pass it on to Scott to test and to Intel.
I suppose you could integrate that into ctm_setup or the run script or wherever. That way it's on by default for you. I might do the same in the GCM.
@mathomp4 I included the I_MPI_SHM_HEAP_VSIZE=512 setting on my CTM branch jkGEOSctm_on_SLESS12. I did several "long" tests with Intel MPI to confirm that the code no longer crashes and exits gracefully.
@JulesKouatchou Thanks for moving that @SETENVS block, as it was in the wrong place. If you can, you might want to add two more variables that the GCM now runs with by default:
setenv I_MPI_ADJUST_ALLREDUCE 12
setenv I_MPI_ADJUST_GATHERV 3
I think these are more important on the Skylakes, but GCM will be running with them for Intel MPI everywhere. The first fixes an issue at high-resolution for Bill, so you might never see it in a CTM, but the second one fixes an issue Ben was able to trigger at C180 at 8x48 which isn't that enormous.
I know the GCM (for all our testing) is zero-diff with them. I have to imagine the CTM would be as well, but I don't know how to test.
But that can also be a second PR, if you like, that I can make after you get this in.
@mathomp4 Thank you for the new settings. I want to have something that works on SLES12 first before doing internal CTM tests.
Hi, Jules,
Do you mean with this env “I_MPI_SHM_HEAP_VSIZE=512” , there won’t be MPI_Finalize Failure?
Thanks Weiyuan
@weiyuan-jiang I think the I_MPI_SHM_HEAP_VSIZE variable helps with the "unexpected failures" in the runs. The MPI_Finalize issues should be taken care of with a newer MAPL, via the workaround we did in MAPL_Cap.
Note that Scott is currently testing the GCM with I_MPI_SHM_HEAP_VSIZE. For him it's looking like anything other than zero is what's needed, but we might go with I_MPI_SHM_HEAP_VSIZE=512 since @JulesKouatchou found actual proof it's a useful number.
I've asked NCCS about their thoughts on it (note: this value is probably only needed on Haswell, so I'll probably code up the GCM's scripts to apply it only if Intel MPI + Haswell).
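That "Intel MPI + Haswell only" conditional could be sketched as follows. Again this is POSIX sh for illustration (the real GEOS scripts are csh), and the function name and the "intelmpi"/"hasw" labels are assumptions:

```shell
#!/bin/sh
# Hypothetical helper: emit the setenv line for the run script only when
# the job uses Intel MPI on Haswell nodes; any other stack/node
# combination gets nothing, since the variable is not needed there.
maybe_set_vsize() {
    mpi_stack="$1"    # assumed label: "intelmpi" or "mpt"
    proc_type="$2"    # assumed label: "hasw" or "sky"
    if [ "$mpi_stack" = "intelmpi" ] && [ "$proc_type" = "hasw" ]; then
        echo "setenv I_MPI_SHM_HEAP_VSIZE 512"
    fi
}
```

Emitting the csh setenv line as text keeps the logic in one place while the run scripts themselves stay csh.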
Also, Bill Putman has, I think, four other variables he uses at night for his runs. I think three of them might be considered "generally useful", but I'm waiting for NCCS to respond before I add them to the GCM. If they are, I'll pass them along here as well.
@JulesKouatchou Could you please update the components.yaml and Externals.cfg to reflect the versions of the repos that you are satisfied with? (See your comment from April 18 above.) Also, do we still need to use MPT to prevent crashing?
@mmanyin The last experiments that I did were about two weeks ago. I did several long runs and noticed that the code crashed after about 165 days of integration (in one job segment) even after increasing the value of I_MPI_SHM_HEAP_VSIZE. @mathomp4 mentioned that Bill is using other settings that we need to include too. In another matter, the code is still not exiting gracefully when I use Intel MPI.
Do you want me to add the versions below as defaults for the CTM?
@env v2.1.0
@cmake v3.0.0
@mapl v2.1.0
I think the graceful-exit problem is probably due to not having a new enough MAPL. We fixed that, we think, in 2.1.3. The GCM is currently using (in master, not yet in release):
The other Bill flags probably won't help much. He has some that I think only affect high res runs. The important ones are the I_MPI_ADJUST_ALLREDUCE, I_MPI_ADJUST_GATHERV, and the I_MPI_SHM_HEAP_VSIZE we think.
@mathomp4 and @mmanyin I used:
and also the settings I_MPI_ADJUST_ALLREDUCE, I_MPI_ADJUST_GATHERV, and I_MPI_SHM_HEAP_VSIZE (512, 1024, 2048). The code exited gracefully but still crashed at the same integration date regardless of the value of I_MPI_SHM_HEAP_VSIZE.
ctm_setup:
ctm_run.j: