JulesKouatchou opened this issue 4 years ago
@mathomp4 There was a build error: "No configuration was found in your project. Please refer to https://circleci.com/docs/2.0/ to get started with your configuration." Is the CI stuff working properly for CTM? Thanks.
@JulesKouatchou Do I understand correctly that you have changed the run-time environment variables to be appropriate for using MPT? How do we compile with MPT? I thought the default was to compile with Intel MPI.
@mmanyin Until #23 is merged in, there is no way for CircleCI to find a configuration since it only exists on a branch, not master. I had set up CircleCI to follow GEOSctm thinking the config file would get it. Since it might be a while, would you like me to turn off CircleCI following GEOSctm?
Actually I will go ahead with #23 . Sorry for the confusion!
Well, that was unexpected.
@JulesKouatchou When you have a chance can you do a fresh clone of GEOSctm and then a fresh checkout of your branch, and then try running with MPT?
I just did a "resolve conflict" for your branch (so it could merge in) and, weirdly, Git now seems to say that the ctm_setup isn't "new". I mean, it seems to have all the right bits for MAPL 2 on MPT, but... weird.
On the plus side, @mmanyin, it looks like that "resolve conflict" is letting CircleCI run!
@mathomp4 I will and let you know.
@mathomp4 When I do:
git clone git@github.com:GEOS-ESM/GEOSctm.git
cd GEOSctm/
git checkout -b jkGEOSctm_on_SLESS12
checkout_externals
source @env/g5_modules
Intel MPI gets loaded. I need MPT.
Jules,
You'll need to:
cp /gpfsm/dhome/mathomp4/GitG5Modules/SLES12/6.0.4/g5_modules.intel1805.mpt217 @env/g5_modules
to get MPT as an MPI stack
@mathomp4 Here are my steps:
git clone git@github.com:GEOS-ESM/GEOSctm.git
cd GEOSctm
git checkout jkGEOSctm_on_SLESS12
checkout_externals
cp /gpfsm/dhome/mathomp4/GitG5Modules/SLES12/6.0.4/g5_modules.intel1805.mpt217 @env/g5_modules
source @env/g5_modules
Things appear to be fine. I am currently doing a long run to make sure that the code does not crash.
Thanks.
Sounds good! If all works, you can set the appropriate "required label". I'm guessing 0-diff is good enough since your changes can't change results, right?
@mathomp4 This is the first step: I want the code to be able to compile and run. Ideally, I want the same code to compile and run on the SLES11 nodes too (though they will disappear soon). I will then be able to do the comparison.
@mathomp4 My long run did not have any issues. You asked me to copy the file g5_modules.intel1805.mpt217. Is it possible to make it part of the repository? I want the MPT module to be the default for the CTM.
Jules,
We can do that for sure, but then when the hundreds of Skylake nodes come online for general users, they will not be able to use them with the CTM, since MPT is not available there. Intel MPI allows users to run on every node at NCCS.
Before we do that, ctm_run.j should be altered so that if anyone ever tries to run on the Skylakes at NCCS with MPT, the CTM immediately errors out with a non-zero status code, and maybe a note saying what's happening so that the user doesn't try to contact NCCS or the SI Team. I mean, the job will crash anyway, but I think it would be an obscure-looking loader error.
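A guard like that could look roughly like the sketch below. This is a POSIX sh illustration only (ctm_run.j itself is csh), and the function name, the "sky"/"hasw"/"mpt"/"intelmpi" labels, and the argument convention are all hypothetical, not actual ctm_run.j variables:

```shell
#!/bin/sh
# Hypothetical early-exit guard for the run script: refuse to start when
# MPT is paired with Skylake nodes, where MPT is not available, instead
# of letting the job die later with an obscure loader error.
guard_mpt_skylake() {
    proc_type="$1"    # assumed label: "sky" for Skylake, "hasw" for Haswell
    mpi_stack="$2"    # assumed label: "mpt" or "intelmpi"
    if [ "$proc_type" = "sky" ] && [ "$mpi_stack" = "mpt" ]; then
        echo "ERROR: MPT is not available on the NCCS Skylake nodes." >&2
        echo "Rebuild with Intel MPI or request Haswell nodes." >&2
        return 1
    fi
    return 0
}
```

Called near the top of the script, this would stop the job with a clear message before any MPI processes launch.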
@mathomp4 Sorry that I am coming back to this only now. I am wondering if there could be (for now) a flag that sets MPT as the first option and Intel MPI as the second. I am willing to modify the ctm_run.j file if I know what options are available in g5_modules.
@JulesKouatchou I don't think so, not as long as GEOS uses g5_modules. The issue is that it is both a script that is run and a file that is sourced. This severely limits its flexibility because you can break it very easily (for example, you cannot do source g5_modules -option).
If you require MPT, I can create a special branch/tag of ESMA_env for you.
You should also contact NCCS and let them know that Intel MPI does not work for your code. They will be interested in this and would probably want to contact Intel regarding the fault.
I have seen Intel MPI crash during Finalize, when running the GCM under SLES12. @JulesKouatchou please CC me when you contact NCCS about this problem; I will open a case as well, and CC you.
@mathomp4 @mmanyin I have tried to build the simplest test case possible (using Intel MPI on SLES12 nodes) where the code does not exit gracefully. So far I have not duplicated the problem with either a pure MPI program or an ESMF program. I now want to try a code that uses MAPL.
@JulesKouatchou We might have a workaround for the MPI_Finalize issue. I found an MPI command which essentially "turns off error output" and @bena-nasa seemed to be able to show it helped.
We are looking at adding it into MAPL with some good protections so we don't turn off all MPI errors.
@mathomp4 Great! Let me know when the workaround is ready so that I can test it.
Jules, try out MAPL v2.0.6 (aka git checkout v2.0.6 in MAPL)
Note, you're behind on a lot of things in the CTM (in its mepo/externals bits), but v2.0.0 and v2.0.6 are still similar.
@mathomp4 Here is a summary of what happened when I used MAPL v2.0.6.
It seems that MPT might be the option (for now) for the CTM.
@JulesKouatchou Well that's annoying. Can you point me to the output so I can look at the errors?
Also, if you can, can you try one more test? It would be interesting to see if MAPL 2.1 helps at all. Plus you can be the first to try the CTM with it.
For that, you'll want to clone a new CTM somewhere rather than re-use the current one. Then after cloning and doing the mepo/checkout_externals update:
@env to v3.0.0
@MAPL to v2.1.0
@cmake to v2.1.0
@mathomp4 Will do and let you know.
@mathomp4 I could not checkout @env v3.0.0:
error: pathspec 'v3.0.0' did not match any file(s) known to git
I am currently in v2.0.2.
Sigh. I’m an idiot. @env is v2.1.0 and @cmake is v3.0.0.
Sorry about that. MAPL is, of course, v2.1.0
@mathomp4 Here is another issue:
-- Found MKL: /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_intel_lp64.so;/usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_sequential.so;/usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_core.so;-pthread
-- Found Python: /usr/bin/python3.4 (found version "3.4.6") found components: Interpreter
-- [GEOSctm] (1.0) [f278c74]
-- [MAPL] (2.1.0) [e23f20a]
-- Found Perl: /usr/bin/perl (found version "5.18.2")
CMake Error at src/Shared/@MAPL/GMAO_pFIO/tests/CMakeLists.txt:68 (string):
  string sub-command REPLACE requires at least four arguments.
-- Found PythonInterp: /usr/local/other/python/GEOSpyD/2019.10_py2.7/2020-01-15/bin/python (found version "2.7.16")
-- Configuring incomplete, errors occurred!
See also "/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/GEOSctm/build/CMakeFiles/CMakeOutput.log".
See also "/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/GEOSctm/build/CMakeFiles/CMakeError.log".
@JulesKouatchou I think your @cmake is at v2.1.0. That's the one that needs to be at v3.0.0.
@mathomp4 I used the following:
@env v2.1.0
@cmake v3.0.0
@MAPL v2.1.0
and got the same error message after about 15 days of integration.
My code is at: /gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/GEOSctm
and my experiment directory at: /gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/IdealPT
@JulesKouatchou Well that is baffling. You don't get any kind of error at all?! I mean, you get the "we are crashing now" error, but nothing seems to come from the actual code.
I might need to get @bena-nasa or @atrayano on this. You have found an actual bug we need to fix.
Can I have you try one more thing? Can you try editing your ctm_run.j to add these environment variables at around line 894, just before the run command? These are all the variables we've found so far that seem to help with the issues Bill has seen running on the system. My guess is it won't help, but we can try.
setenv PSM2_MEMORY large
setenv I_MPI_ADJUST_GATHERV 3
setenv I_MPI_ADJUST_ALLREDUCE 12
setenv I_MPI_EXTRA_FILESYSTEM 1
setenv I_MPI_EXTRA_FILESYSTEM_FORCE gpfs
setenv ROMIO_FSTYPE_FORCE "gpfs:"
@JulesKouatchou Actually, I forgot you were splitting errors. The real error was in the .e file.
I might have a different thing for you to try. You seem to have hit an error others sometimes see on the Haswells. Intel provided some other advice:
Please try to tune maximal virtual size of “shm-heap” by I_MPI_SHM_HEAP_VSIZE ( https://software.intel.com/en-us/mpi-developer-reference-linux-other-environment-variables )
For example, try setting I_MPI_SHM_HEAP_VSIZE=4096 (it sets 4096 MB per rank for the virtual size of the "shm-heap"). If that works fine, please try decreasing the size, for example to I_MPI_SHM_HEAP_VSIZE=2048, and so on (1024, 512, 256, ...).
Please find and tell us the minimum size of I_MPI_SHM_HEAP_VSIZE at which the program works fine. We can increase the default value of I_MPI_SHM_HEAP_VSIZE in a future Intel MPI release.
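Intel's suggestion amounts to a downward sweep over halving values. A minimal sketch of that procedure is below; run_ctm_segment is a hypothetical stand-in for submitting one CTM job segment, stubbed out here (with an arbitrary success threshold) purely so the loop is self-contained and runnable:

```shell
#!/bin/sh
# Stub standing in for one real CTM job segment; replace with the actual
# run/submit command. For illustration only, it "succeeds" at 512 MB and up.
run_ctm_segment() { [ "${I_MPI_SHM_HEAP_VSIZE}" -ge 512 ]; }

# Halve I_MPI_SHM_HEAP_VSIZE (MB of shm-heap per rank) until a run fails;
# the last value that worked is the minimum to report back to Intel.
sweep_vsize() {
    for vsize in 4096 2048 1024 512 256 128; do
        export I_MPI_SHM_HEAP_VSIZE=$vsize
        if run_ctm_segment; then
            echo "VSIZE=$vsize OK"
        else
            echo "VSIZE=$vsize failed; smallest working value was $((vsize * 2)) MB"
            return 0
        fi
    done
}

sweep_vsize
```

In practice each "segment" is a multi-day integration, so the sweep is run one job at a time rather than in a tight loop.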
Note: if you don't have time to run these tests, let me know and I can work with Ben or someone else and we can quickly try them all out.
@mathomp4 I will run the tests and let you know.
Thanks. Note I found a bug with MAPL and MPT today so even moving to MPT might take a fix. Go me!
@mathomp4 Conducting one 4-month run (I_MPI_SHM_HEAP_VSIZE=4096). So far it is at the end of the first month and still going. That is great news, as I was not able to get past 15 days of integration before.
Good to hear!
As Intel said, can you try lowering that in halves? The larger it is, the more memory Intel MPI reserves per process, so we want the smallest value that works for you.
@mathomp4 So far, settings of I_MPI_SHM_HEAP_VSIZE at 4096, 2048, 1024, 512 and 256 are working. I will soon start testing with 128.
@JulesKouatchou Thanks for doing this. Now my fear is that it'll work with I_MPI_SHM_HEAP_VSIZE=1, which would mean something a bit fundamental.
But you've already lowered it a lot, which is nice.
@mathomp4 Unfortunately, the lowest setting might be I_MPI_SHM_HEAP_VSIZE=512. The run with 256 crashed (same error message as before) after 2 months and 27 days of integration.
Still, that is good to know. I'll pass it on to Scott to test and to Intel.
I suppose you could integrate that into ctm_setup or the run script or wherever. That way it's on by default for you. I might do the same in the GCM.
@mathomp4 I included the I_MPI_SHM_HEAP_VSIZE=512 setting on my CTM branch jkGEOSctm_on_SLESS12. I did several "long" tests with Intel MPI to confirm that the code no longer crashes and exits gracefully.
@JulesKouatchou Thanks for moving that @SETENVS block, as it was in the wrong place. If you can, you might want to add two more variables that the GCM now runs with by default:
setenv I_MPI_ADJUST_ALLREDUCE 12
setenv I_MPI_ADJUST_GATHERV 3
I think these are more important on the Skylakes, but GCM will be running with them for Intel MPI everywhere. The first fixes an issue at high-resolution for Bill, so you might never see it in a CTM, but the second one fixes an issue Ben was able to trigger at C180 at 8x48 which isn't that enormous.
I know the GCM (for all our testing) is zero-diff with them. I have to imagine the CTM would be as well, but I don't know how to test.
But that can also be a second PR, if you like, that I can make after you get this in.
@mathomp4 Thank you for the new settings. I want to have something that works on SLES12 first before doing internal CTM tests.
Hi, Jules,
Do you mean with this env “I_MPI_SHM_HEAP_VSIZE=512” , there won’t be MPI_Finalize Failure?
Thanks Weiyuan
@weiyuan-jiang I think the I_MPI_SHM_HEAP_VSIZE variable helps with the "unexpected failures" in the runs. The MPI_Finalize issues should be taken care of with a newer MAPL, via the workaround we did in MAPL_Cap.
Note that Scott is currently testing the GCM with I_MPI_SHM_HEAP_VSIZE. For him it's looking like anything other than zero is what's needed, but we might go with I_MPI_SHM_HEAP_VSIZE=512 since @JulesKouatchou found actual proof it's a useful number.
I've asked NCCS about their thoughts on it (note: this value is probably only needed on Haswell, so I'll probably code up the GCM's scripts to apply it only if Intel MPI + Haswell).
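That "Intel MPI + Haswell only" conditional could be sketched as follows. Again this is POSIX sh for illustration (the real GEOS scripts are csh), and the function name and the "intelmpi"/"hasw" labels are assumptions:

```shell
#!/bin/sh
# Hypothetical helper: emit the setenv line for the run script only when
# the job uses Intel MPI on Haswell nodes; any other stack/node
# combination gets nothing, since the variable is not needed there.
maybe_set_vsize() {
    mpi_stack="$1"    # assumed label: "intelmpi" or "mpt"
    proc_type="$2"    # assumed label: "hasw" or "sky"
    if [ "$mpi_stack" = "intelmpi" ] && [ "$proc_type" = "hasw" ]; then
        echo "setenv I_MPI_SHM_HEAP_VSIZE 512"
    fi
}
```

Emitting the csh setenv line as text keeps the logic in one place while the run scripts themselves stay csh.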
Also, Bill Putman has, I think, four other variables he uses at night for his runs. I think three of them might be considered "generally useful", but I'm waiting for NCCS to respond before I add them to the GCM. If they are, I'll pass them along here as well.
@JulesKouatchou Could you please update the components.yaml and Externals.cfg to reflect the versions of the repos that you are satisfied with? (See your comment from April 18 above.) Also, do we still need to use MPT to prevent crashing?
@mmanyin The last experiments that I did were about two weeks ago. I did several long runs and noticed that the code crashed after about 165 days of integration (in one job segment) even after increasing the value of I_MPI_SHM_HEAP_VSIZE. @mathomp4 mentioned that Bill is using other settings that we need to include too. In another matter, the code is still not exiting gracefully when I use Intel MPI.
Do you want me to add the versions below as defaults for the CTM?
@env v2.1.0
@cmake v3.0.0
@mapl v2.1.0
I think the graceful-exit problem is probably due to not having a new enough MAPL. We fixed that, we think, in 2.1.3. The GCM is currently using (in master, not yet in release):
The other Bill flags probably won't help much. He has some that I think only affect high res runs. The important ones are the I_MPI_ADJUST_ALLREDUCE, I_MPI_ADJUST_GATHERV, and the I_MPI_SHM_HEAP_VSIZE we think.
@mathomp4 and @mmanyin I used:
and also the settings I_MPI_ADJUST_ALLREDUCE, I_MPI_ADJUST_GATHERV, and I_MPI_SHM_HEAP_VSIZE (512, 1024, 2048). The code exited gracefully but still crashed at the same integration date regardless of the value of I_MPI_SHM_HEAP_VSIZE.
ctm_setup:
ctm_run.j: