GEOS-ESM / GEOSgcm

GEOS Earth System Model GEOSgcm Fixture
Apache License 2.0

Testing S2S ODAS #25

Closed ehackert closed 4 years ago

ehackert commented 5 years ago

Tom asked me to test the GitHub version of the ODAS code. So far I have set up a run in which I pulled the GitHub versions into the same directory structure that the old CVS version used. This experiment is located in

/gpfsm/dnb42/projects/p17/ehackert/geos5/exp/eh018/*. 

For the first pass, I reused the model (GEOSgcm.x) from the CVS version. The error file is located in

/gpfsm/dnb42/projects/p17/ehackert/geos5/exp/eh018/oldruns/eh018.e33633168

More recently I copied the GEOSgcm.x from the GitHub version and reran. These results are located in

/gpfsm/dnb42/projects/p17/ehackert/geos5/exp/eh018/eh018.e33638608

Tom suggested that I swap out the g5_modules used in the shell, so I will run this test to see whether some (or all) of the errors go away. I'll let you know the outcome.

Thanks.

eric

tclune commented 5 years ago

I've edited the original ticket to fix a typo in the paths and make the output references a bit easier to read. Here I'll include snippets of the actual error messages.

The first error log shows

mpirun -np $NP $UMD_LETKFUTILS/ocean_sponge.py $yyyy $mm $dd > ocean_sponge.out
rm temp_salt_sponge.nc
rm: cannot remove `temp_salt_sponge.nc': No such file or directory
$UMD_LETKFUTILS/ocean_iau.x -DO_SPONGE ${ODAS_dt_restore_sst}
/gpfsm/dnb42/projects/p17/ehackert/geos5/exp/eh018/ocean_das/UMD_Etc/UMD_utils//ocean_iau.x: symbol lookup error: /gpfsm/dnb42/projects/p17/ehackert/geos5/exp/eh018/ocean_das/UMD_Etc/UMD_utils//ocean_iau.x: undefined symbol: mpi_sgi_inplace
cp temp_salt_sponge.nc $SCRDIR/INPUT/
cp: cannot stat `temp_salt_sponge.nc': No such file or directory
ln -s temp_salt_sponge.nc $SCRDIR/INPUT/temp_sponge_coeff.nc
ln -s temp_salt_sponge.nc $SCRDIR/INPUT/temp_sponge.nc

Further into the second output, this type of error shows up:

@ NPES = $NX * $NY
$RUN_CMD $NPES ./GEOSgcm.x | tee geos.out
./GEOSgcm.x: error while loading shared libraries: libmpi++abi1002.so: cannot open shared object file: No such file or directory
./GEOSgcm.x: error while loading shared libraries: libmpi++abi1002.so: cannot open shared object file: No such file or directory

mathomp4 commented 5 years ago

First things first, yes, I believe some/all of this is related to g5_modules. The first experiment you pointed me to was built with Intel MPI according to this:

/gpfsm/dnb42/projects/p17/ehackert/geos5/sandbox_try4/GEOSodas/src/g5_modules

You definitely can't use an Intel MPI g5_modules with MPT executables and vice-versa.
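One way to spot this kind of mismatch is to look at which MPI shared libraries an executable asks for and whether the current environment can resolve them. Below is a minimal Python sketch that parses the text printed by `ldd`; the function name `mpi_libraries` and the sample paths are made up for illustration, and only the `libmpi++abi1002.so` name comes from the error log above.

```python
def mpi_libraries(ldd_output):
    """Return (library_name, resolved_path_or_None) for MPI-looking libraries.

    `ldd_output` is the text printed by `ldd <executable>`. A library that
    the loader cannot find is reported with a path of None.
    """
    found = []
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue  # e.g. the linux-vdso line, which has no "=>" part
        name, _, target = line.strip().partition(" => ")
        if "mpi" not in name.lower():
            continue  # crude filter: keep only names containing "mpi"
        if target.startswith("not found"):
            path = None
        else:
            path = target.split(" (")[0]  # drop the trailing load address
        found.append((name, path))
    return found


# Hypothetical ldd output; the paths are invented for this example.
sample = """\
\tlinux-vdso.so.1 (0x00007ffd)
\tlibmpi++abi1002.so => not found
\tlibmpi.so => /hypothetical/mpt/lib/libmpi.so (0x00002b00)
\tlibc.so.6 => /lib64/libc.so.6 (0x00002b10)
"""

for name, path in mpi_libraries(sample):
    print(name, "->", path if path else "NOT FOUND")
```

To try it on a real build, the output of `ldd ./GEOSgcm.x`, captured after sourcing the g5_modules you intend to run with, could be passed in as `ldd_output`. Any MPI library reported as not found suggests the loaded modules do not match the MPI stack the executable was built against.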

Second, you'll want to look over your scripts for references to mpirun. With MPT, mpirun does Very Weird Things™. The easiest solution is to use esma_mpirun from the installation binary directory as it tries to auto-detect your MPI stack and use the right command. This is how things are done now, but your jobs seem to be from Heracles(?) days. At that point we hadn't quite gotten as general. You might have (in gcm_run.j):

setenv RUN_CMD "mpirun -np"

Now we do:

setenv RUN_CMD "$GEOSBIN/esma_mpirun -np"

In your testing, if your code was compiled with MPT, you'd at least need to use:

setenv RUN_CMD "mpiexec_mpt -np"
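The suggestion above to look over the scripts for references to mpirun could be sketched roughly as follows. This is only an illustrative Python helper (the name `find_bare_mpirun` is hypothetical, not part of GEOS); it flags lines that call `mpirun` directly while leaving `esma_mpirun` alone, and it does not try to understand csh syntax or comments.

```python
import re

# Match "mpirun" as a standalone word. The lookbehind rejects names such as
# "esma_mpirun", where "mpirun" is preceded by a word character ("_").
BARE_MPIRUN = re.compile(r"(?<!\w)mpirun\b")


def find_bare_mpirun(script_text):
    """Return (line_number, stripped_line) for lines invoking plain mpirun."""
    hits = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        if BARE_MPIRUN.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Sample built from the commands quoted in this thread.
sample_script = """setenv RUN_CMD "mpirun -np"
setenv RUN_CMD "$GEOSBIN/esma_mpirun -np"
mpirun -np $NP $UMD_LETKFUTILS/ocean_sponge.py $yyyy $mm $dd > ocean_sponge.out
"""

for lineno, text in find_bare_mpirun(sample_script):
    print(f"line {lineno}: {text}")
```

Each reported line is a candidate for switching to `$GEOSBIN/esma_mpirun` (or `mpiexec_mpt` for an MPT build), as described above.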

I do see one other possible excitement:

mpirun -np $NP $UMD_LETKFUTILS/ocean_sponge.py $yyyy $mm $dd > ocean_sponge.out

Are you using mpi4py? Because support for that is tricky.

ehackert commented 5 years ago

Hi Matt, thanks for taking a look at this. I made your suggested corrections, replacing all instances of the old g5_modules. In addition, I applied the suggested MPI run command correction in all scripts. I am submitting this now. Thanks.

eric

ehackert commented 5 years ago

Hi Matt,

I tried replacing all instances of g5_modules with the Git version. In addition, I replaced all the MPI commands with your suggestions. Finally, I replaced read_merra2_bcs.so with the Git version, since it looked like the run was choking there. Now the code is complaining about a plotting call in ocean_sponge.py (see eh018.e33728536 for details). Also, the model is bombing immediately. Any help you can suggest would be appreciated. Thanks.

Eric

mathomp4 commented 5 years ago

@ehackert

Looking at /gpfsm/dnb42/projects/p17/ehackert/geos5/exp/eh018/eh018.o33728536, my guess is that whatever CAP.rc.tmpl (or similar file) you are using still has GCS as the root:

MAPLROOT_COMPNAME: GCS
        ROOT_NAME: GCS

This was changed to GCM when we moved to GitHub:

MAPLROOT_COMPNAME: GCM
        ROOT_NAME: GCM

Try making that change and things might go further.
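
If several resource templates need the same change, the rename could be scripted. The following is a rough Python sketch, not an official GEOS tool: the function name `update_root_name` is hypothetical, and it only touches the two keys shown above, leaving any other line that happens to contain GCS unchanged.

```python
import re

# Match the two root-name keys with a value of GCS, keeping the original
# indentation and spacing in group 1 so the file layout is preserved.
ROOT_KEYS = re.compile(
    r"^([ \t]*(?:MAPLROOT_COMPNAME|ROOT_NAME):[ \t]*)GCS\b",
    flags=re.MULTILINE,
)


def update_root_name(rc_text):
    """Return rc_text with MAPLROOT_COMPNAME/ROOT_NAME switched from GCS to GCM."""
    return ROOT_KEYS.sub(r"\g<1>GCM", rc_text)


# OTHER_KEY is a made-up entry to show that unrelated lines are left alone.
old_rc = (
    "MAPLROOT_COMPNAME: GCS\n"
    "        ROOT_NAME: GCS\n"
    "OTHER_KEY: GCS\n"
)

print(update_root_name(old_rc))
```

Reading each template, passing its text through `update_root_name`, and writing the result back (after keeping a backup copy) would apply the change in place.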