NOAA-EMC / hpc-stack

Create a software stack for HPCs
GNU Lesser General Public License v2.1

[INSTALL] Library modules to support GFSv16.2.0 on Hera/Orion #379

Open · KateFriedman-NOAA opened this issue 2 years ago

KateFriedman-NOAA commented 2 years ago

In order to support the new GFSv16.2.0 (WCOSS2 port version) on Hera and Orion we need the same library module versions available. Below I list the versions that are currently being used in the new operational GFSv16.2.0 and which ones are missing on Hera/Orion.

Which software (and version) in the stack would you like installed?

Hera & Orion:

Which machines would you like to have the software installed?

Hera, Orion

Additional context

Here are the build.ver module versions for GFSv16.2.0: https://github.com/NOAA-EMC/global-workflow/blob/feature/ops-wcoss2/versions/build.ver

Refs: https://github.com/NOAA-EMC/global-workflow/issues/639
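
One quick way to see what is missing on a given machine is to walk the export lines of build.ver and query the module system. A rough sketch (assumes an Lmod shell with the module function available, and that module names match the *_ver variable names, which is not true for every entry):

while read -r line; do
  case "$line" in
    "export "*_ver=*)
      name=${line#export }; name=${name%%_ver=*}
      ver=${line##*=}
      # Lmod prints avail output on stderr, hence the redirect
      if module -t avail "${name}/${ver}" 2>&1 | grep -q "${name}/${ver}"; then
        echo "found:   ${name}/${ver}"
      else
        echo "missing: ${name}/${ver}"
      fi
      ;;
  esac
done < versions/build.ver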

KateFriedman-NOAA commented 2 years ago

@GeorgeGayno-NOAA Please use the netcdf/4.7.4 version and not the parallel version anymore. And yes, please use the hpc-stack version. For supporting GFSv16.2.0 on Hera/Orion we will all be using these special hpc-stack-gfsv16 hpc-stack installs that Hang and Kyle are installing for us now. WCOSS2 ops will remain as is. GFSv17 will use the newer/non-special hpc-stack installs. Thanks for checking!

WenMeng-NOAA commented 2 years ago

The UPP executable builds successfully with hpc-intel/2022.1.2 and hpc-impi/2022.1.2 on Hera.

KateFriedman-NOAA commented 2 years ago

@Hang-Lei-NOAA I see wrf_io/1.2.0 but we're setting export wrf_io_ver=1.1.1 elsewhere (available on WCOSS2 and Orion currently). Can we get wrf_io/1.1.1 in the Hera hpc-stack-gfsv16 stack, or do you know if wrf_io/1.2.0 will work similarly?

All of the other needed module versions that global-workflow loads appear to be available. Thanks!

kgerheiser commented 2 years ago

@KateFriedman-NOAA I would expect wrf_io v1.2.0 to work.

v1.1.1 was updated to v1.2.0 because the GSI build wasn't working with v1.1.1. We fixed that and released v1.2.0, which is compatible with both UPP (@GeorgeVandenberghe-NOAA ran some tests) and GSI.

kgerheiser commented 2 years ago

I have also finished compiling on Orion; the stack is available via module use /apps/contrib/NCEP/libs/hpc-stack-gfsv16/modulefiles/stack.

In there you will find modules for hpc-intel/2022.1.2 and hpc-intel/2018.4. You can test and compare as you wish, and depending on which version we go with I'll remove the unused one.
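
For anyone following along, picking the stack up looks like the usual hpc-stack metamodule sequence. A sketch (substitute hpc-intel/2018.4 and the matching hpc-impi to compare the two compilers):

module use /apps/contrib/NCEP/libs/hpc-stack-gfsv16/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2   # or hpc-intel/2018.4
module load hpc-impi/2022.1.2    # MPI metamodule matching the chosen compiler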

Hang-Lei-NOAA commented 2 years ago

I could add the requested intel version for the GFSv16.2 project. Is it intel 18.0.5.274?

Hang-Lei-NOAA commented 2 years ago

@KateFriedman-NOAA I have added an installation based on intel 18.0.5.274 under the same location: /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack-gfsv16/. You will see both intel-18.0.5.274 and intel-2022.1.2. Please decide which works best for you.

In addition, I will finish updating the official installation at /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack. Everything in hpc-stack-gfsv16 will also be included in the official installation, but the official installation will include additional older and newer versions.

Btw, if anyone is using crtm/2.4.0, we are going to issue a correction for some fix files. This has been examined over the past few days, and we just finished the test with GSI on this part. The installation on Hera was updated this afternoon, and the others will be updated soon.

WenMeng-NOAA commented 2 years ago

UPP still uses crtm/2.3.0 for GFS V16.2.

KateFriedman-NOAA commented 2 years ago

if anyone is using crtm/2.4.0, we are going to do a correction for some fix files

Noted, thanks! We're still going to be using crtm/2.3.0 for the GFSv16.2.0 system but it's good to have the newer one for GFSv17 work.

I would expect wrf_io v1.2.0 to work.

Ok good, thanks @kgerheiser !

I have added an installation based on intel 18.0.5.274 under the same location

Thanks @Hang-Lei-NOAA , appreciate you installing that on Hera! Since this install is for supporting the v16.2.0 and should match the ops version I prefer to stick with the 2018 install. I will test the 2018 install for the g-w builds and report any issues.

I have also finished compiling on Orion at module use /apps/contrib/NCEP/libs/hpc-stack-gfsv16/modulefiles/stack. In there you will find modules for hpc-intel/2022.1.2 and hpc-intel/2018.4. You can test and compare as you wish, and depending on which version we go with I'll remove the unused one.

Thanks @kgerheiser ! Appreciate you installing the 2022 version in case we needed it.

Given my comment above I prefer to stick with the 2018 install on Orion. @junwang-noaa @MichaelLueken-NOAA @WenMeng-NOAA @GeorgeGayno-NOAA @HelinWei-NOAA @YaliMao-NOAA please let me know of any objections to this decision to stick with the 2018 install in the hpc-stack-gfsv16 copies on both Hera and Orion. Please also test the 2018 install on Hera and let Hang know of any issues. I'll be doing the same for the workflow side and will be ready to start testing your updated v16.2.0 components on Hera/Orion/WCOSS2 when you're ready. Friendly reminder that we need to wrap up support work for v16.2.0 on Hera/Orion by the end of March (when the older pre-hpc-stack library installs get removed). Thanks!

@Hang-Lei-NOAA @kgerheiser We'll of course want to move to intel-2022.1.2 for the develop branch and the GFSv17 components. That will be available via the usual official hpc-stack installs, right? Double checking. Thanks!

KateFriedman-NOAA commented 2 years ago

Alrighty, just spoke with @arunchawla-NOAA offline and am pulling back on my request to stick with the 2018 versions.

@junwang-noaa @MichaelLueken-NOAA @WenMeng-NOAA @GeorgeGayno-NOAA @HelinWei-NOAA @YaliMao-NOAA Let's all try to build our respective v16.2.0 components on Hera/Orion using the newer 2022 intel version that @Hang-Lei-NOAA and @kgerheiser installed for us and see if we run into any issues. @Hang-Lei-NOAA @kgerheiser Please keep the 2018 version for now in case we find a need for it in supporting GFSv16.2.0 on Hera/Orion. Apologies for the back and forth on this! Thanks all!

Hang-Lei-NOAA commented 2 years ago

@KateFriedman-NOAA no worries. We are here to support what you guys need.

WenMeng-NOAA commented 2 years ago

@KateFriedman-NOAA @Hang-Lei-NOAA @kgerheiser I have been testing UPP on Hera and see the following error from wgrib2/2.0.7:

0.001 + wgrib2 tmpfile1_006_2 -set_grib_type same -new_grid_winds earth -new_grid_interpolation bilinear -if ':(CSNOW|CRAIN|CFRZR|CICEP|ICSEV):' -new_grid_interpolation neighbor -fi -set_bitmap 1 -set_grib_max_bits 16 -if ':(APCP|ACPCP|PRATE|CPRAT):' -set_grib_max_bits 25 -fi -if ':(APCP|ACPCP|PRATE|CPRAT|DZDT):' -new_grid_interpolation budget -fi -new_grid latlon 0:1440:0.25 90:721:-0.25 pgb2file_006_2_0p25 -new_grid latlon 0:360:1.0 90:181:-1.0 pgb2file_006_2_1p0 -new_grid latlon 0:720:0.5 90:361:-0.5 pgb2file_006_2_0p5
**IPOLATES package is not installed**
0.010 + err=8
0.010 + export err

Can you take a look at it?

kgerheiser commented 2 years ago

@WenMeng-NOAA that's a configuration issue on our end. I had set wgrib2 to build with ip2 (not ip).
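
For background, wgrib2 selects its interpolation library at compile time through a makefile flag. A sketch of the knob involved (assuming the wgrib2 2.0.x convention, where 1 links the ip library and 3 links ip2 — an assumption; check the wgrib2 makefile, and note hpc-stack drives this through its own build script):

# hypothetical illustration of the wgrib2 build knob
make USE_IPOLATES=1   # link against ip, enabling -new_grid interpolation
# USE_IPOLATES=3      # would link against ip2 instead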

WenMeng-NOAA commented 2 years ago

@KateFriedman-NOAA @kgerheiser @Hang-Lei-NOAA I have been testing UPP on Orion and got the following runtime failure:

1.030 + srun /home/wmeng/ovp/ncep_post/post_gfsv16_hpc/UPP/exec/ncep_post
1.032 + 0< itag 1> outpost_gfs_2021082406_postcntrl_gfs.xml
[Orion-11-72:331100:0:331100]      thread.c:225  Assertion `ucs_async_thread_global_context.thread != NULL' failed
[Orion-11-72:331103:0:331103]      thread.c:225  Assertion `ucs_async_thread_global_context.thread != NULL' failed
[Orion-11-72:331104:0:331104]      thread.c:225  Assertion `ucs_async_thread_global_context.thread != NULL' failed
[Orion-11-72:331107:0:331107]      thread.c:225  Assertion `ucs_async_thread_global_context.thread != NULL' failed
==== backtrace (tid: 331101) ====
 0 0x000000000004aa20 ucs_fatal_error_message()  ???:0
 1 0x000000000004abc5 ucs_fatal_error_format()  ???:0
 2 0x0000000000042257 ucs_async_pipe_drain()  ???:0
 3 0x000000000004232a ucs_async_pipe_drain()  ???:0
 4 0x00000000000409ca ucs_async_add_timer()  ???:0
 5 0x000000000003b813 uct_ud_iface_complete_init()  ???:0
 6 0x000000000003ee1c uct_ud_verbs_ep_t_delete()  ???:0
 7 0x000000000003f025 uct_ud_verbs_ep_t_delete()  ???:0
 8 0x000000000000d8cd uct_iface_open()  ???:0
 9 0x000000000001c8a0 ucp_worker_iface_open()  ???:0
10 0x000000000001cb5f ucp_worker_iface_init()  ???:0
11 0x000000000001de3a ucp_worker_create()  ???:0
12 0x000000000000a52f mlx_ep_open()  osd.c:0
13 0x00000000006bbca1 fi_endpoint()  /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_endpoint.h:164
14 0x00000000006bbca1 create_endpoint()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:2593
15 0x00000000006c24b0 MPIDI_OFI_mpi_init_hook()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:1898
16 0x00000000002109dd MPID_Init()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1307

The loaded modules for runtime are:

module purge
module use /apps/contrib/NCEP/libs/hpc-stack-gfsv16/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
module load netcdf/4.7.4
module load hdf5/1.10.6
module load prod_util/1.2.2
module load grib_util/1.2.3
module load crtm/2.3.0
module load wgrib2/2.0.7

Please advise on a fix.

kgerheiser commented 2 years ago

That's an ugly bug. It looks like it's coming from MPI. Is that with 2018 or 2022? I see it coming from 2022. Could I ask you to try the opposite compiler with module swap hpc-intel/2022.1.2 hpc-intel/2018.4?

WenMeng-NOAA commented 2 years ago

@kgerheiser Should I swap hpc-intel from 2022.1.2 to 2018 for both compiling and runtime?

kgerheiser commented 2 years ago

Yes. You can just load the hpc-intel/2018.4 compiler as you normally would. Everything besides MPI is the same underneath.
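
Concretely, starting from an environment with the 2022 metamodules loaded, the swap for both build and runtime would look something like this (a sketch):

module swap hpc-intel/2022.1.2 hpc-intel/2018.4
module swap hpc-impi/2022.1.2 hpc-impi/2018.4
# rebuild with these modules, then load the same set in the run script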

HelinWei-NOAA commented 2 years ago

@KateFriedman-NOAA GLDAS was built successfully on Hera/Orion using the newer 2022 intel version.

WenMeng-NOAA commented 2 years ago

@kgerheiser UPP is working with intel/2018.4 on Orion. But I saw the same IPOLATES package issue for wgrib2/2.0.7.

KateFriedman-NOAA commented 2 years ago

@Hang-Lei-NOAA @kgerheiser I am able to successfully build the GFSv16.2 global-workflow-owned codes on both Hera and Orion using the new hpc-stack-gfsv16 2022 intel stack install (hpc/1.2.0, hpc-intel/2022.1.2, hpc-impi/2022.1.2). These execs:

enkf_chgres_recenter.x
enkf_chgres_recenter_nc.x
fbwndgfs
fv3nc2nemsio.x
gaussian_sfcanl.exe
gfs_bufr
regrid_nemsio
supvit
syndat_qctropcy
syndat_maksynrc
syndat_getjtbul
tave.x
tocsbufr
vint.x

I'm not using wgrib2/2.0.7 in any of those builds, though, so I can't comment on @WenMeng-NOAA's issue.

Note: if one of the GFS components needs to use the 2018 intel, then all of them have to as well. This is because whatever I set at the workflow level (via versions/build.ver and versions/$target.ver) overrides any defaults in the components (those that support standalone builds) and forces the components to build with the same versions. This was one of the WCOSS2 port requirements from NCO, so we have the same versions top-down across the application. The same is enforced at runtime via versions/run.ver and the $target.ver files that set machine-specific values for the whole system (e.g. orion.ver).
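
For illustration, the pattern in those files looks like this (a hypothetical excerpt with values taken from this thread, not the actual GFSv16.2.0 file; the real contents are linked in the issue body):

# versions/build.ver -- single source of truth sourced by every component build
export hpc_ver=1.2.0
export hpc_intel_ver=2022.1.2
export hpc_impi_ver=2022.1.2
export netcdf_ver=4.7.4
export hdf5_ver=1.10.6
export crtm_ver=2.3.0
export wrf_io_ver=1.1.1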

kgerheiser commented 2 years ago

The wgrib2 problem is fixed with #404, but it's concerning that there's an MPI failure with Intel 2022. @WenMeng-NOAA could I ask you to try the same test on Hera with both versions of Intel?

Hang-Lei-NOAA commented 2 years ago

wgrib2/2.0.7 has been updated on Hera. It should be fine now.

GeorgeGayno-NOAA commented 2 years ago

@kgerheiser and @Hang-Lei-NOAA I am trying to compile a program using the stack version of netcdf (4.7.4). At the link step, there are numerous undefined references. This happens on both Orion and Hera. Here is an example:

/apps/contrib/NCEP/libs/hpc-stack-gfsv16/intel-2018.4/impi-2018.4/netcdf/4.7.4/lib/libnetcdff.a(nf_attio.o): In function `nf_put_att_text_':
nf_attio.F90:(.text+0xae): undefined reference to `nc_put_att_text'
/apps/contrib/NCEP/libs/hpc-stack-gfsv16/intel-2018.4/impi-2018.4/netcdf/4.7.4/lib/libnetcdff.a(nf_attio.o): In function `nf_put_att_text_a_':
nf_attio.F90:(.text+0x17e): undefined reference to `nc_put_att_text'
/apps/contrib/NCEP/libs/hpc-stack-gfsv16/intel-2018.4/impi-2018.4/netcdf/4.7.4/lib/libnetcdff.a(nf_attio.o): In function `nf_put_att_int1_':
nf_attio.F90:(.text+0x254): undefined reference to `nc_put_att_schar'
/apps/contrib/NCEP/libs/hpc-stack-gfsv16/intel-2018.4/impi-2018.4/netcdf/4.7.4/lib/libnetcdff.a(nf_attio.o): In function `nf_put_att_int2_':
nf_attio.F90:(.text+0x324): undefined reference to `nc_put_att_short'

WenMeng-NOAA commented 2 years ago

@GeorgeGayno-NOAA You might try: NETCDF_LDFLAGS = -L$(NETCDF)/lib -lnetcdff -lnetcdf -L$(HDF5_LIBRARIES) -lhdf5_hl -lhdf5 $(ZLIB_LIB)

kgerheiser commented 2 years ago

@WenMeng-NOAA's solution looks correct. @GeorgeGayno-NOAA I'm guessing it's the link order.
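
For background on why order matters: with static archives the linker resolves symbols left to right, so libnetcdff (the Fortran wrapper) must come before libnetcdf (the C library whose nc_put_att_* symbols it references), followed by HDF5 and zlib. A minimal makefile sketch using Wen's flags (the myprog target is illustrative; the recipe line must start with a tab):

NETCDF_LDFLAGS = -L$(NETCDF)/lib -lnetcdff -lnetcdf \
                 -L$(HDF5_LIBRARIES) -lhdf5_hl -lhdf5 $(ZLIB_LIB)

myprog: myprog.o
	$(FC) -o myprog myprog.o $(NETCDF_LDFLAGS)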

KateFriedman-NOAA commented 2 years ago

You might try: NETCDF_LDFLAGS = -L$(NETCDF)/lib -lnetcdff -lnetcdf -L$(HDF5_LIBRARIES) -lhdf5_hl -lhdf5 $(ZLIB_LIB)

I second what @WenMeng-NOAA said; I had to add something similar for a few g-w codes on Hera/Orion.

GeorgeVandenberghe-NOAA commented 2 years ago

Okay, checking this on Jet. How do I switch out the compiler in these modules?

module use /lfs4/HFIP/hfv3gfs/nwprod/hpc-stack/libs/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/18.0.5.274
module load hpc-impi/2018.4.274

module load jasper/2.0.22
module load zlib/1.2.11
module load png/1.6.35

module load hdf5/1.10.6
module load netcdf/4.7.4

module load bacio/2.4.1
module load crtm/2.3.0
module load g2/3.4.1
module load g2tmpl/1.10.0
module load ip/3.3.3
module load nemsio/2.5.2
module load sfcio/1.4.1
module load sigio/2.3.2
module load sp/2.3.3
module load w3nco/2.4.1
module load w3emc/2.7.3
module load wrf_io/1.1.1

I don't have a 2022-level stack, but the compiler is the only thing that should need changing: libraries compiled with an older compiler level are upward compatible with newer compiler levels (the reverse is not true; libraries built with newer compilers generally break older compilers).

If this is no longer possible and we have to debug the entire stack as a single interconnected logical object, this is going to take a lot longer to run down.

GeorgeVandenberghe-NOAA commented 2 years ago

My above comment applies to Jet.

WenMeng-NOAA commented 2 years ago

@kgerheiser The wgrib2/2.0.7 issue was solved on Hera but not on Orion. I use intel/2018 on Orion.

kgerheiser commented 2 years ago

@WenMeng-NOAA the wgrib2 fix has been applied on Orion.

@GeorgeVandenberghe-NOAA what are you trying to test on Jet? After you load the hpc-stack modules you can load intel/2022.1.2, and I think that should do what you want. You'll still be able to use intel/2018 modules but also be able to use Intel 2022 yourself.

WenMeng-NOAA commented 2 years ago

Here is my UPP testing status:

Hera: working with both intel/2022 and intel/18.0.5.274
Orion: working with intel/2018.4

kgerheiser commented 2 years ago

@WenMeng-NOAA is it easy to reproduce the error you're seeing? I would like to try running it. I've also put in a ticket with the Orion admins.

GeorgeVandenberghe-NOAA commented 2 years ago

I tried switching the compiler and impi levels to the 2022 versions on Jet and rebuilt UPP. It ran fine with both the new impi and the new intel compiler in the runtime environment. The old impi was in the build environment, but I think those calls are shared and the runtime environment is what's relevant. The point is: IT DOES NOT LOOK LIKE THE ORION FAILURES with the 22 stack are occurring with impi/22 or intel/22 on Jet, so it isn't a fundamental MPI issue and my early fears were unwarranted.

WenMeng-NOAA commented 2 years ago

@WenMeng-NOAA is it easy to reproduce the error you're seeing? I would like to try running it. I've also put in a ticket to Orion admins.

@kgerheiser The failure is reproducible. My UPP version is at /home/wmeng/ovp/ncep_post/post_gfsv16_hpc/UPP on Orion. The runtime log is /home/wmeng/ovp/ncep_post/post_gfsv16_hpc/UPP/out.post.fv3gfs. The job card is /home/wmeng/ovp/ncep_post/post_gfsv16_hpc/UPP/run_post_fv3gfsv16_ORION.sh.

kgerheiser commented 2 years ago

The failure is coming from a call to MPI_INIT:

https://github.com/NOAA-EMC/UPP/blob/30fcea8fcb8e753ad0c251ff21a7d6b353629686/sorc/ncep_post.fd/SETUP_SERVERS.f#L80

libpthread-2.17.s  00007DD96840B5D0  Unknown               Unknown  Unknown
libc-2.17.so       00007DD9679422C7  gsignal               Unknown  Unknown
libc-2.17.so       00007DD9679439B8  abort                 Unknown  Unknown
libucs.so.0.0.0    00007DD8C2649A25  ucs_fatal_error_m     Unknown  Unknown
libucs.so.0.0.0    00007DD8C2649BC5  Unknown               Unknown  Unknown
libucs.so.0.0.0    00007DD8C2641257  Unknown               Unknown  Unknown
libucs.so.0.0.0    00007DD8C264132A  Unknown               Unknown  Unknown
libucs.so.0.0.0    00007DD8C263F9CA  ucs_async_add_tim     Unknown  Unknown
libuct_ib.so.0.0.  00007DD8C21CC813  uct_ud_iface_comp     Unknown  Unknown
libuct_ib.so.0.0.  00007DD8C21CFE1C  Unknown               Unknown  Unknown
libuct_ib.so.0.0.  00007DD8C21D0025  Unknown               Unknown  Unknown
libuct.so.0.0.0    00007DD8C296B8CD  uct_iface_open        Unknown  Unknown
libucp.so.0.0.0    00007DD8C2DB68A0  ucp_worker_iface_     Unknown  Unknown
libucp.so.0.0.0    00007DD8C2DB6B5F  Unknown               Unknown  Unknown
libucp.so.0.0.0    00007DD8C2DB7E3A  ucp_worker_create     Unknown  Unknown
libmlx-fi.so       00007DD8C2FF452F  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007DD9690DFCA1  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007DD9690E64B0  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007DD968C349DD  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007DD968F4E1A3  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007DD968F4D71B  MPI_Init              Unknown  Unknown
libmpifort.so.12.  00007DD96A34585B  PMPI_INIT             Unknown  Unknown
ncep_post          0000000000707CF0  setup_servers_             79  SETUP_SERVERS.f
ncep_post          00000000007AB773  MAIN__                    187  WRFPOST.f
ncep_post          000000000040B762  Unknown               Unknown  Unknown
libc-2.17.so       00007DD96792E495  __libc_start_main     Unknown  Unknown
ncep_post          000000000040B669  Unknown               Unknown  Unknown
forrtl: error (76): Abort trap signal

kgerheiser commented 2 years ago

@WenMeng-NOAA the Orion sysadmins found a typo in their Intel/2022.1.2 modulefile. Could you give it a rebuild and try again?

WenMeng-NOAA commented 2 years ago

@kgerheiser With rebuilt UPP executable, I still got failure at runtime. Here are runtime loaded modules:

module purge
module use /apps/contrib/NCEP/libs/hpc-stack-gfsv16/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
module load netcdf/4.7.4
module load hdf5/1.10.6
module load prod_util/1.2.2
module load grib_util/1.2.3
module load crtm/2.3.0
module load wgrib2/2.0.7
module list

Please let me know if there is anything else I should try.

GeorgeVandenberghe-NOAA commented 2 years ago

Try compiling and running the code below with the post build environment:

      program simple
      use mpi
      integer*8 size,s1,s2
      real, allocatable ::  a(:,:),b(:,:)
      call mpi_init(ier)
      call mpi_comm_size(mpi_comm_world,nsize,ier)
      call mpi_comm_rank(mpi_comm_world,nrank,ier)
      mbyte=262144
      mco=500
      allocate(a(mbyte,mco))
      allocate(b(mbyte,mco))
      s1=size(a)/262144.
      s2=size(b)/262144.
      print 1000,' allocated two arrays of size',s1,s2
 1000 format(a30,2i17)
      a=5
      b=5
      call sub(a)
      call sub(b)
      print *, ' MPI SIZE AND RANK',nsize,nrank
      call mpi_finalize(ier)
      stop
      end
      subroutine sub(a)
      return
      end
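
A sketch of building and running it under the same stack (assumes the Intel MPI compiler wrapper from the loaded hpc-impi module; the file name simple.f is illustrative):

mpiifort simple.f -o test
srun -A nems -N 8 --ntasks-per-node=12 ./test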

kgerheiser commented 2 years ago

@GeorgeVandenberghe-NOAA the test program ran successfully.

srun -A nems -N 8 --ntasks-per-node=12 ./test

kgerheiser commented 2 years ago

@WenMeng-NOAA how do I compile the post executable? There are a bunch of different makefiles. I was able to run your test, but I would like to try adding some debug flags.

WenMeng-NOAA commented 2 years ago

@kgerheiser You run the build script under sorc/build_ncep_post.sh. The makefile for GFSV16 on Orion is ncep_post.fd/makefile_module_hpc.
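
In other words, something like the following (a hypothetical invocation based on Wen's pointers, run from the UPP checkout with the stack modules loaded):

cd sorc
./build_ncep_post.sh   # GFSv16 build on Orion goes through ncep_post.fd/makefile_module_hpc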

kgerheiser commented 2 years ago

No luck with debug flags. I'm going to re-build everything from the ground up (in a personal location) to see if that helps.

kgerheiser commented 2 years ago

Received some more feedback from Orion admins and they found another bug in their module. Going to test against a new build. This bug should be reproducible. It fails just several lines into the program and does nothing but call MPI_Init().

GeorgeVandenberghe-NOAA commented 2 years ago

The following post build works fine on Orion, and the built executable runs to completion when the module swapping shown at the end (the last four module lines below) is also done in the run script. So I haven't been able to replicate the error on Orion.

%Module
# Wen Meng 01/2021, Set up config. with the hpc-stack NCEPLIBS.
##############################################################################

proc ModulesHelp { } {
  puts stderr "Loads modules required for building upp"
}
module-whatis "Loads UPP prerequisites on Orion"

module load cmake/3.17.3

module use /apps/contrib/NCEP/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/2018.4
module load hpc-impi/2018.4

module load jasper/2.0.22
module load zlib/1.2.11
module load png/1.6.35

module load hdf5/1.10.6
module load netcdf/4.7.4

module load bacio/2.4.1
module load crtm/2.3.0
module load g2/3.4.1
module load g2tmpl/1.10.0
module load ip/3.3.3
module load nemsio/2.5.2
module load sfcio/1.4.1
module load sigio/2.3.2
module load sp/2.3.3
module load w3nco/2.4.1
module load w3emc/2.7.3
module load wrf_io/1.1.1

module unload intel
module load intel/2022.1.2
module unload impi
module load impi/2022.1.2

kgerheiser commented 2 years ago

@GeorgeVandenberghe-NOAA I was using the repo @WenMeng-NOAA pointed me to (/work/noaa/stmp/gkyle/post_gfsv16_hpc), and using the run script /work/noaa/stmp/gkyle/post_gfsv16_hpc/UPP/run_post_fv3gfsv16_ORION.sh. I am not familiar with how post works, but I was able to get the error when using that run script.

WenMeng-NOAA commented 2 years ago

@kgerheiser and @GeorgeVandenberghe-NOAA From my testing of GFS V16.2 on Orion, UPP is working with intel/2018 but not intel/2022.1.2. I also tested on Hera, where UPP is working with both intel/2018 and intel/2022.1.2.

kgerheiser commented 2 years ago

@WenMeng-NOAA is it possible to distill the run script to just the part where it runs the executable? Ideally, I could just run srun ncep_post from a directory that contains the necessary input.

kgerheiser commented 2 years ago

I ran the executable directly (srun -A nems -N 6 --ntasks-per-node=12 ./ncep_post) and it gets past the MPI_init() phase fine, but when run with sbatch run_post_fv3gfsv16_ORION.sh the error occurs. I think it's something in the run script causing this.
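
One way to isolate that (a debugging sketch, not something from the thread): capture the environment in both contexts and diff it, since a batch script can change modules or UCX/OFI settings before MPI_Init runs:

env | sort > env.interactive
sbatch -A nems -N 1 --wrap 'env | sort > env.batch'
# after the batch job completes:
diff env.interactive env.batch | grep -iE 'ucx|fi_|impi|i_mpi|ld_library_path'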

GeorgeVandenberghe-NOAA commented 2 years ago

A run directory is on /work/noaa/noaatest/gwv/post/da. One just has to cd to the directory and execute the post.

#!/bin/bash
#SBATCH -J GFSV16
#SBATCH -A noaatest
#SBATCH -o test.out
#SBATCH -e test.out
#SBATCH --account=noaatest
#SBATCH --nodes=3
#SBATCH --tasks=20
#SBATCH --cpus-per-task=1
#SBATCH -p orion
#SBATCH -t 0:20:00

module load slurm

module unload intel
module load intel/2022.1.2
module unload impi
module load impi/2022.1.2

export OMP_NUM_THREADS=1
export KMP_STACKSIZE=1024m

# Modify below to specify where your post executable is. The job is intended
# for the base post that you will compare your fork or branch post against.
export POSTEXEC=/mnt/lfs4/HFIP/hfv3gfs/gwv/post/base/exec/upp.x
export POSTEXEC=../../base/exec/upp.x
export POSTEXEC=./upp.x

srun -n 60 $POSTEXEC >o 2>e

kgerheiser commented 2 years ago

@GeorgeVandenberghe-NOAA the directory is unreadable for me. Could you adjust the permissions?

ls: cannot open directory /work/noaa/noaatest/gwv/post/: Permission denied