KateFriedman-NOAA opened this issue 2 years ago
@GeorgeGayno-NOAA Please use the netcdf/4.7.4 version and not the parallel version anymore. And yes, please use the hpc-stack version. For supporting GFSv16.2.0 on Hera/Orion we will all be using the special hpc-stack-gfsv16 installs that Hang and Kyle are installing for us now. WCOSS2 ops will remain as is. GFSv17 will use the newer, non-special hpc-stack installs. Thanks for checking!
The UPP executable builds successfully with hpc-intel/2022.1.2 and hpc-impi/2022.1.2 on Hera.
@Hang-Lei-NOAA I see wrf_io/1.2.0 but we're setting export wrf_io_ver=1.1.1 elsewhere (available on WCOSS2 and Orion currently). Can we get wrf_io/1.1.1 in the Hera hpc-stack-gfsv16 stack, or do you know if wrf_io/1.2.0 will work similarly? All of the other needed module versions that global-workflow loads appear to be available. Thanks!
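(For reference, one way to check what a stack exposes is module avail after loading the compiler/MPI metamodules; the modulefiles/stack path for the Hera hpc-stack-gfsv16 install shown here is an assumption patterned on the Orion install mentioned later in this thread.)
module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack-gfsv16/modulefiles/stack
module load hpc/1.2.0 hpc-intel/2022.1.2 hpc-impi/2022.1.2
module avail wrf_io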
@KateFriedman-NOAA I would expect wrf_io v1.2.0 to work.
v1.1.1 was updated to v1.2.0 because the GSI build wasn't working with v1.1.1. We fixed that, and released 1.2.0 which is compatible with both UPP (@GeorgeVandenberghe-NOAA ran some tests) and GSI.
I have also finished compiling on Orion; the stack is available via module use /apps/contrib/NCEP/libs/hpc-stack-gfsv16/modulefiles/stack. In there you will find modules for hpc-intel/2022.1.2 and hpc-intel/2018.4. You can test and compare as you wish, and depending on which version we go with I'll remove the unused one.
I could add the requested intel version for the GFSv16.2 project. Is it intel 18.0.5.274?
@KateFriedman-NOAA I have added an installation based on intel 18.0.5.274 under the same location: /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack-gfsv16/. You will see both intel-18.0.5.274 and intel-2022.1.2. Please decide which works best for you.
In addition, I will finish updating the official installation at /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack. Everything in hpc-stack-gfsv16 will also be included in the official installation, but the official installation will include more versions, both old and new.
Btw, if anyone is using crtm/2.4.0, we are going to do a correction for some fix files. This has been examined in the past few days; we just finished the test with GSI on this part. The installation on Hera has been updated this afternoon, and the others will be updated soon.
UPP still uses crtm/2.3.0 for GFS V16.2.
if anyone is using crtm/2.4.0, we are going to do a correction for some fix files
Noted, thanks! We're still going to be using crtm/2.3.0 for the GFSv16.2.0 system but it's good to have the newer one for GFSv17 work.
I would expect wrf_io v1.2.0 to work.
Ok good, thanks @kgerheiser !
I have added an installation based on intel 18.0.5.274 under the same location
Thanks @Hang-Lei-NOAA, appreciate you installing that on Hera! Since this install is for supporting v16.2.0 and should match the ops version, I prefer to stick with the 2018 install. I will test the 2018 install for the g-w builds and report any issues.
I have also finished compiling on Orion at module use /apps/contrib/NCEP/libs/hpc-stack-gfsv16/modulefiles/stack. In there you will find modules for hpc-intel/2022.1.2 and hpc-intel/2018.4. You can test and compare as you wish, and depending on which version we go with I'll remove the unused one.
Thanks @kgerheiser ! Appreciate you installing the 2022 version in case we needed it.
Given my comment above I prefer to stick with the 2018 install on Orion. @junwang-noaa @MichaelLueken-NOAA @WenMeng-NOAA @GeorgeGayno-NOAA @HelinWei-NOAA @YaliMao-NOAA please let me know of any objections to this decision to stick with the 2018 install in the hpc-stack-gfsv16 copies on both Hera and Orion. Please also test the 2018 install on Hera and let Hang know of any issues. I'll be doing the same for the workflow side and will be ready to start testing your updated v16.2.0 components on Hera/Orion/WCOSS2 when you're ready. Friendly reminder that we need to wrap up support work for v16.2.0 on Hera/Orion by the end of March (when the older pre-hpc-stack library installs get removed). Thanks!
@Hang-Lei-NOAA @kgerheiser We'll of course want to move to intel-2022.1.2 for the develop branch and the GFSv17 components. That will be available via the usual official hpc-stack installs, right? Double checking. Thanks!
Alrighty, just spoke with @arunchawla-NOAA offline and am pulling back on my request to stick with the 2018 versions.
@junwang-noaa @MichaelLueken-NOAA @WenMeng-NOAA @GeorgeGayno-NOAA @HelinWei-NOAA @YaliMao-NOAA Let's all try to build our respective v16.2.0 components on Hera/Orion using the newer 2022 intel version that @Hang-Lei-NOAA and @kgerheiser installed for us and see if we run into any issues. @Hang-Lei-NOAA @kgerheiser Please keep the 2018 version for now in case we find a need for it in supporting GFSv16.2.0 on Hera/Orion. Apologies for the back and forth on this! Thanks all!
@KateFriedman-NOAA no worries. We are here to support what you guys need.
@KateFriedman-NOAA @Hang-Lei-NOAA @kgerheiser I have been testing UPP on Hera and see the following error from wgrib2/2.0.7:
0.001 + wgrib2 tmpfile1_006_2 -set_grib_type same -new_grid_winds earth -new_grid_interpolation bilinear -if ':(CSNOW|CRAIN|CFRZR|CICEP|ICSEV):' -new_grid_interpolation neighbor -fi -set_bitmap 1 -set_grib_max_bits 16 -if ':(APCP|ACPCP|PRATE|CPRAT):' -set_grib_max_bits 25 -fi -if ':(APCP|ACPCP|PRATE|CPRAT|DZDT):' -new_grid_interpolation budget -fi -new_grid latlon 0:1440:0.25 90:721:-0.25 pgb2file_006_2_0p25 -new_grid latlon 0:360:1.0 90:181:-1.0 pgb2file_006_2_1p0 -new_grid latlon 0:720:0.5 90:361:-0.5 pgb2file_006_2_0p5
**IPOLATES package is not installed**
0.010 + err=8
0.010 + export err
Can you take a look at it?
@WenMeng-NOAA that's a configuration issue on our end. I had set wgrib2 to build with ip2 (not ip).
@KateFriedman-NOAA @kgerheiser @Hang-Lei-NOAA I have been testing UPP on Orion and have a runtime failure:
1.030 + srun /home/wmeng/ovp/ncep_post/post_gfsv16_hpc/UPP/exec/ncep_post
1.032 + 0< itag 1> outpost_gfs_2021082406_postcntrl_gfs.xml
[Orion-11-72:331100:0:331100] thread.c:225 Assertion `ucs_async_thread_global_context.thread != NULL' failed
[Orion-11-72:331103:0:331103] thread.c:225 Assertion `ucs_async_thread_global_context.thread != NULL' failed
[Orion-11-72:331104:0:331104] thread.c:225 Assertion `ucs_async_thread_global_context.thread != NULL' failed
[Orion-11-72:331107:0:331107] thread.c:225 Assertion `ucs_async_thread_global_context.thread != NULL' failed
==== backtrace (tid: 331101) ====
0 0x000000000004aa20 ucs_fatal_error_message() ???:0
1 0x000000000004abc5 ucs_fatal_error_format() ???:0
2 0x0000000000042257 ucs_async_pipe_drain() ???:0
3 0x000000000004232a ucs_async_pipe_drain() ???:0
4 0x00000000000409ca ucs_async_add_timer() ???:0
5 0x000000000003b813 uct_ud_iface_complete_init() ???:0
6 0x000000000003ee1c uct_ud_verbs_ep_t_delete() ???:0
7 0x000000000003f025 uct_ud_verbs_ep_t_delete() ???:0
8 0x000000000000d8cd uct_iface_open() ???:0
9 0x000000000001c8a0 ucp_worker_iface_open() ???:0
10 0x000000000001cb5f ucp_worker_iface_init() ???:0
11 0x000000000001de3a ucp_worker_create() ???:0
12 0x000000000000a52f mlx_ep_open() osd.c:0
13 0x00000000006bbca1 fi_endpoint() /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_endpoint.h:164
14 0x00000000006bbca1 create_endpoint() /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:2593
15 0x00000000006c24b0 MPIDI_OFI_mpi_init_hook() /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:1898
16 0x00000000002109dd MPID_Init() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1307
The loaded modules for runtime are:
module purge
module use /apps/contrib/NCEP/libs/hpc-stack-gfsv16/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
module load netcdf/4.7.4
module load hdf5/1.10.6
module load prod_util/1.2.2
module load grib_util/1.2.3
module load crtm/2.3.0
module load wgrib2/2.0.7
Please advise on a fix.
That's an ugly bug. Looks like it's coming from MPI. Is that with 2018 or 2022? I see it coming from 2022. Could I ask you to try the opposite compiler with module swap hpc-intel/2022.1.2 hpc-intel/2018.4?
@kgerheiser Would I swap hpc-intel from 2022.1.2 to 2018 for both compiling and runtime?
Yes. You can just load the hpc-intel/2018.4 compiler as you normally would. Everything besides MPI is the same underneath.
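As a sketch, the swap for both the build and run environments would look like the following; the presence of an hpc-impi/2018.4 module in the hpc-stack-gfsv16 stack is an assumption based on the official Orion stack shown later in this thread:
module use /apps/contrib/NCEP/libs/hpc-stack-gfsv16/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2018.4
module load hpc-impi/2018.4
followed by the remaining library modules (netcdf/4.7.4, hdf5/1.10.6, crtm/2.3.0, wgrib2/2.0.7, etc.) as before.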
@KateFriedman-NOAA GLDAS was built successfully on Hera/Orion using the newer 2022 intel version.
Alrighty, just spoke with @arunchawla-NOAA offline and am pulling back on my request to stick with the 2018 versions.
@junwang-noaa @MichaelLueken-NOAA @WenMeng-NOAA @GeorgeGayno-NOAA @HelinWei-NOAA @YaliMao-NOAA Let's all try to build our respective v16.2.0 components on Hera/Orion using the newer 2022 intel version that @Hang-Lei-NOAA and @kgerheiser installed for us and see if we run into any issues. @Hang-Lei-NOAA @kgerheiser Please keep the 2018 version for now in case we find a need for it in supporting GFSv16.2.0 on Hera/Orion. Apologies for the back and forth on this! Thanks all!
@kgerheiser UPP is working with intel/2018.4 on Orion. But I saw the same IPOLATES package issue for wgrib2/2.0.7.
@Hang-Lei-NOAA @kgerheiser I am able to successfully build the GFSv16.2 global-workflow-owned codes on both Hera and Orion using the new hpc-stack-gfsv16 2022 intel stack install (hpc/1.2.0, hpc-intel/2022.1.2, hpc-impi/2022.1.2). These execs:
enkf_chgres_recenter.x
enkf_chgres_recenter_nc.x
fbwndgfs
fv3nc2nemsio.x
gaussian_sfcanl.exe
gfs_bufr
regrid_nemsio
supvit
syndat_qctropcy
syndat_maksynrc
syndat_getjtbul
tave.x
tocsbufr
vint.x
I'm not using wgrib2/2.0.7 in any of those builds though so I can't comment on @WenMeng-NOAA's issue.
Note, if one of the GFS components needs to use the 2018 intel then all of them have to as well. Whatever I set from the workflow level (via versions/build.ver and versions/$target.ver) overrides any defaults in the components (those that support standalone builds) and forces the components to build with the same versions. This was one of the WCOSS2 port requirements from NCO, so we have the same versions from the top down in the application. The same is enforced at runtime via versions/run.ver and the $target.ver files that set machine-specific versions for the whole system (e.g. orion.ver).
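For illustration only, these version files are shell fragments of export statements that get sourced at build and run time; a minimal sketch with the versions discussed in this thread (wrf_io_ver appears earlier in the thread, while the other variable names follow the same pattern and are assumptions, not the actual file contents):
export hpc_ver=1.2.0
export hpc_intel_ver=2022.1.2
export hpc_impi_ver=2022.1.2
export netcdf_ver=4.7.4
export hdf5_ver=1.10.6
export wrf_io_ver=1.1.1
export crtm_ver=2.3.0
export wgrib2_ver=2.0.7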
The wgrib2 problem is fixed with #404, but it's concerning that there's an MPI failure with Intel 2022. @WenMeng-NOAA, could I ask you to try the same test on Hera with both versions of Intel?
wgrib2/2.0.7 has been updated on Hera. Should be fine now.
@kgerheiser and @Hang-Lei-NOAA I am trying to compile a program using the stack version of netcdf (4.7.4). At the link step, there are numerous undefined references. This happens on both Orion and Hera. Here is an example:
/apps/contrib/NCEP/libs/hpc-stack-gfsv16/intel-2018.4/impi-2018.4/netcdf/4.7.4/lib/libnetcdff.a(nf_attio.o): In function `nf_put_att_text_':
nf_attio.F90:(.text+0xae): undefined reference to `nc_put_att_text'
/apps/contrib/NCEP/libs/hpc-stack-gfsv16/intel-2018.4/impi-2018.4/netcdf/4.7.4/lib/libnetcdff.a(nf_attio.o): In function `nf_put_att_text_a_':
nf_attio.F90:(.text+0x17e): undefined reference to `nc_put_att_text'
/apps/contrib/NCEP/libs/hpc-stack-gfsv16/intel-2018.4/impi-2018.4/netcdf/4.7.4/lib/libnetcdff.a(nf_attio.o): In function `nf_put_att_int1_':
nf_attio.F90:(.text+0x254): undefined reference to `nc_put_att_schar'
/apps/contrib/NCEP/libs/hpc-stack-gfsv16/intel-2018.4/impi-2018.4/netcdf/4.7.4/lib/libnetcdff.a(nf_attio.o): In function `nf_put_att_int2_':
nf_attio.F90:(.text+0x324): undefined reference to `nc_put_att_short'
@GeorgeGayno-NOAA You might try: NETCDF_LDFLAGS = -L$(NETCDF)/lib -lnetcdff -lnetcdf -L$(HDF5_LIBRARIES) -lhdf5_hl -lhdf5 $(ZLIB_LIB)
@WenMeng-NOAA's solution looks correct. @GeorgeGayno-NOAA, I'm guessing it's the link order.
You might try: NETCDF_LDFLAGS = -L$(NETCDF)/lib -lnetcdff -lnetcdf -L$(HDF5_LIBRARIES) -lhdf5_hl -lhdf5 $(ZLIB_LIB)
I second what @WenMeng-NOAA said; I had to add something similar for a few g-w codes on Hera/Orion.
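As a sketch, a link line with the libraries in that order looks like the following; the NETCDF and HDF5_ROOT environment variables are assumed to be set by the loaded netcdf and hdf5 modules, and the object/program names are placeholders:
ifort prog.o -L${NETCDF}/lib -lnetcdff -lnetcdf -L${HDF5_ROOT}/lib -lhdf5_hl -lhdf5 -lz -o prog
The key point is that the Fortran netCDF library comes before the C library, with HDF5 (high-level then core) and zlib after both.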
Okay, checking this on Jet; how do I switch out the compiler in these modules?
module use /lfs4/HFIP/hfv3gfs/nwprod/hpc-stack/libs/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/18.0.5.274
module load hpc-impi/2018.4.274
module load jasper/2.0.22
module load zlib/1.2.11
module load png/1.6.35
module load hdf5/1.10.6
module load netcdf/4.7.4
module load bacio/2.4.1
module load crtm/2.3.0
module load g2/3.4.1
module load g2tmpl/1.10.0
module load ip/3.3.3
module load nemsio/2.5.2
module load sfcio/1.4.1
module load sigio/2.3.2
module load sp/2.3.3
module load w3nco/2.4.1
module load w3emc/2.7.3
module load wrf_io/1.1.1
I don't have a 2022-level stack, but the compiler is the only thing that should need changing; libraries compiled with an older compiler level are upward compatible with newer compiler levels (the reverse is not true: newer libraries generally break older compilers).
If this is no longer possible and we have to debug the entire stack as a single interconnected, interwoven logical object, this is going to take a lot longer to run down.
My above comment applies to Jet.
@kgerheiser The wgrib2/2.0.7 issue was solved on Hera but not on Orion. I use intel/2018 on Orion.
@WenMeng-NOAA the wgrib2 fix has been applied on Orion.
@GeorgeVandenberghe-NOAA what are you trying to test on Jet? After you load the hpc-stack modules you can load intel/2022.1.2, and I think that should do what you want. You'll still be able to use intel/2018 modules but also be able to use Intel 2022 yourself.
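On Jet that could look something like the sketch below; the existence and names of the intel/2022.1.2 and impi/2022.1.2 system modules there are assumptions (George reports below that switching the compiler and impi levels to 2022 worked for his UPP test):
module use /lfs4/HFIP/hfv3gfs/nwprod/hpc-stack/libs/modulefiles/stack
module load hpc/1.1.0 hpc-intel/18.0.5.274 hpc-impi/2018.4.274
(library modules as listed above, then)
module load intel/2022.1.2
module load impi/2022.1.2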
Here is my UPP testing status:
Hera: working with both intel/2022 and intel/18.0.5.274
Orion: working with intel/2018.4
@WenMeng-NOAA is it easy to reproduce the error you're seeing? I would like to try running it. I've also put in a ticket to Orion admins.
I tried switching the compiler and impi levels to the 2022 versions on Jet and rebuilt UPP. It ran fine with both the new impi and the new intel compiler in the runtime environment. The old impi was in the build environment, but I think those calls are shared and the runtime environment is what's relevant. The point is that IT DOES NOT LOOK LIKE the Orion failures with the 2022 stack are occurring with impi/22 or intel/22, so it isn't a fundamental MPI issue and my early fears were unwarranted.
@WenMeng-NOAA is it easy to reproduce the error you're seeing? I would like to try running it. I've also put in a ticket to Orion admins.
@kgerheiser The failure is reproducible. My UPP version is at /home/wmeng/ovp/ncep_post/post_gfsv16_hpc/UPP on Orion. The runtime log is /home/wmeng/ovp/ncep_post/post_gfsv16_hpc/UPP/out.post.fv3gfs. The job card is /home/wmeng/ovp/ncep_post/post_gfsv16_hpc/UPP/run_post_fv3gfsv16_ORION.sh.
The failure is coming from a call to MPI_INIT
libpthread-2.17.s 00007DD96840B5D0 Unknown Unknown Unknown
libc-2.17.so 00007DD9679422C7 gsignal Unknown Unknown
libc-2.17.so 00007DD9679439B8 abort Unknown Unknown
libucs.so.0.0.0 00007DD8C2649A25 ucs_fatal_error_m Unknown Unknown
libucs.so.0.0.0 00007DD8C2649BC5 Unknown Unknown Unknown
libucs.so.0.0.0 00007DD8C2641257 Unknown Unknown Unknown
libucs.so.0.0.0 00007DD8C264132A Unknown Unknown Unknown
libucs.so.0.0.0 00007DD8C263F9CA ucs_async_add_tim Unknown Unknown
libuct_ib.so.0.0. 00007DD8C21CC813 uct_ud_iface_comp Unknown Unknown
libuct_ib.so.0.0. 00007DD8C21CFE1C Unknown Unknown Unknown
libuct_ib.so.0.0. 00007DD8C21D0025 Unknown Unknown Unknown
libuct.so.0.0.0 00007DD8C296B8CD uct_iface_open Unknown Unknown
libucp.so.0.0.0 00007DD8C2DB68A0 ucp_worker_iface_ Unknown Unknown
libucp.so.0.0.0 00007DD8C2DB6B5F Unknown Unknown Unknown
libucp.so.0.0.0 00007DD8C2DB7E3A ucp_worker_create Unknown Unknown
libmlx-fi.so 00007DD8C2FF452F Unknown Unknown Unknown
libmpi.so.12.0.0 00007DD9690DFCA1 Unknown Unknown Unknown
libmpi.so.12.0.0 00007DD9690E64B0 Unknown Unknown Unknown
libmpi.so.12.0.0 00007DD968C349DD Unknown Unknown Unknown
libmpi.so.12.0.0 00007DD968F4E1A3 Unknown Unknown Unknown
libmpi.so.12.0.0 00007DD968F4D71B MPI_Init Unknown Unknown
libmpifort.so.12. 00007DD96A34585B PMPI_INIT Unknown Unknown
ncep_post 0000000000707CF0 setup_servers_ 79 SETUP_SERVERS.f
ncep_post 00000000007AB773 MAIN__ 187 WRFPOST.f
ncep_post 000000000040B762 Unknown Unknown Unknown
libc-2.17.so 00007DD96792E495 __libc_start_main Unknown Unknown
ncep_post 000000000040B669 Unknown Unknown Unknown
forrtl: error (76): Abort trap signal
@WenMeng-NOAA the Orion sysadmins found a typo in their Intel/2022.1.2 modulefile. Could you give it a rebuild and try again?
@kgerheiser With the rebuilt UPP executable, I still get a failure at runtime. Here are the runtime loaded modules:
module purge
module use /apps/contrib/NCEP/libs/hpc-stack-gfsv16/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
module load netcdf/4.7.4
module load hdf5/1.10.6
module load prod_util/1.2.2
module load grib_util/1.2.3
module load crtm/2.3.0
module load wgrib2/2.0.7
module list
Please let me know anything I should try.
Try compiling and running the code below with the post build environment.
program simple
use mpi
integer*8 size,s1,s2
real, allocatable :: a(:,:),b(:,:)
call mpi_init(ier)
call mpi_comm_size(mpi_comm_world,nsize,ier)
call mpi_comm_rank(mpi_comm_world,nrank,ier)
mbyte=262144
mco=500
allocate(a(mbyte,mco))
allocate(b(mbyte,mco))
s1=size(a)/262144.
s2=size(b)/262144.
print 1000,' allocated two arrays of size',s1,s2
1000 format(a30,2i17)
a=5
b=5
call sub(a)
call sub(b)
print *, ' MPI SIZE AND RANK',nsize,nrank
call mpi_finalize(ier)
stop
end
subroutine sub(a)
return
end
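A sketch of compiling and launching it under the post build environment; the mpiifort wrapper and the source file name are assumptions, while the srun invocation matches the one reported below:
mpiifort -o test simple.f90
srun -A nems -N 8 --ntasks-per-node=12 ./test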
@GeorgeVandenberghe-NOAA the test program ran successfully.
srun -A nems -N 8 --ntasks-per-node=12 ./test
@WenMeng-NOAA how do I compile the post executable? There are a bunch of different makefiles. I was able to run your test, but I would like to try adding some debug flags.
@kgerheiser You run the build script under sorc/build_ncep_post.sh. The makefile for GFSv16 on Orion is ncep_post.fd/makefile_module_hpc.
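For reference, a minimal sketch of that build (run from the UPP clone; whether debug flags can be passed through the script or need to be edited into ncep_post.fd/makefile_module_hpc directly is not confirmed in this thread):
cd UPP/sorc
./build_ncep_post.sh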
No luck with debug flags. I'm going to re-build everything from the ground up (in a personal location) to see if that helps.
Received some more feedback from Orion admins and they found another bug in their module. Going to test against a new build. This bug should be reproducible. It fails just several lines into the program and does nothing but call MPI_Init().
The following post build works fine on Orion, and the built executable runs to completion when the module swapping shown below (the unload/load of the intel and impi 2022 modules) is also done in the runscript. So I haven't been able to replicate the error on Orion.
##############################################################################
proc ModulesHelp { } {
  puts stderr "Loads modules required for building upp"
}
module-whatis "Loads UPP prerequisites on Orion"
module load cmake/3.17.3
module use /apps/contrib/NCEP/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/2018.4
module load hpc-impi/2018.4
module load jasper/2.0.22
module load zlib/1.2.11
module load png/1.6.35
module load hdf5/1.10.6
module load netcdf/4.7.4
module load bacio/2.4.1
module load crtm/2.3.0
module load g2/3.4.1
module load g2tmpl/1.10.0
module load ip/3.3.3
module load nemsio/2.5.2
module load sfcio/1.4.1
module load sigio/2.3.2
module load sp/2.3.3
module load w3nco/2.4.1
module load w3emc/2.7.3
module load wrf_io/1.1.1
module unload intel
module load intel/2022.1.2
module unload impi
module load impi/2022.1.2
@GeorgeVandenberghe-NOAA I was using the repo @WenMeng-NOAA pointed me to (/work/noaa/stmp/gkyle/post_gfsv16_hpc), and using the run script /work/noaa/stmp/gkyle/post_gfsv16_hpc/UPP/run_post_fv3gfsv16_ORION.sh. I am not familiar with how post works, but I was able to get the error when using that run script.
@kgerheiser and @GeorgeVandenberghe-NOAA From my testing of GFS V16.2 on Orion, UPP is working with intel/2018 but not intel/2022.1.2. I also tested on Hera; UPP is working with both intel/2018 and intel/2022.1.2 there.
@WenMeng-NOAA is it possible to distill the run script to just the part where it runs the executable? Ideally, I could just run srun ncep_post from a directory that contains the necessary input.
I ran the executable directly (srun -A nems -N 6 --ntasks-per-node=12 ./ncep_post) and it gets past the MPI_init() phase fine, but when run with sbatch run_post_fv3gfsv16_ORION.sh the error occurs. I think it's something in the run script causing this.
A run directory is on /work/noaa/noaatest/gwv/post/da. One just has to cd to the directory and execute the post.
module load slurm
module unload intel
module load intel/2022.1.2
module unload impi
module load impi/2022.1.2
export OMP_NUM_THREADS=1
export KMP_STACKSIZE=1024m
intended
export POSTEXEC=/mnt/lfs4/HFIP/hfv3gfs/gwv/post/base/exec/upp.x
export POSTEXEC=../../base/exec/upp.x
export POSTEXEC=./upp.x
srun -n 60 $POSTEXEC >o 2>e
@GeorgeVandenberghe-NOAA the directory is unreadable for me. Could you adjust the permissions?
ls: cannot open directory /work/noaa/noaatest/gwv/post/: Permission denied
In order to support the new GFSv16.2.0 (WCOSS2 port version) on Hera and Orion we need the same library module versions available. Below I list the versions that are currently being used in the new operational GFSv16.2.0 and which ones are missing on Hera/Orion.
Which software (and version) in the stack would you like installed?
Hera & Orion:
gempak/7.14.1
cmake/3.20.2
Which machines would you like to have the software installed?
Hera, Orion
Additional context
Here are the build.ver module versions for GFSv16.2.0: https://github.com/NOAA-EMC/global-workflow/blob/feature/ops-wcoss2/versions/build.ver
Refs: https://github.com/NOAA-EMC/global-workflow/issues/639