NOAA-EMC / hpc-stack

Create a software stack for HPC's
GNU Lesser General Public License v2.1

Identify who maintains stack where and establish a process for updating the stacks #83

Closed aerorahul closed 3 years ago

aerorahul commented 3 years ago

Clearly identify who officially maintains the stack on each machine. There can be a back-up.

Establish a process for updating a stack and the versioning that goes with it:

And more.

edwardhartnett commented 3 years ago

Doesn't @GeorgeVandenberghe-NOAA usually do this?

GeorgeVandenberghe-NOAA commented 3 years ago

I am still without functional access due to the destruction of my GFE laptop on 11/12 by a forced upgrade. I expect a repair by COB Friday 11/20.

edwardhartnett commented 3 years ago

OK, great opportunity to identify some back-ups for @GeorgeVandenberghe-NOAA!

edwardhartnett commented 3 years ago

Can we start with an exhaustive list of machines we are responsible for installing hpc-stack on? @GeorgeVandenberghe-NOAA which machines would you install on?

kgerheiser commented 3 years ago

Hang and I usually install hpc-stack.

Orion, Hera, Jet, WCOSS-Dell

edwardhartnett commented 3 years ago

Just those 4 machines then? Where do you install it? That is, under what root directory?

kgerheiser commented 3 years ago

https://github.com/NOAA-EMC/hpc-stack/wiki/Official-Installations

kgerheiser commented 3 years ago

Hang and I have kind of been doing it ad hoc. I think I installed it on Hera and Jet, and he did WCOSS and Orion.

I think he and I should split up which machines we're responsible for and formally document that.

GeorgeVandenberghe-NOAA commented 3 years ago

Do we have a non-lmod capability so we can build it on gaea and wcossC?

I would also like it to be THE stack we use on weird new machines, like some Azure cluster of the near future.

edwardhartnett commented 3 years ago

The README would be a good place to document this. We already list authors and the code manager; we could add an "Installers" section.

kgerheiser commented 3 years ago

It can be built on systems without lmod, but then you don't have modules.

climbfuji commented 3 years ago

I am doing cheyenne with both gnu and intel, and this one will likely stay with me.

I am currently doing jet - I hope to get rid of this responsibility once it is a tier-1 platform for the ufs-weather-model. Arun created an issue in the ufs-weather-model GitHub repo to elevate jet to tier-1.

I am also doing gaea - OK to keep it as a tier-2 platform like cheyenne, or pass it on to EMC as a tier-1 platform.

kgerheiser commented 3 years ago

@climbfuji Hang or I can do Jet. We've been maintaining a build of hpc-stack on there.

climbfuji commented 3 years ago

@kgerheiser this would be great, and a necessary first step to make jet a tier-1 platform. @arunchawla-NOAA created an issue for this work here: https://github.com/ufs-community/ufs-weather-model/issues/271 - once you install the stack on jet, can you please let the ufs-weather-model code managers (@junwang-noaa, @DusanJovic-NOAA, myself) know so that we can update the modulefile?

Going forward, we should continue the discussion and work towards making jet a tier-1 platform in the ufs-weather-model issue 271.

GeorgeVandenberghe-NOAA commented 3 years ago

I agree, although Jet has the added issue that it is a heterogeneous platform with different node types and hardware. This makes resource specification in workflows, where a job can land on any jet, a nuisance-level problem. A module change at the admin level in March 2019 broke our workflow badly enough that we never really got it working again, but it's definitely doable and tractable. A stack that looks the same across all platforms will be a big advance. HPC-Stack does that. Admin modules don't. One of the big advantages of my ancient and obsolete tarball nceplibs distro was that module names were the same on all platforms.

climbfuji commented 3 years ago

Yes, hpc-stack does not use the nightmare flag -xHOST, which makes this possible.

The fact that jet has different node types and hardware is one reason why we need to make it a tier-1 platform - we need to make sure that our codes function in such an environment.

The ufs-weather-model currently works around the default AVX2 flags by compiling the model with multiple SIMD instruction sets on jet:

    elseif(SIMDMULTIARCH)
        set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -axSSE4.2,AVX,CORE-AVX2,CORE-AVX512 -qno-opt-dynamic-align")
        set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -axSSE4.2,AVX,CORE-AVX2,CORE-AVX512 -qno-opt-dynamic-align")

While this provides flexibility, it makes compiling a lot slower. We may consider other options such as only specifying -axSSE4.2,CORE-AVX2 or turning off SIMD instructions entirely on jet. TBD.
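To see which of these instruction sets a given jet node actually supports, one could inspect the CPU flags on that node. The following is a minimal sketch, assuming a Linux node that exposes `/proc/cpuinfo`; the function name `simd_features` is hypothetical, and the mapping from kernel flag names to the Intel `-ax` targets (`sse4_2` for SSE4.2, `avx2` for CORE-AVX2, `avx512f` for CORE-AVX512) is only approximate.

```shell
# Hypothetical helper (not part of hpc-stack or ufs-weather-model).
# simd_features [cpuinfo_file]
# Prints the SIMD-related CPU flags found in the first "flags" line, one per line.
# Defaults to /proc/cpuinfo, which is an assumption about the node's OS (Linux).
simd_features() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first "flags" line, split it into one token per line,
    # keep only the SIMD flags of interest, and de-duplicate.
    grep -m1 '^flags' "$cpuinfo" \
        | tr ' ' '\n' \
        | grep -E '^(sse4_2|avx|avx2|avx512f)$' \
        | sort -u
}
```

Run inside a batch job on each node type (for example xjet, kjet, tjet), this would show the lowest common instruction set, which may help decide whether a reduced `-ax` list is sufficient.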

The rt.sh scripts currently compile and run on xjet, but there is no reason to keep doing this. We could run some tests on xjet, some on kjet, some on whatever-jet. TBD.

GeorgeVandenberghe-NOAA commented 3 years ago

When I built my old portable tarball NCEPLIBS on jet, I did it in a batch job on tjet to use the lowest instruction set possible. There were numerous cases where mine worked and the admins' didn't.

edwardhartnett commented 3 years ago

Here's a summary from the comments above:

| Machine    | Programmer |
|------------|------------|
| Orion      | Kyle       |
| Hera       | Hang       |
| Jet        | Kyle       |
| WCOSS-Dell | Hang       |
| cheyenne   | Dom        |
| gaea       | Dom        |
| WCOSS-Cray | Hang       |

Is that all of them?

I would suggest that we mark the release, then everyone installs it and reports back either success or problems.

If there are problems, we hold the release, resolve the problems, and move the tag to the fixed release.

Once there are no problems and we are all happy with the release, we announce it and move on to planning the 1.2.0 release.
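For reference, moving a tag to a fixed commit could look roughly like the following. This is a minimal sketch, not the project's documented procedure: the helper name `move_tag`, the tag name `v1.1.0`, and the remote name `origin` are all assumptions.

```shell
# Hypothetical helper (not part of hpc-stack).
# move_tag <tag> <commit>
# Repoints an existing local tag at the given commit.
move_tag() {
    tag="$1"
    commit="$2"
    # -f overwrites the existing tag instead of failing because it already exists
    git tag -f "$tag" "$commit"
}

# Example (assumed names; run from inside a clone):
#   move_tag v1.1.0 HEAD
#   git push --force origin v1.1.0   # publish the moved tag
```

Anyone who already fetched the old tag would likely need `git fetch --tags --force` to pick up the moved one, which is one reason to hold the announcement until the tag is final.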

edwardhartnett commented 3 years ago

Elsewhere @mark-a-potts mentions a system called "acorn". Is that a NOAA system? Mark, do you want to try out our 1.1.0 release before we announce it? Or do you want to try building the develop branch?

Hang-Lei-NOAA commented 3 years ago

Acorn is the name of the WCOSS2 machine.

aerorahul commented 3 years ago

Let's leave wcoss2 (acorn) out of this release.

edwardhartnett commented 3 years ago

OK, I've added an issue for acorn and assigned it to the next release (1.2.0).

arunchawla-NOAA commented 3 years ago

Maybe we should create a milestone for 1.2.0 and identify issues to address for that? We need to add met plus libraries before we roll it out on WCOSS2. Has the met team created an issue for that?

edwardhartnett commented 3 years ago

@arunchawla-NOAA to add an issue to the next release, use the "Project" pull-down on the right side of the issue screen.

At each weekly meeting we will examine the issue list for the next release, and also place any new issues into a release. For release planning for the 1.2.0 release, see: https://github.com/NOAA-EMC/hpc-stack/projects/2

(New issues can also be added from this screen, or selected from the issue list and added to the release with the Add Cards button on upper right.)

There is as yet no issue for the met plus libraries, so I will add one now.

edwardhartnett commented 3 years ago

(@arunchawla-NOAA for release planning of the upcoming 1.1.0 release see https://github.com/NOAA-EMC/hpc-stack/projects/1).

junwang-noaa commented 3 years ago

May I ask who will maintain the hpc-stack-equivalent nceplibs module files on cray? If a library requires model code changes, that library needs to be installed on cray too.

kgerheiser commented 3 years ago

Hang will take care of WCOSS Cray.

DusanJovic-NOAA commented 3 years ago

Please reinstall hpc-stack on WCOSS2. It's broken after they renamed /lfs/h2 to /lfs/h1.

    $ module show hpc/1.0.0-beta1
    ------------------------------------------------------------------------------------------------------
      /lfs/h1/emc/nceplibs/noscrub/hpc-stack/test/noaa/modulefiles/stack/hpc/1.0.0-beta1.lua:
    ------------------------------------------------------------------------------------------------------
    help([[]])
    conflict("hpc")
    setenv("HPC_OPT","/lfs/h2/emc/nceplibs/noscrub/hpc-stack/test/noaa")
    prepend_path("MODULEPATH","/lfs/h2/emc/nceplibs/noscrub/hpc-stack/test/noaa/modulefiles/core")
    setenv("LMOD_EXACT_MATCH","no")
    setenv("LMOD_EXTENDED_DEFAULT","yes")
    whatis("Name: hpc")
    whatis("Version: 1.0.0-beta1")
    whatis("Category: Base")
    whatis("Description: Initialize HPC software stack")

MODULEPATH still points to /lfs/h2.
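A recursive search is one quick way to confirm which modulefiles still carry the old prefix after a filesystem rename. The sketch below is only an illustration: the function name `find_stale_modulefiles` is hypothetical, and the directory and prefix in the usage comment are taken from the `module show` output above.

```shell
# Hypothetical helper (not part of hpc-stack).
# find_stale_modulefiles <modulefiles_dir> <old_prefix>
# Prints the path of every file under <modulefiles_dir> that still contains <old_prefix>.
find_stale_modulefiles() {
    dir="$1"
    old_prefix="$2"
    # -r: recurse into subdirectories
    # -l: print only the names of matching files
    # -F: treat the prefix as a fixed string, not a regular expression
    grep -rlF -- "$old_prefix" "$dir"
}

# Example (paths as they appear in the output above):
#   find_stale_modulefiles /lfs/h1/emc/nceplibs/noscrub/hpc-stack/test/noaa/modulefiles /lfs/h2/
```

Since the install prefix is written into the generated modulefiles, a reinstall (as requested here) is probably cleaner than editing the matches by hand; the search just shows the extent of the breakage.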

Hang-Lei-NOAA commented 3 years ago

The test version you mentioned is being renewed and will be fully ready in an hour or so.
