OP-DSL / OP2-Common

OP2: open-source framework for the execution of unstructured grid applications on clusters of GPUs or multi-core CPUs
https://op-dsl.github.io
Other
99 stars 47 forks source link

Simplify and unify Makefile build system #219

Closed bozbez closed 3 years ago

bozbez commented 3 years ago

Changes

New build system

The old build system was entirely removed and a new one with the following structure implemented:

As well as reducing duplication and inconsistency, the new system also allows for parallel builds of the OP2 libraries as well as the apps. In addition, incremental builds with automatically generated header dependencies (via -MMD -MP) now work.

Building

The process is kept as similar as possible to the previous system, but with more configurability. For a simple build:

  1. Set your compiler toolchain: export OP2_COMPILER={gnu, cray, intel, ...}
  2. Set your ParMETIS, PT-Scotch and HDF5 library directories: export {PARMETIS, PTSCOTCH, HDF5}_INSTALL_PATH=<path>
  3. Build the OP2 libraries: cd op2; make -j.
  4. Build an app: cd apps/c/airfoil_hdf5; make generate; make -j.

TODO

gihanmudalige commented 3 years ago

From initial (read as brief !) inspection, I am ok with this. A vast improvement to what we already have.

However, I think we might need to pull request this to merge to a "develop" branch before directly merging to master, just so that we have a working master branch for the ongoing Hydra work. This way we can approve PRs to develop and once all the TODOs, test script runs, and tests with OP2-Hydra passes we can merge to master. Thus also leading the way to a "Release".

What do you think ?

bozbez commented 3 years ago

I agree that it makes sense not to try to merge this directly if there is stability required for current Hydra work, but I'm not so sure about a "develop" branch - I think the more common pattern here would be to instead create a release branch for the Hydra work, which can then have changes cherry-picked from the master branch. Doing it like this avoids having to switch the develop branch with the master or trying to merge all of the commits from the develop branch. I am not entirely sure what is the right answer though - still a bit out of the loop regarding what work is happening/planned.

Regarding the TODOs and initial tests, I think those should be done directly on the feature/unified-make branch and then the PR updated, rather than merging into a master or develop branch and leaving that in a potentially dysfunctional state.

gihanmudalige commented 3 years ago

Ok, fine. I am happy to go for a release/OP2-Hydra-V1.0 branch which will basically be the current master that works with the RR Hydra code. However once the current planned changes to OP2 is merged and we have a proper OP2 release, we need to make sure that OP2Hydra works with this release. Afterwards we need to make sure new releases do not break OP2-Hydra as RR will be looking to work with it.

With regard to what work is planned, @bozbez you are actually in the loop. So the work discussed with the RR team at the recent meeting (moving OP2-Hydra to Hydra V20) will be the major piece of work. This will need to be done alongside the TODO list in #220

gihanmudalige commented 3 years ago

Come to Think of it some of the tasks in #220 may be best handled separately form #219. For e.g. the update to code generator and SYCL/HIP paralleizations would be a bit involved. So perhaps we only do the relevant tasks from #220 in this PR ?

bozbez commented 3 years ago

Come to Think of it some of the tasks in #220 may be best handled separately form #219. For e.g. the update to code generator and SYCL/HIP paralleizations would be a bit involved. So perhaps we only do the relevant tasks from #220 in this PR ?

Agreed, that was my intention at least - this PR is already a little larger than I would like.

bozbez commented 3 years ago

Seems that NVFORTRAN really has a few too many issues with parallel building to bother supporting it... the source of the issue is that it produces .o files in the directory that it was invoked in even if it is doing a combined compile link. The immediate solution to try and separate the .o compilation and linking doesn't work due to FORTRAN's module system. Attempts to fix that by generating the module files first with -Msyntax-only fail as there is no way to inhibit module file generation during compilation to .o, leading to a race hazard where one compiler instance is re-writing a module file with new content while another attempts to read it.

The only working solution that I found was to cd into a build directory and then invoke the compiler with adjusted paths, but this is rather messy. Since the builds of these apps is not particularly time consuming and no other FORTRAN compiler has this issue I've added a F_HAS_PARALLEL_BUILDS feature flag to the compiler Makefiles which simply disables parallel compilation (via .NOTPARALLEL:) for any misbehaving compilers.

bozbez commented 3 years ago

Worth noting also that we narrowly avoid this issue with the parallel build of the OP2 static libraries due to the only module inter-dependency being on op2_for_declarations.F90, which we manually specify as a make dependant for all the other .F90 files.

gihanmudalige commented 3 years ago

Not having Fortran parallel build is ok for now. I think we have bigger fish to fry ! However it wold be interesting to see this on Hydra. RR probably would not have tested their Hydra build with PGI/NV compilers, but we will have to due to cudafortran. Although I suspect the dependencies set up in the Hydra Makefiles may be managing this issue. I have not had trouble with nv_hpc compilers (except for internal compiler bugs .. thats a different issue).

bozbez commented 3 years ago

Two new environment variables added: HDF5_SEQ_INSTALL_PATH and HDF5_PAR_INSTALL_PATH that allow specification of both the parallel and sequential HDF5 libraries. If only the old HDF5_INSTALL_PATH is specified, then the flavour is auto-detected (as in #218) and the appropriate _SEQ_ or _PAR_ variant is set. If one of the variants is missing then the apps that depend on it will not be built.

HDF5 is now also linked as a shared library, as NVFORTRAN doesn't seem to handle the -l:libhdf5.a syntax correctly. The library directory from the INSTALL_PATH is now added to the rpath for convenience to avoid having to set LD_LIBRARY_PATH.

Disabling the MPI C++ bindings for MPT turned out to be non-obvious - there doesn't seem to be an official flag or environment variable to do this, so I ended up going with -DMPIPP_H to manually stop the header include which seems a bit hacky. One thing to consider for the future might be to drop the mpiXX compiler wrappers and just specify the include/link directories to the compiler.

gihanmudalige commented 3 years ago

Thanks @bozbez . We definitely need all of these flags documented as we are getting more and more of it !

bozbez commented 3 years ago

Do you have a preference to where that should go? Currently there are old build instructions in op2/README and apps/c/README but since a lot of the variables are now shared it might make more sense to add them in the top level README for now, until the readthedocs stuff comes around.

Might also be a good opportunity also to markdown-ify the README à la OPS.

gihanmudalige commented 3 years ago

@bozbez yes a toplevel README will do. Take a look at the README in OPS https://github.com/OP-DSL/OPS and its readthedocs https://ops-dsl.readthedocs.io/

It would be good to update the OPS READMEs and make both OPS and OP2 consistent as well (so we have a unified style / rules for both these repos). Any good things you do with OP2 we like to do in OPS as well !

bozbez commented 3 years ago

Gave the OpenMP4 offloading support a go with the NVHPC compilers which have offloading support by default. With a bit of tweaking the C non-mpi openmp4 variant builds, but for some reason isn't detecting and using the GPUs at least on Cirrus (and with no way to specify or control this). The FORTRAN openmp4 variants don't build due to an issue with the translator; the user kernel functions are missing a !$ omp declare target annotation. The MPI variants also fail to build, due to lack of runtime library support - given that these weren't built prior to this PR it is to be expected.

For now I've excluded all but the C non-mpi openmp4 variant from the all target, and added a TODO/openmp4 comment to some of the places that need some attention to get the full support working, but I think it is worth considering if this support should be in the master branch as it doesn't seem particularly ready. It might also be worth considering if it is wanted at all given that we might be pulling in the SYCL support soon.

reguly commented 3 years ago

Makes sense! OpenMP4 will be interesting on the Fortran side, as it's the only way to run on AMD GPUs from Fortran...

gihanmudalige commented 3 years ago

@bozbez , README looking good. There is a bunch of source files under the OP2-Common/scripts directory. Some of them are old, but we should retain at least one for each compiler. Unless there is an argument for not keeping these (@reguly ?) . On our HPC nodes we have latest Intel and nv_hpc setups and sourcefiles which I will commit to this branch, if so.

bozbez commented 3 years ago

So I was kinda hoping that we would be able to incorporate anything worthwhile from the scripts directory as one of the new profiles in more of a structured manner... the problem is that a lot of them reference actual user directories and the like which I don't really think is appropriate to put in this repository. See makefiles/profiles/cirrus-intel-nvhpc.mk for an example of something script-like as a profile.

There are a couple of issues that come to mind though - while the profiles work great for setting up compiler variables and environment paths they are still just makefiles, which means running stuff like source ... and module load ... won't work, and even if they did running them on every make invocation is almost certainly a bad idea.

Personally I think that anything running setup with module loads and etc should be written as a convenience by the user for their specific environment if they feel they need such a thing - otherwise we end up committing files that are specific really only to one user on one machine and are of minimal use as anything but a rough example for other people. These are also liable to be left unmaintained and unused, with no clear point of clean up...

gihanmudalige commented 3 years ago

@bozbez I can certainly see your point. I can see similarities with the source files with what you have in makefiles/profiles/cirrus-intel-nvhpc.mk, Now in it we have things like NVCC :=/lustre/sw/nvidia/hpcsdk-219/Linux_x86_64/21.9/compilers/bin/nvcc which surely will be a module load on Cirrus, then simply using nvcc will work for compilation. Given as you say module loads should not be used in a Makefile, are these "profiles" useful at all ? A user could think that something that can be built with a simple module load needs the absolute paths set to get OP2 working. On HPC clusters sometimes finding these paths can be quite tricky. Besides if Cirrus changes these locations then that would be again problematic in terms of utility and maintenance. In which case we should altogether not have such profiles.

gihanmudalige commented 3 years ago

@reguly any comment on this ? I am sort of leaning towards the simple source files we have already - just to showcase the setting up of the environment. But happy to do otherwise.

reguly commented 3 years ago

I never really used the source files committed to these repos, they were always hardcoded to your setup :) so I am fine without having those committed in as @bozbez suggests

gihanmudalige commented 3 years ago

@reguly actually I was more considering whether its useful to a user to see one of these source files. Of course the hard coded setup in the profiles also has the same issue.

reguly commented 3 years ago

I don't think that is particularly relevant, as all the environment variables a properly documented.

gihanmudalige commented 3 years ago

@reguly ok. great. Lets leave it at that. I am happy with this.

gihanmudalige commented 3 years ago

@bozbez Thanks, your most recent push fixed the build issue and Hydra is subsequently running the basic Rotor 37 problem correctly. Continuing testing with the coupler now.

bozbez commented 3 years ago

@bozbez I can certainly see your point. I can see similarities with the source files with what you have in makefiles/profiles/cirrus-intel-nvhpc.mk, Now in it we have things like NVCC :=/lustre/sw/nvidia/hpcsdk-219/Linux_x86_64/21.9/compilers/bin/nvcc which surely will be a module load on Cirrus, then simply using nvcc will work for compilation. Given as you say module loads should not be used in a Makefile, are these "profiles" useful at all ? A user could think that something that can be built with a simple module load needs the absolute paths set to get OP2 working. On HPC clusters sometimes finding these paths can be quite tricky. Besides if Cirrus changes these locations then that would be again problematic in terms of utility and maintenance. In which case we should altogether not have such profiles.

So the reason the nvcc executable is hardcoded there is because of the Cirrus MPI module struggling to sort out it's paths when there was more than one compiler loaded - in particular if you load intel-compilers-19 and then nvidia/nvhpc then you ended up with a broken mpicxx. Since the compilers on Cirrus are manually inserting rpaths into /lustre/sw already so that you don't need module loads to run previously compiled executables, I think that hardcoding the path here is an acceptable workaround.

I think its also worth noting the difference between hardcoding paths to shared resources on known clusters and hardcoding paths into a user folder which others can't access.

The intention of the profiles, and the reason they are there instead of source files, is that they allow the user to set make variables after the "configuration" of the compilers and the libraries. This allows you to easily override or the append to e.g. CXXFLAGS and HDF5_LIB, whereas source files would only immediately allow you to override the variables. The solution here for the source files would be to add _EXTRA variables to the end of each configure variable (which might still be worth doing). Even with the _EXTRA variables though, the profiles still have the advantage of being able to do conditional environment setup using ifdef and the lot.

Source files still have a place, but rather than using them for cluster common setup like -march= and compiler paths they should be used as a shortcut to set OP2_PROFILE and X_INSTALL_PATH as well as doing the module loads. However I think rather than committing them to the repo, it is reasonable to expect the user to produce these in a directory above as a shortcut to manually setting the variables each time - as long as the variables are clearly documented.

gihanmudalige commented 3 years ago

I have tested OP2-Hydra and OP2-Hydra with the coupler with this new OP2 branch. All tests passes. We should get Arun to switch to this version, when we are ready to merge this pull request.

bozbez commented 3 years ago

So unfortunately I haven't been able to actually get an OpenMP 4.0 offloading build to offload - on Cirrus with the NVHPC SDK compilers the binaries fail to find the GPU, and on my new work machine with GCC built for offloading the same happens but perhaps this time it might be that the GCC offloading doesn't support Ampere. The offload compilers are being invoked however, and the PTX is generated, but no luck at runtime.

Since the builds complete and compiler flags seem fine and are equivalent to the old build system I am going to mark this task as complete despite not being able to actually run the binaries.

In general I think this is pretty much ready to merge now, there are perhaps a few things that might need tweaking over time (and perhaps for Spack) but that should not pose a significant problem given the level of flexibility in the unified system.

reguly commented 3 years ago

Which software stack did you use on Cirrus? The 21.2 had problems with not fining the GPU, but the latest works. Also, I just got an email from Cirrus support earlier today, that 21.2 should be fixed too.

On 2021. Dec 1., at 13:04, Joe Kaushal @.***> wrote:

So unfortunately I haven't been able to actually get an OpenMP 4.0 offloading build to offload - on Cirrus with the NVHPC SDK compilers the binaries fail to find the GPU, and on my new work machine with GCC built for offloading the same happens but perhaps this time it might be that the GCC offloading doesn't support Ampere. The offload compilers are being invoked however, and the PTX is generated, but no luck at runtime.

Since the builds complete and compiler flags seem fine and are equivalent to the old build system I am going to mark this task as complete despite not being able to actually run the binaries.

In general I think this is pretty much ready to merge now, there are perhaps a few things that might need tweaking over time (and perhaps for Spack) but that should not pose a significant problem given the level of flexibility in the unified system.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/OP-DSL/OP2-Common/pull/219#issuecomment-983575526, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJWVVP6CBCKL22RDNAQHG3UOYFNNANCNFSM5IAA5ZAQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

bozbez commented 3 years ago

I was using the default loaded by nvidia/nvhpc which was 21.9 I think? I'll give it another go (if the login node stops timing out...).

gihanmudalige commented 3 years ago

@bozbez looks good. @reguly and @bozbez we should now move the current master to a release/2020 branch and merge this branch to master. Assuming this naming convention is ok. This PR is working with Hydra as I noted above, but I will need to push some minor modifications to Hydra which I can do soon.

Additionally, I was thinking of making a default develop branch (as we do in OPS). Did we have an agreement on if this is ok ? I would rather like to keep both OP2 and OPS similar as possible.

reguly commented 3 years ago

Thanks for setting this up! I agree with the separate branch for development!

On 2021. Dec 1., at 14:15, Gihan R. Mudalige @.***> wrote:

@gihanmudalige approved this pull request.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/OP-DSL/OP2-Common/pull/219#pullrequestreview-820240969, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJWVVMPXYBS5IIZ4WDJTLTUOYNYFANCNFSM5IAA5ZAQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

reguly commented 3 years ago

I have created the release/2020 branch and pushed it in. I also created a v1.0 tag at the same commit.

On 2021. Dec 1., at 14:26, István Reguly @.***> wrote:

Thanks for setting this up! I agree with the separate branch for development!

On 2021. Dec 1., at 14:15, Gihan R. Mudalige @. @.>> wrote:

@gihanmudalige approved this pull request.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/OP-DSL/OP2-Common/pull/219#pullrequestreview-820240969, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJWVVMPXYBS5IIZ4WDJTLTUOYNYFANCNFSM5IAA5ZAQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

bozbez commented 3 years ago

Personally I don't think we gain anything from having a develop branch and a master branch; the feature development is already done in separate feature branches and the stable versions are now in release branches.

I agree that we should try and align with OPS though, so there's that.

gihanmudalige commented 3 years ago

@reguly thanks for creating the relevant branches. So we are going with the convention of v1.0 etc. (no problem). I suppose this PR and also the completion of the other tasks will soon get us to v2.0 ? Particularly the new code gen should take us to 2.0 I think.

W.R.T develop branch, I think it will help external developers PR on to the develop branch and then us being able to test it and merge to develop. This way we are sure that the master branch is always working and we periodically merge in develop branch as major features are added. So in affect we have a leading edge branch and a "stable" branch.

reguly commented 3 years ago

What we are not used to doing (but we probably should!) is doing releases, which would also help address this master/develop branch difference (well, it would obviate it)

On 2021. Dec 1., at 15:14, Gihan R. Mudalige @.***> wrote:

 @reguly thanks for creating the relevant branches. So we are going with the convention of v1.0 etc. (no problem). I suppose this PR and also the completion of the other tasks will soon get us to v2.0 ? Particularly the new code gen should take us to 2.0 I think.

W.R.T develop branch, I think it will help external developers PR on to the develop branch and then us being able to test it and merge to develop. This way we are sure that the master branch is always working and we periodically merge in develop branch as major features are added. So in affect we have a leading edge branch and a "stable" branch.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.