Closed: zbeekman closed this issue 5 years ago.
How to run a very basic simulation once the software has been built, preferably with parallelism enabled.
I think this can be shown through the tutorial we already have in the GIS4WRF plugin. I can create a new one based on this and have it under the tutorials section in GIS4WRF to avoid creating an additional website for WRF-CMake. How does this sound @zbeekman, @letmaik?
A brief description with a link to more detailed instructions/documentation is fine, IMO, since the main point of the WRF-CMake project is to provide a new build system for the upstream project. You're certainly welcome to add your own details and examples, but for someone who has never used WRF before it would be good to point them in the right direction. Pointing users to existing WRF documentation upstream or in other projects is also suitable (for me at least), provided it is sufficiently clear and up to date with this project.
@zbeekman I would normally agree that linking to official WRF docs is the right thing to do. Unfortunately they are huge and, in my opinion, hard to follow. There's no simple end-to-end example as far as I can see. That's why @dmey was suggesting copying the existing GIS4WRF tutorial and adapting it into a non-GUI tutorial. As a side effect we can also point users to the GIS/visualization capabilities of GIS4WRF, which is very helpful for such tutorials. It's a tiny bit of extra work but I think worth it.
Sounds good to me. If there are no easy to follow examples, then it may make sense to provide your own.
@zbeekman and @letmaik I have updated the README.md with some cleanup and additional information re docs and example usage as requested. See https://github.com/WRF-CMake/WRF/blob/dmey/docs/README.md. The "How to cite" section will need to be completed if/once the paper is accepted, but it would be good to start getting feedback on how to phrase that section as well.
With regards to the tutorial using WRF-CMake on the GIS4WRF website I link from the README.md, I am keeping it on a branch at https://github.com/GIS4WRF/gis4wrf.github.io/tree/dmey/tutorials just in case there are additional changes to make -- all changes from this branch are incorporated into the live website.
@letmaik are you happy with the revised installation section in README.md? I think it is easier/clearer to follow now, especially with the additional brew section, as most macOS and Linux end users will probably want to install using Homebrew/Linuxbrew since it makes things very straightforward.
@letmaik, @zbeekman raised a good point about testing (see 2.). I don't think we have addressed it -- let me know, because this may be a bit tricky to handle if we have to support testing for end users...
@dmey There is no separate wps formula; it gets installed together with wrf (see the formula for details why). Also, I would split this into two lines and not use `&&`. Otherwise it looks fine. I'm not sure that Linuxbrew is really popular yet, but that's OK, as it's just an option.
Regarding unit testing, well, WRF doesn't really have tests. Running a simulation and looking at the results is the best you will get.
@zbeekman in the online tutorial I use `link_grib.py`. This is just an FYI, but `link_grib.py` has the new CLI arguments only in the wps-cmake branch, not in the last binary release, so if you build from a release please use the new `link_grib.py` from the wps-cmake branch.
@letmaik now moved to two lines. Re the other issue I don't think he was referring to unit testing but to what we do already in CI:
> - How to run any existing unit/assessment/integration tests manually.
>
> It's great that the CI pipeline does some assessment & acceptance testing automatically, but it would be beneficial for any human who wishes to verify that their installation is working to run some automated tests themselves, and set up and run a very simple example. Currently, there are no instructions on how to do this.
From the tutorial:
```shell
brew tap osgeo/osgeo4mac
brew install qgis3 && brew install qgis3 # avoid issues with current formula
```
The qgis formula is very complex. I had to `ulimit -n 1024` to get it to install, because the formula itself (i.e., processing the Ruby) exceeds macOS's default maximum number of open files. Also, I'm somewhat alarmed by the complexity of the qgis formula, the number of formulae they have shadowing core, and their reliance on outdated/deprecated Homebrew functionality. It looks very fragile.
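For anyone else who hits this, a minimal sketch of the workaround (the value 1024 is just what happened to work here; pick any value your hard limit allows):

```shell
# Raise the soft open-file limit for the current shell only, if it is below 1024.
SOFT=$(ulimit -Sn)
if [ "$SOFT" != "unlimited" ] && [ "$SOFT" -lt 1024 ]; then
  ulimit -Sn 1024
fi
echo "open-file soft limit: $(ulimit -Sn)"
# brew install qgis3   # then retry the install in the same shell session
```

Note the limit only applies to the current shell session, so run the install from the same shell.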
Anyway, that's not your problem.
However, in the tutorial, if I had known ahead of time just how many packages and duplicate packages were needed for qgis3 via Homebrew, then I probably would have just done `brew install ncview`. I wish they didn't have that Python 3.6 dependency; then you could just write a cask for qgis and install it from the pre-packaged binary installer.
Despite this, your plugin and its ability to integrate with and drive WRF looks powerful. I'm hoping these 100 prerequisite packages for qgis will finish installing sometime tonight so that I can finish the tutorial and check out the plugin.
My bad! I should have suggested just using `ncview` in your case. The reason we did not tie the tutorial to a specific piece of software for viewing WRF netCDF output files is that this is not really part of WRF-CMake, and we wanted to keep the tutorial as cross-platform as possible. I appreciate the concerns with the installation on macOS but, as you said, this is another problem altogether. On Windows, for example, the standalone QGIS is really straightforward to install (and I think that is where the main QGIS user base is found...).
Let me know if you get stuck, but hopefully it'll work without having to spend much time on this!
> - How to run any existing unit/assessment/integration tests manually.
>
> It's great that the CI pipeline does some assessment & acceptance testing automatically, but it would be beneficial for any human who wishes to verify that their installation is working to run some automated tests themselves, and set up and run a very simple example. Currently, there are no instructions on how to do this.

You can do this as follows:
```shell
# Note: the following involves downloading 1 GB of reference data and running simulations for 10-30 min.
git clone https://github.com/WRF-CMake/wats.git

# Install Python packages, either via conda:
conda env create -n wats -f wats/environment.yml
conda activate wats
# or via pip:
pip install -r wats/requirements.txt

# Run test cases.
# E.g. for brew: --wrf-dir $(brew --cellar wrf-cmake)/4.1.0/wrf --wps-dir $(brew --cellar wrf-cmake)/4.1.0/wps
python wats/wats/main.py run --mode wrf --mpi --wrf-dir /path/to/wrf --wps-dir /path/to/wps

# Note: replace Linux with macOS/Windows as appropriate.
mv wats/work/output wats_Linux_CMake_Release_dmpar

# Download reference data to compare against:
# 1. Go to https://dev.azure.com/WRF-CMake/WRF/_build?definitionId=5
# 2. Select a successful build from branch "wrf-cmake"
# 3. Click on "Summary"
# 4. Download the wats_Linux_Make_Debug_serial build artifact (~1 GB)
# 5. Extract the archive to the current folder

# Compute and plot differences.
python wats/wats/plots.py compute wats_Linux_Make_Debug_serial wats_Linux_CMake_Release_dmpar
python wats/wats/plots.py plot --skip-detailed
ls wats/plots
# Compare magnitudes in nrmse.png and ext_boxplot.png with the plots in the JOSS paper.
```
@zbeekman This replicates what the CI does. I realize it could be a bit more automated. Also, having to download 1 GB of reference data is not ideal. Still, do you think it is sufficient for the time being? Anything different would probably mean creating new test cases etc., which I'd like to avoid.
Yes, this is fine. Sorry I've had to attend to some pressing work stuff the past few days. To finish my review, I think the only remaining tasks are:
I'll do my best to wrap this up tomorrow.
I've documented this now in https://github.com/WRF-CMake/WRF/commit/bb0eb92c449ad1141d9dd3df9f3824f0046fca01.
Hi @zbeekman, have you been able to look at https://github.com/WRF-CMake/WRF/issues/24#issuecomment-503301464? Let me know if anything is unclear or looks like it would take a lot of your time, and I can simplify the process! Thanks.
Sorry for being out of touch, I got inundated. I'm testing on SGI/HPE (Intel) and Cray/Cray right now. The builds seem to get a bit bogged down; I'm hoping I don't have to go get my own node to do the builds, but we'll see.
I'm not sure if you're using CMake's standard FindNetCDF or a custom one, but the SGI/HPE machine would populate `NETCDFC_HOME` and `NETCDFFORTRAN_HOME`, and on the Cray, NetCDF was installed by Cray and available via `module load cray-netcdf`, with both the C and Fortran prefixes available from `nc-config --prefix`. It would be nice, assuming you're rolling your own FindNetCDF, to do a little more introspection and search in, e.g., `nc-config --prefix` for the user.
Also, it looks like WRF doesn't like being compiled against Cray's libhugetlbfs, FWIW.
```
libhugetlbfs [cray.hpc.system.foo:27646]: WARNING: New heap segment map at 0x10000000000 failed: Cannot allocate memory
```
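A possible mitigation, purely an assumption on my part (I have not verified that it removes the warning): build and run without hugepage-backed malloc by unloading Cray's hugepages module if one is loaded, and quiet libhugetlbfs diagnostics:

```shell
# craype-hugepages2M is one common module name; adjust to whatever `module list` shows.
module unload craype-hugepages2M 2>/dev/null || true
# Silence libhugetlbfs warnings (verbosity ranges 0-99; the default is 1).
export HUGETLB_VERBOSE=0
```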
I've also verified the pre-built binaries seem to work well. I like what you did with packaging the dylibs. That's pretty clever.
All that remains is to wait for the SGI/HPE and Cray builds to finish, and to finish going through the example problem (which I'll do while I wait for the builds).
No problem! We recently changed the way FindNetCDF finds the library, as we had a few issues on some systems where NetCDF-C and NetCDF-Fortran used different directories. Are you using:

```shell
cmake -DNETCDF_DIR=<path_to_netcdf-c-dir> -DNETCDF_FORTRAN_DIR=<path_to_netcdf-fortran-dir> ..
```
We updated the "Note for HPC users relying on the Modules package" section recently...
Let me test this on the Cray at my end to see if I also get issues -- it has been a few weeks since I last tried on Cray. With regards to searching via `nc-config --prefix`, I believe @letmaik looked at this, but if I remember correctly we decided it was not worth doing -- I may be mistaken though!
> Are you using:
>
> `cmake -DNETCDF_DIR=<path_to_netcdf-c-dir> -DNETCDF_FORTRAN_DIR=<path_to_netcdf-fortran-dir> ..`
Yes, and that works completely fine. On the Cray, those are both the same directory and are set to the output of `nc-config --prefix`. My suggestion is just to test for `nc-config`, if you can find it, and see if its `--prefix` is suitable for passing to FindNetCDF etc., so the user doesn't have to provide it.
This is really not an issue at all, merely a suggestion, since it is nice to save the user from typing and hunting down paths.
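To illustrate the suggestion, a hypothetical wrapper (the `netcdf_prefix` helper name is made up for this sketch and is not part of WRF-CMake):

```shell
# Ask nc-config for the NetCDF install prefix when it is on PATH; otherwise
# fall back to an explicitly exported NETCDF_DIR (or a last-resort default).
netcdf_prefix() {
  if command -v nc-config >/dev/null 2>&1; then
    nc-config --prefix
  else
    echo "${NETCDF_DIR:-/usr/local}"
  fi
}

# Usage (commented out, as it needs a WRF-CMake build directory):
# cmake -DNETCDF_DIR="$(netcdf_prefix)" -DNETCDF_FORTRAN_DIR="$(netcdf_prefix)" ..
```

The same introspection could of course live inside FindNetCDF itself via `find_program`/`execute_process`.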
I'm not sure what's going on with my Cray build right now. It's been stuck right after building and linking fftpack (44%). I cloned from master, though, so maybe I should go back and grab a tag. Eventually the Cray compiler spat out the following cryptic and confusing message:

```
ftn-3178 crayftn: LIMIT in command line
  The compiler cannot open file "mod/MODULE_CONFIGURE.mod". This file was either created or previously accessed in this compilation, so should be available.
```

But the process doesn't appear to have been killed yet, so I'm going to let it sit there a little longer. It is a Cray XC40/50 with Intel Xeon E5-2699v4 Broadwell CPUs on standard compute and head nodes, running SLES with 8 GB available to users on the head nodes. Now that I've looked up the available memory, I'm guessing I'm spilling into swap or otherwise running out of memory during the build, so I'll grab a batch/compute node and try the build there.
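If memory during the build is indeed the culprit, one low-effort mitigation is to cap the number of parallel compile jobs (a sketch; the cap of 4 is a guess based on the ~8 GB figure above, not a measured value):

```shell
# Cap parallel compile jobs so the build fits in limited head-node memory.
NPROC=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)
JOBS=$(( NPROC < 4 ? NPROC : 4 ))
echo "building with -j $JOBS"
# cmake --build build -j "$JOBS"   # or: make -j "$JOBS"
```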
@dmey: I'd give the Cray build another shot; for me, it still seems to stall out (albeit I am on a batch node now, not a compute node, though I still have more memory here than on the login node... I may kill this and restart it on a compute node proper...)
Here are the modules I have loaded:
```
Currently Loaded Modulefiles:
  1) modules/3.2.10.6
  2) cce/8.6.4
  3) craype-network-aries
  4) craype/2.5.13
  5) cray-libsci/17.11.1
  6) udreg/2.3.2-6.0.7.1_5.13__g5196236.ari
  7) ugni/6.0.14.0-6.0.7.1_3.13__gea11d3d.ari
  8) pmi/5.0.12
  9) dmapp/7.1.1-6.0.7.1_6.2__g45d1b37.ari
 10) gni-headers/5.0.12.0-6.0.7.1_3.11__g3b1768f.ari
 11) xpmem/2.2.15-6.0.7.1_5.11__g7549d06.ari
 12) job/2.2.3-6.0.7.1_5.44__g6c4e934.ari
 13) dvs/2.7_2.2.120-6.0.7.1_12.1__g74cb2cc4
 14) alps/6.6.43-6.0.7.1_5.46__ga796da32.ari
 15) rca/2.2.18-6.0.7.1_5.48__g2aa4f39.ari
 16) atp/2.1.1
 17) perftools-base/6.5.2
 18) PrgEnv-cray/6.0.4
 19) cray-mpich/7.6.3
 20) java/jdk1.8.0_152
 21) eproxy/2.0.22-6.0.7.1_7.5__g1ebe45c.ari
 22) craype-broadwell
 23) pbs
 24) ccm/2.5.4-6.0.7.1_5.27__g394754f.ari
 25) cray-hdf5/1.10.0.3
 26) cray-netcdf/4.4.1.1.3
```
And I'm trying to build with:
```
Cray Fortran : Version 8.6.4  Mon Jul 15, 2019  13:13:51
```
I'm going to try launching the build on the compute node itself via `aprun` to see if I can get it to work without running out of memory, etc.
@letmaik just to confirm that the results I'm seeing make sense:
Is the % relative error in w high because w is close to zero? (Is w the radial/out-of-plane/vertical direction?)
Also, unless anyone disagrees, let's close this; I'm satisfied with the tutorial and the LOCAL.md documentation.
@zbeekman Glad to see you're back :) I created #36, but we won't do this for the coming release, mostly due to the current lack of resources to do it properly in CMake and test it. I haven't seen a FindNetCDF floating around on the web that does this; otherwise we could have borrowed it.
@zbeekman Regarding interpretation of the results: `w` (the vertical component of wind velocity) is around 0.01, so yes, but I don't remember exactly what the reason could be. The takeaway is that you see the same errors with the existing Makefile-based build, and we're not trying to solve that.
> I haven't seen a FindNetCDF floating around on the web that does this; otherwise we could have borrowed it.
Gah, I always forget that CMake doesn't maintain one... Also, newer versions of NetCDF ship a CMake build system which should be capable of installing a CMake package config file to export the installed targets, but, IIRC, last time I tried the NetCDF CMake build I hit some errors. In an ideal world, people would have the CMake build generate both pkg-config files and CMake package config files for NetCDF, then stop installing it with autotools. I'm not sure that will ever happen, though.
> `w` (vertical component of wind velocity) is around 0.01
I just read the latest draft of the JOSS paper which reminded me of that fact shortly after I typed that, but thanks! This makes sense to me; most of the time there isn't much vertical convection and the atmosphere is often (usually) vertically stabilized due to the negative lapse rate. (If my memory from the one geophysical fluid dynamics course I took in grad school is accurate.)
If the magnitude of `w` is small, then relative errors may seem much larger than absolute errors. Also, you can conceptualize a "condition number" for computing the variance (cf. Chan & Lewis 1978), and if the variance is really small relative to the mean, your choice of algorithm for computing the variance can introduce numerical error.
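To put rough numbers on both points (illustrative values, with $w \approx 0.01$ as mentioned above; the condition-number expression follows the Chan & Lewis-style analysis cited below):

```latex
% Relative vs. absolute error when |w| is small:
% an absolute error of 10^{-4} on w = 10^{-2} is already a 1% relative error.
\frac{|\hat{w} - w|}{|w|} = \frac{10^{-4}}{10^{-2}} = 10^{-2}

% Condition number for computing the variance s^2 of data with mean \bar{x}:
% large (ill-conditioned) whenever s is small relative to \bar{x}.
\kappa \approx \sqrt{1 + \bar{x}^2 / s^2}
```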
I wonder why there is such a big difference between OSes, though. Maybe GCC needs to be compiled differently on macOS? Or maybe it's generating different/strange machine code on macOS?
T.F. Chan and J.G. Lewis. *Rounding error analysis of algorithms for computing means and standard deviations.* Technical Report 284, Johns Hopkins University, Department of Mathematical Sciences, 1978.
Speculatively, I would attribute the large errors we see in `w` to convection -- I have not looked into this, but given the amount of parameterization involved I am not surprised to see larger errors in the vertical than in the horizontal components. The take-home message is still that after t0 the results deviate from each other, but more due to a change in platform than a change in build system.
FYI, the build seems to be progressing (VERY SLOWLY) on a compute node on the Cray. I wonder if the Cray compiler is just slower? Or maybe it's doing some very aggressive link-time optimization? I think Cray compilers link statically by default most of the time, so perhaps that, paired with link-time optimization, makes the linker very slow.
Also, I updated the JOSS issue to indicate I'm finished with my review, and recommend (enthusiastically) publication pending merging the PR with contributing guidelines etc. Hopefully this will help spur the other reviewer into action.
Thanks @zbeekman, you are too fast!
> I wonder why there is such a big difference between OSes, though. Maybe GCC needs to be compiled differently on macOS? Or maybe it's generating different/strange machine code on macOS?
Great question!
With regards to Cray: I have tried to compile master in Debug mode and was able to get to 80%, but then got the following error:

```
ftn-855 crayftn: ERROR MODULE_MP_JENSEN_ISHMAEL, File = ../../../../../../wrf/phys/module_mp_jensen_ishmael.F, Line = 1, Column = 8
  The compiler has detected errors in module "MODULE_MP_JENSEN_ISHMAEL". No module information file will be created for this module.
ftn-1725 crayftn: ERROR GAMMAP, File = ../../../../../../wrf/phys/module_mp_jensen_ishmael.F, Line = 4516, Column = 18
  Unexpected syntax while parsing the WRITE statement : "operand" was expected but found ",".
```

Please let me know if you also get the same. In Release mode it's just too slow! I waited for an hour and it was still at 44%, so I'm not sure what is going on there. Using Cray Fortran Version 8.5.8.
@letmaik I believe this may actually be an issue with the latest branch, as there were no issues when we tested on Cray a while ago. I do not think many people use WRF on Cray...
@dmey Possibly; I wouldn't be surprised if WRF 4.1 introduced new issues with Cray, as they don't seem to test on it regularly. https://github.com/WRF-CMake/wrf/blob/795c293825210c76888311f73a5e22cad45f7ad8/phys/module_mp_jensen_ishmael.F#L4515-L4516 Yep, the comma on line 4516 is one too many... gfortran is probably more forgiving.
At one point I saw a similar error, but I'm not 100% sure it was the same. That may actually have been an out-of-memory issue or similar, because when I switched to the compute node it seems to have gone away. I've been compiling a Release build for ~3 hours now.
Yeah, 4516 is not valid Fortran syntax.
I fixed the syntax error. If anyone wants to retry, feel free.
Takes too long... haha. If upstream isn't testing regularly, I'll assume the CMake build works approximately as well or better (for Cray) based on the evidence I've seen so far.
Removing the comma on line 4516 fixes the issue. I have been able to successfully build master on Cray using the Cray C/Fortran compiler version 8.5.8, serial mode, build type Debug. I did not try Release as it takes far too long!
The JOSS review asks reviewers to verify:
While the primary purpose of this work is to extend the build system and to enable and validate portability and build-system simplification, which it certainly has done, it would be good to include additional things in the README, even though they may be more the responsibility of the upstream project. (Of course you can quote with attribution, where appropriate.)
I think at a minimum, there are two pieces of information that need to be included in, or linked to from, the README:

- How to run a very basic simulation once the software has been built, preferably with parallelism enabled.
- How to run any existing unit/assessment/integration tests manually.