Closed sunt05 closed 5 years ago
@zhenkunl once you get some results for Shanghai, we can close this issue.
Great!
@zhenkunl
I assumed you have used WPS before. So the first step is to get standard wrfinput files
for Shanghai using WPS. After you do that, you should modify the inputs for the coupled version. I will write you a complete tutorial on this later on. But let's start with having standard wrfinput files
first.
In addition, here are some tips for running WRF/WPS on Jasmin:
When configuring WPS and WRF on Jasmin, we need to use the Intel compilers. For this purpose, before starting to configure or compile WPS or WRF, put the following in your .bashrc
file and source it:
module load intel/15.1
module load intel/mpi/5.1.2.150
export NETCDF=/apps/libs/netCDF/intel15/fortran/4.4.1
export WRFIO_NCD_NO_LARGE_FILE_SUPPORT=1
export J='-j 6'
export NETCDF_classic=1
export WRF_EM_CORE=1
For runs, use jasmin-sci3.ceda.ac.uk; otherwise you may run into memory problems.
WRF4 has a new requirement on domain decomposition: the number of grid points assigned to each processor in the x or y direction must not be less than 10. You might run into this problem, but it is easy to fix.
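The decomposition rule can be checked with simple arithmetic before submitting a job. A minimal sketch, where the domain sizes and processor counts are hypothetical examples (e.g. a `#BSUB -n 49` request could decompose as 7x7), not values from this case:

```shell
#!/bin/bash
# WRF4 requires >= 10 grid points per processor in each horizontal direction.
# nproc_x * nproc_y must equal the total cores requested with "#BSUB -n".
e_we=100     # grid points in x, from namelist.input (hypothetical value)
e_sn=100     # grid points in y (hypothetical value)
nproc_x=7    # processors in x
nproc_y=7    # processors in y

if (( e_we / nproc_x < 10 || e_sn / nproc_y < 10 )); then
  echo "Decomposition too fine: use fewer processors"
else
  echo "Decomposition OK: $((e_we / nproc_x)) x $((e_sn / nproc_y)) points per processor"
fi
```

If the check fails, either reduce the core count or let WRF choose the decomposition automatically.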
Here is a simple bash script for running jobs on Jasmin:
#!/bin/bash
#BSUB -q par-multi
#BSUB -n 49
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -W 24:00
echo "Running WRF"
# (use ./real.exe first for generating wrfinputs)
mpirun ./wrf.exe
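For reference, a script like the one above is submitted by piping it into bsub, which reads the `#BSUB` directives from the file. A small sketch (the file name `run_wrf.sh` is just an example; the guard on `bsub` only makes the snippet safe to try off-cluster):

```shell
#!/bin/bash
# Submit the LSF job script and check the queue; logs land in
# <jobid>.out and <jobid>.err as named by the #BSUB -o/-e directives.
if command -v bsub >/dev/null 2>&1; then
  bsub < run_wrf.sh   # file name is an example; use your own script
  bjobs               # list your pending/running jobs
else
  echo "bsub not found: run this on a Jasmin node with LSF available"
fi
```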
Thanks for your detailed explanation @hamidrezaomidvar. I will try to get started with Jasmin first. I will ask for your help when I experience difficulties.
Hi @hamidrezaomidvar. What is the difference between the wrf.exe under hamid/xx-test-xx-2, xx-test-xx-3, and xx-test-xx-4? Which one is the newest?
These are some of the local tests I am doing right now. Try xx-test-xx-2 if you'd like to run a case. The others are tests that I have not merged to master! Also, please clone the master of WRF-SUEWS, since test-dev still has some problems that I am fixing now.
BTW, I'd like to comment on the "best practice" for organising our WRF runs, as I can see more regions will be tested and applied with our coupled system.
We'd better separate wrf.exe
and other related static data files (e.g., those profile-like data files generated by WRF itself for a specific version) from your cases with input and output files, so all binaries stay in one place. Then, ideally, we would have a structure like this:
├── WRF-exe
│   ├── wrf.exe.orig-4.0
│   ├── wrf.exe.orig-4.1
│   ├── wrf.exe.suews-4.0
│   └── wrf.exe.suews-4.1
├── cases
│   ├── London-GMD-paper
│   └── London-test-201504
├── wrf-data
│   ├── CAM_ABS_DATA
│   ├── CAM_AEROPT_DATA
│   ├── CAMtr_volume_mixing_ratio.A1B
│   ├── CAMtr_volume_mixing_ratio.A2
│   ├── CAMtr_volume_mixing_ratio.RCP4.5
│   ├── ...many other files...
│   ├── tr49t85
│   ├── tr67t85
│   └── wind-turbine-1.tbl
├── wrfbdy
│   ├── London
│   │   ├── 201504
│   │   └── 201507
│   └── Shanghai
│       └── 201509
└── wrfinput
    ├── London
    │   ├── MODIS
    │   ├── MODIS-SUEWS
    │   └── MODIS-updated
    └── Shanghai
        ├── MODIS
        └── MODIS-updated
By adopting such a structure, we can set up different runs under the cases
folder and link configurations and binaries from the other places; also, as we are linking files, we know what the original information is and how we can proceed from there.
In the above structure, the wrfinput
part might need to be changed according to different initial conditions for specific cases, but I put it separately for the geographic data, which usually needs quite an amount of work to set up but won't change across runs of a specific region. So instead of linking, under certain scenarios, we'd better copy these to the cases
folder.
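As a concrete illustration of the linking idea, here is a sketch; the case name Shanghai-test-201509 and the choice of binary are made-up examples, and the paths follow the proposed layout above:

```shell
#!/bin/bash
# Set up a new case folder by linking shared binaries/static data and
# copying the per-region wrfinput files (which may be edited per run).
root=${1:-$PWD}                              # root of the proposed layout
case_dir="$root/cases/Shanghai-test-201509"  # example case name

mkdir -p "$case_dir"

# Link the chosen binary; the symlink records which build the case used
ln -sf "$root/WRF-exe/wrf.exe.suews-4.1" "$case_dir/wrf.exe"

# Link all version-specific static data files (CAM_ABS_DATA, etc.)
for f in "$root"/wrf-data/*; do
  ln -sf "$f" "$case_dir/$(basename "$f")"
done

# Copy, not link, the wrfinput files since they may be modified per case
cp -r "$root"/wrfinput/Shanghai/MODIS-updated/. "$case_dir"/ 2>/dev/null || true
```

The symlinks make it obvious which binary and static data a given run used, while the copied wrfinput files can be edited freely without touching the originals.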
@zhenkunl @sunt05
Here is a brief instruction on the preprocessing scripts:
Once you have standard wrfinput
files, you should follow the steps below to modify them and use them for the runs. Under WRF-SUEWS/wrfinput-processor/
there are 4 main folders with different functionalities:
---> /input-checker
: this folder contains a script that checks whether the SUEWS parameters being inputted to WRF are in the correct range and are logical. It is still in progress and not completed, but you do not need this step to modify the inputs.
---> /param_extractor_SuPy
: this folder contains scripts that (first) run SUEWS offline using 2012 London or Swindon parameters to spin up the model and (second) extract all the parameters needed for SUEWS to be inputted into WRF. Finally, they put them in two files: SUEWS_param_new.json and namelist.suews.new. The first one (SUEWS_param_new.json) contains grid-level parameters that need to be put directly inside the wrfinputs (via the change_to_SUEWS folder that I will explain next). You need to copy this file under WRF-SUEWS/wrfinput-processor/ and make sure the script in change_to_SUEWS has the right name for it. The other file (namelist.suews.new) contains the run-level parameters of SUEWS, and you need to put this file in the WRF-SUEWS run folder (changing its name to namelist.suews). Note that you also need a namelist.suews under WRF-SUEWS/wrfinput-processor/ to run the script of this folder, because it uses its structure to generate the new namelist file.
---> /change_to_SUEWS
: the script in this folder modifies the original wrfinputs and adds the SUEWS-related parameters to them. As I mentioned, it uses SUEWS_param.json under WRF-SUEWS/wrfinput-processor/. After running the script, you should copy the new wrfinputs to the WRF-SUEWS run folder.
---> /London-Land-Cover-Modify
: the script in this folder is just for the London run; it uses high-resolution land-use fraction data to modify the third domain (the London-focused domain). If you are using the original land-use data generated by WPS for Shanghai, you can ignore this folder; otherwise you can use it to modify your inputs for Shanghai.
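Putting these steps together, the overall order is roughly as sketched below. This is a hedged outline only: the run-folder name is an example, the relative paths are assumptions, and the actual script names inside each folder should be checked in the repo:

```shell
#!/bin/bash
# Sketch of the preprocessing order described above (paths assumed).
preprocess() {
  local proc=WRF-SUEWS/wrfinput-processor   # processor folder (assumed relative path)
  local run=WRF-SUEWS-run                   # your run folder (example name)

  if [ ! -d "$proc" ]; then
    echo "wrfinput-processor not found: clone WRF-SUEWS first"
    return 0
  fi

  # 1. In param_extractor_SuPy, run the extraction scripts (see the repo
  #    for the actual script names); they produce SUEWS_param_new.json
  #    and namelist.suews.new.

  # 2. Grid-level parameters go next to the processor scripts (check that
  #    the change_to_SUEWS script expects this file name):
  cp "$proc/param_extractor_SuPy/SUEWS_param_new.json" "$proc/"

  # 3. Run-level parameters go to the run folder, renamed:
  cp "$proc/param_extractor_SuPy/namelist.suews.new" "$run/namelist.suews"

  # 4. In change_to_SUEWS, run the script that injects the grid-level
  #    parameters into the original wrfinputs, then copy the modified
  #    wrfinput files into the run folder:
  cp "$proc"/change_to_SUEWS/wrfinput_d0* "$run/" 2>/dev/null
}
preprocess
```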
Please let me know if you run into any difficulties with any of the scripts.
I felt puzzled about the relationship among them this afternoon. It's very thoughtful of you to inform me of these promptly.
Some errors occurred when I submitted real.exe to Jasmin using bsub < bsub_run_real
. The error log showed the following:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(784).................:
MPID_Init(1326).......................: channel initialization failed
MPIDI_CH3_Init(141)...................:
dapl_rc_setup_all_connections_20(1396): generic failure with errno = 671107855
MPID_nem_dapl_get_from_bc(1309).......: Missing port or invalid host/port description in business card
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(784).................:
Any hints?
This is an MPI problem on Jasmin. It happens all the time for me. You should keep submitting until it doesn't give you this error. It happens at the very beginning, when the rsl files have not yet been generated. Reducing the number of processors might also help.
I used fewer cores last night, but it still failed. Maybe I should keep trying as you said.
Another problem when running wrf.exe:
INITIALIZE SUEWS NAMELIST -------------- FATAL CALLED --------------- FATAL CALLED FROM FILE:
LINE: 1270 ERROR reading sector coeff of namelist.suews
I used the WRF version under xx-test-xx-2, and namelist.suews was generated for Shanghai. I am wondering if the WRF version is too old to read namelist.suews correctly, or if namelist.suews has changed since WRF was compiled.
This is the same problem I have been getting since Thursday, which I suspect is also an MPI problem, since I used to run this without any issue. Now sometimes it works and sometimes it doesn't! Try chmod 444 namelist.suews, keep running it, and let me know what it says.
I tried many times. Sometimes jobs can be submitted successfully, but they exit soon. The error files always say "ERROR reading sector coeff of namelist.suews". I suppose there might be something wrong with the code itself, or wrf.exe (in xx-test-xx-2) is not consistent with the one in the repo.
Can you also change the number of processors (try different ones) and see if it works?
I did attempt to change the number of processors; unfortunately, the jobs cannot be submitted no matter what the number is. Not even .err or .out files are produced. Can you have a try to see if it is a problem with Jasmin now?
I am running 4 jobs right now. They are in the shared folder starting with Apr, Jul, Oct, and Jan. Try to copy any of them and run them to see if it works. It took me a while to have a successful run of them because of the recent problems.
I copied the Apr-London-Swindon folder to my own directory and changed the forcing data and namelist files (including namelist.input and namelist.suews). Everything else remained the same. Then I submitted the job; however, it looked like I hadn't done anything: no jobs can be found with the jobs command, and no logs are generated. It's really tricky!
We need to find a solution for this instability. Let's work on it together on Monday and try to solve it.
My wrf.exe ran for a while, wrote the wrfout_d01 file for the outermost domain, and then exited. I find that one of the rsl.error.* files ends with
-------------- FATAL CALLED --------------- FATAL CALLED FROM FILE:
LINE: 29365 fatal error in SUEWS:Problem with (z-zd) and/or z0. application called MPI_Abort(MPI_COMM_WORLD, 1) - process 32
and one ends with
-------------- FATAL CALLED --------------- FATAL CALLED FROM FILE:
LINE: 29365 fatal error in SUEWS:Inappropriate value calculated. application called MPI_Abort(MPI_COMM_WORLD, 1) - process 33
Is there anything I might have done wrong?
Might be something wrong with the building/canopy height set in your wrfinput.
Three questions: 1) how many time steps does it run? 2) what is the building height? 3) is the number of vertical grids 33? If so, the first grid point is at around 50 m, so if your building heights are more than this, it raises an error.
A: 1) Only one time step was outputted. 2) bldgH_SUEWS = 35.9 in SUEWS_param_new.json. 3) Yes, it is. So I need to change bldgH_SUEWS to a lower value and modify wrfinput_d0* again, right?
bldgH_SUEWS was reset to 25 and then 22, and the errors are still the same. Maybe try an even lower value?
Heights of trees also matter.
Check these variables: EveTreeH_SUEWS
and DecTreeH_SUEWS
.
Both of these variables are 13.1 in the London run, and they are 9.1 and 10.9 respectively in my case. Which direction should I change them?
Then try setting a higher debug value to see at what height the first/lowest atmospheric level sits.
Can you see anything in the log?
d01 2012-12-01_00:00:00 after SuMin, qn_SUEWS= 76.0966107299998
d01 2012-12-01_00:00:00 after SuMin, qf_SUEWS= 0.000000000000000E+000
d01 2012-12-01_00:00:00 after SuMin, qs_SUEWS= 6.56633480928586
d01 2012-12-01_00:00:00 after SuMin, qh_SUEWS= -129.063718135778
d01 2012-12-01_00:00:00 after SuMin, qe_SUEWS= 198.593994056492
d01 2012-12-01_00:00:00 qn_out = 76.0966107299998
d01 2012-12-01_00:00:00 qf_out = 0.000000000000000E+000
d01 2012-12-01_00:00:00 qs_out = 6.56633480928586
d01 2012-12-01_00:00:00 qh_out = -129.063718135778
d01 2012-12-01_00:00:00 qe_out = 198.593994056492
d01 2012-12-01_00:00:00 First vertical level is 25.4766330718994
d01 2012-12-01_00:00:00 in SuMin, before calculation, OHM_coef: 0.718999981880188 0.718999981880188 0.718999981880188 0.718999981880188 0.194000005722046 0.194000005722046 0.194000005722046 0.194000005722046 -36.5999984741211 -36.5999984741211 -36.5999984741211 -36.5999984741211
d01 2012-12-01_00:00:00 Problem: In stability subroutine, (z-zd) < z0.
d01 2012-12-01_00:00:00 ERROR! Program stopped: Problem with (z-zd) and/or z0.
d01 2012-12-01_00:00:00 Values: 0.4766 3.6000
d01 2012-12-01_00:00:00 17
d01 2012-12-01_00:00:00 ERROR! SUEWS run stopped.
-------------- FATAL CALLED --------------- FATAL CALLED FROM FILE:
LINE: 29365 fatal error in SUEWS:Problem with (z-zd) and/or z0. application called MPI_Abort(MPI_COMM_WORLD, 1) - process 32
First vertical level is 25.4766330718994
This is very close to the surface. Might need to manipulate the eta levels for a higher first level.
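One way to raise the first model level is to prescribe the eta levels explicitly in the &domains section of namelist.input. The fragment below is purely illustrative (not values from this case): the list must start at 1.000, decrease monotonically to 0.000, and contain exactly e_vert entries; a larger gap between the first two values lifts the lowest model level.

```
&domains
 e_vert     = 28,    28,    28,
 eta_levels = 1.000, 0.988, 0.975, 0.960, ...,
              0.010, 0.000,
/
```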
Or decrease the number of grid points in the vertical direction in namelist.input.
Eta levels have been decreased from 33 to 28. The error now becomes:
d01 2012-12-01_00:00:00 call cumulus_driver
d01 2012-12-01_00:00:00 in cu_tiedtke
d01 2012-12-01_00:00:00 returning from cumulus_driver
d01 2012-12-01_00:00:00 call shallow_cumulus_driver
d01 2012-12-01_00:00:00 calling inc/HALO_EM_FDDA_SFC_inline.inc
d01 2012-12-01_00:00:00 call fddagd_driver
d01 2012-12-01_00:00:00 call calculate_phy_tend
d01 2012-12-01_00:00:00 call compute_diff_metrics
d01 2012-12-01_00:00:00 calling inc/HALO_EM_TKE_C_inline.inc
Fatal error in PMPI_Wait: A process has failed, error stack:
PMPI_Wait(198)............: MPI_Wait(request=0x53a8d5c, status=0x7ffcbca84170) failed
MPIR_Wait_impl(79)........:
dequeue_and_set_error(933): Communication error with rank 29
Is this an MPI problem or not?
I have run the model twice and the errors are the same.
Looks like it. I don't think there is anything we can do about that.
But I am still seeing the z-zd<0
in rsl.out.024
. Check this: grep "First vertical level" rsl.out.00*
and look for the lowest value. Maybe try decreasing the eta level of the second grid point in your namelist.input
to lower than 0.90
, and see what happens.
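To pull out the lowest first-level height across all ranks in one go, something like this works (each MPI rank writes its own rsl.out.NNNN file):

```shell
#!/bin/bash
# Report the minimum "First vertical level" height across all rsl.out
# files; the height is the last field on each matching log line.
lowest_level() {
  grep -h "First vertical level" rsl.out.00* 2>/dev/null \
    | awk '{print $NF}' | sort -n | head -n 1
}
lowest_level
```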
It has been running for two or more time steps for all three domains and still continues. I can have a good sleep! Thank you all.
Good job! What was the final problem? Is it still running?
I modified the eta levels as you said then it succeeded. It is still running now.
@zhenkunl can we close this?
Sure.
@hamidrezaomidvar, I just got @zhenkunl safely landed on Jasmin. Can you give him some orientation on how to use the compiled WRF-SUEWS and WPS to run his Shanghai case? I know you have something for Shui, which might be useful for @zhenkunl as well.
Many thanks!