geoschem / geos-chem

GEOS-Chem "Science Codebase" repository. Contains GEOS-Chem science routines, run directory generation scripts, and interface code. This repository is used as a submodule within the GCClassic and GCHP wrappers, as well as in other modeling contexts (external ESMs).
http://geos-chem.org

[BUG/ISSUE] Error creating restart file with GEOS-Chem 12.3.2 #167

yantosca closed this issue 4 years ago

yantosca commented 4 years ago

I am opening this issue on behalf of Rong Chien at U. Tennessee:

I am Rong-You Chien, a Ph.D. student at the University of Tennessee, Knoxville, advised by Dr. Joshua Fu. We are currently running GEOS-Chem 12.3.2 on DOE's Cori machine and have run into some strange problems. When a run finishes and tries to write the GEOSChem.Restart netCDF file, it usually produces a file with an unknown format and fails with this error message:


-----------------------  
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

In Nccr_Wr, cannot create: GEOSChem.Restart.20140801_0000z.nc4

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Code stopped from DO_ERR_OUT (in module NcdfUtil/m_do_err_out.F90)


But sometimes it passes. In most cases it passes when I resubmit the job using the same settings in the same directory, leaving the original failed nc4 file in place and changing nothing.

Even when we do get a readable nc file and can see the values using ncdump, we cannot use that restart file to start the next month; we get this error message:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

In Ncop_Rd, cannot open: ./GEOSChem.Restart.20140201_0000z.nc4

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Code stopped from DO_ERR_OUT (in module NcdfUtil/m_do_err_out.F90)


It is pretty weird to us, since GEOS-Chem can read the ExtData files downloaded from the GEOS-Chem server but has issues with its own output files.

Have you heard of any problems like this? How can we solve it?

Files attached:

- [GEOS_CHEM_one.err.26644580.txt](https://github.com/geoschem/geos-chem/files/3946771/GEOS_CHEM_one.err.26644580.txt)
- [GEOS_CHEM_one.out.26644580.txt](https://github.com/geoschem/geos-chem/files/3946774/GEOS_CHEM_one.out.26644580.txt)
- [run_geoschem_one.sh.txt](https://github.com/geoschem/geos-chem/files/3946776/run_geoschem_one.sh.txt)
yantosca commented 4 years ago

I noticed that you are trying to run GEOS-Chem "Classic" on more than one node. The output in your log file shows the startup information being printed repeatedly, once by each of the 4 nodes:

*************   S T A R T I N G   2 x 2.5   G E O S - C H E M   *************

===> Mode of operation         : GEOS-Chem "Classic"
===> GEOS-Chem version         : 12.3.2
===> Compiler                  : Intel Fortran Compiler (aka ifort)
===> Driven by meteorology     : GMAO GEOS-FP (on native 72-layer vertical grid)
*************   S T A R T I N G   2 x 2.5   G E O S - C H E M   *************

===> Mode of operation         : GEOS-Chem "Classic"
===> GEOS-Chem version         : 12.3.2
===> Compiler                  : Intel Fortran Compiler (aka ifort)
===> Driven by meteorology     : GMAO GEOS-FP (on native 72-layer vertical grid)
*************   S T A R T I N G   2 x 2.5   G E O S - C H E M   *************

===> Mode of operation         : GEOS-Chem "Classic"
===> GEOS-Chem version         : 12.3.2
===> Compiler                  : Intel Fortran Compiler (aka ifort)
===> Driven by meteorology     : GMAO GEOS-FP (on native 72-layer vertical grid)
===> ISORROPIA ATE package     : ON
===> Parallelization w/ OpenMP : ON
===> Binary punch diagnostics  : ON
===> netCDF diagnostics        : ON
===> ISORROPIA ATE package     : ON
===> Parallelization w/ OpenMP : ON
===> ISORROPIA ATE package     : ON
===> Parallelization w/ OpenMP : ON
===> netCDF file compression   : SUPPORTED
===> Binary punch diagnostics  : ON
===> netCDF diagnostics        : ON
===> Binary punch diagnostics  : ON
===> netCDF diagnostics        : ON
===> netCDF file compression   : SUPPORTED
===> netCDF file compression   : SUPPORTED

In your job script, please reduce the number of nodes from 4 to 1 and the number of CPUs per task from 272 to, say, 24, and then try again. I suspect this might be causing your restart file issue. The batch directives would then look like this:

#### Batch system directives
#SBATCH  -A acme
#SBATCH  --qos=regular
#SBATCH  --nodes=1
#SBATCH  --time=03:00:00
#SBATCH  --cpus-per-task=24
#SBATCH  --constraint=knl,quad,cache
#SBATCH  --exclusive
#SBATCH  --job-name=GEOS_CHEM_multi_run
#SBATCH  --output=GEOS_CHEM_one.out.%j
#SBATCH  --error=GEOS_CHEM_one.err.%j

GEOS-Chem "Classic" is only able to be run on a single node. But you can run GEOS-Chem with the high-performance option (GCHP) on multiple nodes using MPI parallelization. See our Getting Started with GCHP guide for more information.

yantosca commented 4 years ago

Also, with SLURM you can set OMP_NUM_THREADS to the same number of CPUs that you have requested. At the top of the script you can add:

#SBATCH --cpus-per-task=24

or equivalently

#SBATCH -c 24

Then in your script, where you set the environment variables, type:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

That will make sure you run with the same number of CPUs that you have requested via SLURM.
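
As a quick sanity check, you can echo the values inside the job script to confirm that the thread count matches the allocation (a trivial sketch; the echo line is only illustrative):

```bash
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo "SLURM allocated ${SLURM_CPUS_PER_TASK} CPUs per task; OpenMP will use ${OMP_NUM_THREADS} threads"
```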

pshchien commented 4 years ago

Hello Dr. Yantosca, one more question: our machine has 68 cores per node, each with 4 hardware threads. In this case, can we use all of them with the "Classic" version of GEOS-Chem, or is the limit 24 CPUs?

Rong-You Chien

yantosca commented 4 years ago

You can use all 68 cores per node, but be aware that you might get better performance using fewer cores with GEOS-Chem "Classic". See http://wiki.geos-chem.org/GEOS-Chem_scalability.

Because GEOS-Chem "Classic" uses OpenMP parallelization, the restriction is that all of the cores have to be able to see all of the memory on the node. So when you add more cores there is more communication, and that extra overhead starts to dominate at higher core counts. This is the reason we are also developing GCHP, which uses MPI parallelization, and which can take advantage of multiple nodes on a cluster.
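
If you want to see where that crossover happens on your own hardware, a rough timing loop over thread counts is enough (a sketch: it assumes a short simulation period is configured in input.geos and that the executable is ./geos in the run directory):

```bash
# Rough scalability check: time the same short simulation at several thread counts.
for n in 8 16 24 32 48 68; do
  export OMP_NUM_THREADS=$n
  echo "=== Running with $n OpenMP threads ==="
  /usr/bin/time -p ./geos > "GC.${n}threads.log" 2>&1
done
```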

pshchien commented 4 years ago

Hello Dr. Yantosca, I see. I will try 68 cores first, and will also test GCHP.

Rong-You Chien

yantosca commented 4 years ago

Thanks! I will go ahead and close out this issue. Feel free to add another issue if you still run into problems.

pshchien commented 4 years ago

Hello Dr. Yantosca, I tried again with the one-node "Classic" GEOS-Chem version 12.3.2 and successfully generated the restart file, but once again I failed to read it.

I manually copied the GEOSChem.Restart file into the run directory, but I got the same error message again. I noticed that the restart file I generated is smaller than the initial file I downloaded from Harvard's FTP server. Could that be the reason?

Attached are the log files I generated. The one-day simulation is the run that produced the restart file; I simulated just one day, then renamed the output to ./GEOSChem.Restart.20150101_0000z.nc4 and asked the new run to use it.

Looking forward to hearing from you,

Rong-You Chien

geos_chem.zip

yantosca commented 4 years ago

The input.geos file in your one_day run shows:

GEOS-CHEM UNIT TEST SIMULATION: geosfp_2x25_standard
------------------------+------------------------------------------------------
%%% SIMULATION MENU %%% :
Start YYYYMMDD, hhmmss  : 20150101 000000
End   YYYYMMDD, hhmmss  : 20150102 000000

which indicates a 1-day run.

But the input.geos file in your new_run folder shows:

GEOS-CHEM UNIT TEST SIMULATION: geosfp_2x25_standard
------------------------+------------------------------------------------------
%%% SIMULATION MENU %%% :
Start YYYYMMDD, hhmmss  : 20150101 000000
End   YYYYMMDD, hhmmss  : 20150201 000000

So I am guessing you might have intended the new run to start from 20150102 instead of from 20150101. That is why it cannot find the restart file: the restart file from the one_day run has date 20150102, not 20150201.

pshchien commented 4 years ago

Hello Dr. Yantosca, yes, I renamed that restart file to 20150101 for the new run. Would it still fail to read the file in that case?

Rong-You Chien

lizziel commented 4 years ago

Hi Rong-You, GEOS-Chem allows input restart files to be renamed from a different day. The time values inside the file will not match the date in the filename, but that should not cause an issue. Have you tried opening the netCDF file with a netCDF viewer or other software that can read netCDF? I suggest doing this to make sure the file is not corrupt.
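
A quick command-line check with the standard netCDF utilities (a sketch, not something GEOS-Chem itself provides) is to ask ncdump what format the file claims to be and then force every variable to be read:

```bash
# Report the on-disk format (classic, 64-bit offset, netCDF-4, ...)
ncdump -k GEOSChem.Restart.20150101_0000z.nc4

# Dump only the header: dimensions, variables, and attributes
ncdump -h GEOSChem.Restart.20150101_0000z.nc4

# Copying the file reads every variable and fails loudly if the file is
# truncated or corrupt
nccopy GEOSChem.Restart.20150101_0000z.nc4 /tmp/restart_copy.nc4
```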

pshchien commented 4 years ago

Hello Dr. Lundgren, although I have not looked at the contents, I can see the file header via ncdump. That is why I am curious about this question.

Rong-You Chien

pshchien commented 4 years ago

Hello Dr. Lundgren, I can also read the file from MATLAB using ncread. Which variables would you suggest I check?

Sincerely, Rong-You Chien

lizziel commented 4 years ago

If you can successfully read it with MATLAB and ncdump, then it sounds like the file is at least mostly valid. Another thing you can do is print out the netCDF error code from GEOS-Chem. In the routine referenced in your error message (Ncop_Rd, which reports errors through NcdfUtil/m_do_err_out.F90) there is an error return variable called ierr. It should be zero on a successful read, but it is being returned as non-zero, which triggers the error message you see. You can go into that code, add a print statement for ierr, and see what you get. Then look up the netCDF error codes online to see which one it corresponds to: https://www.unidata.ucar.edu/software/netcdf/docs/nc-error-codes.html.
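
Once you have the numeric value of ierr, one way to find its symbolic name is to search the netCDF headers installed on your system (a sketch; nc-config ships with netCDF-C, and on Cray systems the code may instead come from PnetCDF, whose install path below is only a guess):

```bash
# Locate the netCDF-C include directory and search it for the error code
NC_INC=$(nc-config --includedir)
grep -Rn "(-231)" "$NC_INC"

# If nothing turns up, the code may be a PnetCDF error instead; search its
# headers the same way (the path is an assumption -- adjust for your system)
# grep -n "(-231)" /opt/cray/pe/parallel-netcdf/*/include/pnetcdf.h
```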

pshchien commented 4 years ago

Hello Dr. Lundgren, I got error code 231:

#define NC_ETYPESIZE_MISMATCH (-231) // file type size mismatches buffer type size

So it seems that file size is still the problem?

What is the usual size for a restart file? My restart file is 619 MB, while the file I downloaded from the Harvard server is 1001 MB.

The variable names are also different. Is there a restart file I can compare with the one I generate from GEOS-Chem v12?

Rong-You Chien

lizziel commented 4 years ago

It seems like this is a corrupt file issue, although I am not certain. What do you mean by the variable names being different? What list of variables do you get when you run ncdump -h?

We do not have any additional 2x2.5 restart files beyond the initial files available for download. However, you can look at our 4x5 restart files generated during GEOS-Chem benchmarking. For example, the 12.2.0 1-month 4x5 benchmark output is available at http://ftp.as.harvard.edu/gcgrid/geos-chem/1mo_benchmarks/GC_12/12.2.0/.

Have you tried generating a restart file at a lower resolution and/or for other simulations? I am wondering whether the problem is your system or your config files. If you have time, try a 4x5 run with the transport tracer simulation; it is very lightweight and fast. See if you can start a new simulation from its output.
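
To answer the ncdump -h question concretely, one way (a sketch using standard shell tools; the filenames are placeholders for your generated and downloaded restart files) is to pull out the species variable names from each file and diff them:

```bash
# Extract the SpeciesRst_ variable names from each restart file and compare
ncdump -h my_generated_restart.nc4 \
  | grep -oE "SpeciesRst_[A-Za-z0-9]+" | sort -u > mine.txt
ncdump -h downloaded_initial_restart.nc4 \
  | grep -oE "SpeciesRst_[A-Za-z0-9]+" | sort -u > initial.txt
diff mine.txt initial.txt
```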

pshchien commented 4 years ago

Hello Dr. Lundgren, attached are the species names I found in the restart file from GEOS-Chem and in the initial file I downloaded from ExtData/BPCH_RESTART, which is linked by the UT for the simulation.

I get Met*, Chem and SpeciesRst_ variables in my restart file, but I do not have SpeciesRst_ASO??, ISO??, and TSO??, which I can find in the initial data.

After some further testing, I would like to clarify this question. Currently, when I copy the GEOSChem.Restart file into a new folder generated from the UT, I get that error message, but if I use the original folder and keep running the simulation, it passes, even though I have the same settings in input.geos, HEMCO_Config.rc and HISTORY.rc.

Does GEOS-Chem also read other files in a continuation run that prevent this corruption?

I will try the coarser grid and compare it with the benchmark.

Thanks for your kind help,

Rong-You Chien

Species_Name.xlsx

lizziel commented 4 years ago

The extra variables you see in the initial restart file that are missing from your output restart file are okay. They are used in simulations other than the one you are running, specifically when complex SOA is turned on. If a species in the input restart file is not needed for the simulation, it is simply ignored.

My new understanding from your last response is that your output restart file works as an input restart file if you reuse the same run directory used to generate it. Is this correct?

yantosca commented 4 years ago

It looks like the initial restart file you show contains the extra secondary organic aerosol species. But the species in your logfile output seem to indicate that you are running a standard simulation (not complexSOA). Can you confirm that the restart file is the proper one for the simulation that you are trying to perform?

Also: this error seems to indicate that you might be reading a real*4 netCDF file variable into a real*8 variable, or an integer*4 into an integer*8:

#define NC_ETYPESIZE_MISMATCH           (-231) // file type size mismatches buffer type size

Are you using a Cray machine? I think by default it might set INTEGER to INTEGER*8.
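
One way to check what the wrapped compiler does by default (a sketch; it just compiles a tiny test program with the same ftn wrapper you use for GEOS-Chem and prints the default kinds) is:

```bash
# Print the default INTEGER and REAL kinds (4 = 32-bit, 8 = 64-bit)
cat > kindcheck.f90 <<'EOF'
program kindcheck
  implicit none
  integer :: i
  real    :: r
  print *, 'default INTEGER kind = ', kind(i)
  print *, 'default REAL    kind = ', kind(r)
end program kindcheck
EOF
ftn kindcheck.f90 -o kindcheck
./kindcheck
```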

lizziel commented 4 years ago

@yantosca Our restart files for the benchmark (w/ complex SOA) and standard simulations are the same. The complex SOA species are ignored when running standard simulations.

yantosca commented 4 years ago

Oh right. Forgot about that.

The error still seems to be a problem with reading in the wrong data type.

pshchien commented 4 years ago

@lizziel Not exactly. I reuse the same run directory with a new time tag, so I am running it as a continuation run, and I did not change the date of the restart file. I have not checked whether it works if I rename the restart file to a different date.

@yantosca I think I am using an Intel Xeon Phi processor. But in that case, why can GEOS-Chem read the restart file when I use the same directory? Does GEOS-Chem detect whether I am running a new simulation or a continuation run?

Rong-You Chien

yantosca commented 4 years ago

@pshchien Which compiler are you using? Can you attach the lastbuild file from the run directory?

yantosca commented 4 years ago

A couple more things:

  1. Cori is indeed a Cray system. Sometimes things on Cray UNICOS are slightly different from most Linux systems. See: https://docs.nersc.gov/systems/cori/

  2. I noticed you are using "#SBATCH --constraint=knl,quad,cache", which requests the Knights Landing (KNL) chips. This chipset might not be well suited to an OpenMP-parallelized code like GEOS-Chem in "Classic" mode. You might try "#SBATCH --constraint=haswell", which will use an older Intel chipset (Haswell). We develop on Haswell chips, and they give good performance with GEOS-Chem. For more info about KNL, see: https://docs.nersc.gov/performance/knl/getting-started/

pshchien commented 4 years ago

Dr. Yantosca, sorry, I assumed it was not a Cray because the node information lists the processor as Intel. Here is my lastbuild.mp. I used ftn, while setting

      COMPILER_FAMILY    :=Intel
      USER_DEFS          += -DLINUX_IFORT

in Code/Makefile_header.mk

Thanks for your advice. I will check whether I can use the Haswell nodes with my account.

Rong-You Chien

lastbuild.mp.log

yantosca commented 4 years ago

In your lastbuild file you have:

COMPILER     : ftn 1985-2019

So ftn is the Cray compiler wrapper. See: https://docs.nersc.gov/programming/compilers/wrappers/. I wonder which compiler is underneath it. Can you run:

ftn --version

and then paste the results here? If ftn is wrapping the Cray Fortran compiler instead of the Intel Fortran compiler, that may be causing the issue. I have worked on Crays before, and I know the Cray compilers use INTEGER*8 for INTEGER and REAL*8 for REAL by default.

yantosca commented 4 years ago

Also, you might need to use the native ifort compiler instead of the wrapped one. See here: https://docs.nersc.gov/programming/compilers/native/. You can ask your IT people for more info about this.
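
On Cray systems the compiler that ftn wraps is selected by the loaded PrgEnv module, so a quick check is the sketch below (module names can vary between systems, so treat the exact swap command as an assumption):

```bash
# See which programming environment is active; PrgEnv-intel means ftn wraps ifort
module list 2>&1 | grep PrgEnv

# If PrgEnv-cray is loaded, swap to the Intel environment
module swap PrgEnv-cray PrgEnv-intel

# Confirm which compiler ftn now wraps
ftn --version
```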

pshchien commented 4 years ago

Hello Dr. Yantosca, I used the Intel compiler:

ifort (IFORT) 19.0.3.199 20190206
Copyright (C) 1985-2019 Intel Corporation.  All rights reserved.

I will try to use the native ifort. I tried that on Cori before but it failed, and they suggested that I switch to ftn and set the compiler family to work around the problem.

Rong-You Chien

yantosca commented 4 years ago

My hunch is that the ftn wrapper might have some pre-selected settings that are optimized for MPI parallelization, which may not be necessary for GEOS-Chem "Classic". You should be able to see the compiler options in the log file.
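
One way to see exactly what the wrapper adds (a sketch; -craype-verbose is the Cray wrapper option that echoes the underlying compiler command, and the grep pattern is only an example since field names in lastbuild.mp may differ) is:

```bash
# Ask the Cray wrapper to echo the real compiler command line it invokes
touch empty.f90
ftn -craype-verbose -c empty.f90

# Search the GEOS-Chem build record for the Fortran flags that were used
grep -i "flags" lastbuild.mp
```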

yantosca commented 4 years ago

@pshchien Just checking in. Are you still having problems getting GEOS-Chem to compile on your system? If not, we can close out this issue.

pshchien commented 4 years ago

@yantosca I can still only use ftn to compile GEOS-Chem, but for now I think we can close this issue. Thanks for your help.

Rong-You Chien

yantosca commented 4 years ago

Thanks. I think this is not so much a GEOS-Chem issue as it is a local cluster setup issue. I'll go ahead and close this out.