NorESMhub / noresm2cmor

A command line tool for cmorizing NorESM output
http://noresmhub.github.io/noresm2cmor/

[CMIP6 CMOR-ization & ESGF-publication] NorESM2-MM - piControl #140

Closed matsbn closed 1 year ago

matsbn commented 4 years ago

Mandatory information:

Full path to the case(s) of the experiment on NIRD /projects/projects/NS9560K/noresm/cases /projects/projects/NS9560K/FRAM/noresm/cases

experiment_id piControl

model_id NorESM2-MM

CASENAME(s) and years to be CMORized
N1850frc2_f09_tn14_20191001, 1200-1299
N1850frc2_f09_tn14_20191012, 1300-1449
N1850frc2_f09_tn14_20191113, 1450-1699

Optional information

parent_experiment_id piControl-spinup

parent_experiment_rip r1i1p1f1

parent_time_units 'days since 0001-01-01'

branch_method 'Hybrid-restart from year 1200-01-01 of piControl-spinup'

other information

matsbn commented 4 years ago

The full path to the case(s) of the experiment on NIRD should be

/projects/NS9560K/noresm/cases /projects/NS9560K/FRAM/noresm/cases

with case N1850frc2_f09_tn14_20191001 in /projects/NS9560K/noresm/cases and N1850frc2_f09_tn14_20191012 in /projects/NS9560K/FRAM/noresm/cases.

YanchunHe commented 4 years ago

A note on the post-processing of the MM experiments:

The processing of the NorESM2-MM experiments is currently slow due to two factors:

One reason is of course the high-resolution and high-frequency output.

Another reason is that the cmor tool has crashed many times at seemingly arbitrary points,

normally with a simple error such as HDF error,

or something more detailed such as:

*** Error in `./noresm2cmor3': free(): invalid pointer: 0x00002b3f39edcd68 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81489)[0x2b3f39b97489]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5MM_xfree+0xb)[0x2b3f3d80c38b]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(+0x204d3d)[0x2b3f3d848d3d]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5O_msg_reset+0x62)[0x2b3f3d84b2c2]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5G__link_release_table+0x4f)[0x2b3f3d7b042f]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5G__dense_iterate+0xac)[0x2b3f3d7a65dc]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5G__obj_iterate+0x131)[0x2b3f3d7b93d1]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5G_iterate+0xe6)[0x2b3f3d7ad886]
/opt/hdf5-1.10.2-intel/lib/libhdf5.so.101(H5Literate+0x12c)[0x2b3f3d7f9f0c]
/opt/netcdf-4.6.1-intel/lib/libnetcdf.so.13(+0xeed38)[0x2b3f38acdd38]
/opt/netcdf-4.6.1-intel/lib/libnetcdf.so.13(NC4_open+0x2ee)[0x2b3f38acf02e]
/opt/netcdf-4.6.1-intel/lib/libnetcdf.so.13(NC_open+0x28f)[0x2b3f38a0c99f]
/opt/netcdf-4.6.1-intel/lib/libnetcdf.so.13(nc_open+0x17)[0x2b3f38a0c707]
/opt/netcdf-4.6.1-intel/lib/libnetcdff.so.6(nf_open_+0x9c)[0x2b3f3856ce7c]
/opt/netcdf-4.6.1-intel/lib/libnetcdff.so.6(netcdf_mp_nf90_open_+0x132)[0x2b3f38597fc2]

...

ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
noresm2cmor3       00000000005EB46A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B3F399095D0  Unknown               Unknown  Unknown
libc-2.17.so       00002B3F39B4C207  gsignal               Unknown  Unknown
libc-2.17.so       00002B3F39B4D8F8  abort                 Unknown  Unknown
libc-2.17.so       00002B3F39B8ED27  Unknown               Unknown  Unknown
libc-2.17.so       00002B3F39B97489  Unknown               Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D80C38B  H5MM_xfree            Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D848D3D  Unknown               Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D84B2C2  H5O_msg_reset         Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D7B042F  H5G__link_release     Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D7A65DC  H5G__dense_iterat     Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D7B93D1  H5G__obj_iterate      Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D7AD886  H5G_iterate           Unknown  Unknown
libhdf5.so.101.1.  00002B3F3D7F9F0C  H5Literate            Unknown  Unknown
libnetcdf.so.13.1  00002B3F38ACDD38  Unknown               Unknown  Unknown
libnetcdf.so.13.1  00002B3F38ACF02E  NC4_open              Unknown  Unknown
libnetcdf.so.13.1  00002B3F38A0C99F  NC_open               Unknown  Unknown
libnetcdf.so.13.1  00002B3F38A0C707  nc_open               Unknown  Unknown
libnetcdff.so.6.1  00002B3F3856CE7C  nf_open_              Unknown  Unknown
libnetcdff.so.6.1  00002B3F38597FC2  netcdf_mp_nf90_op     Unknown  Unknown
noresm2cmor3       000000000048B937  m_utilities_mp_ge         791  m_utilities.F
noresm2cmor3       000000000048A387  m_utilities_mp_sc         686  m_utilities.F
noresm2cmor3       00000000004DFF5B  Unknown               Unknown  Unknown
noresm2cmor3       000000000055451C  MAIN__                     55  noresm2cmor.F
noresm2cmor3       000000000040DE6E  Unknown               Unknown  Unknown
libc-2.17.so       00002B3F39B383D5  __libc_start_main     Unknown  Unknown
noresm2cmor3       000000000040DD69  Unknown               Unknown  Unknown

It looks like it crashes during NetCDF file reading, but I think this is not a problem with the file itself.

What I plan to do is to try changing the optimisation compiler flag from -O2 to -O0.

Ingo, do you agree, or do you have any other ideas? @IngoBethke

IngoBethke commented 4 years ago

I experienced Matlab and nco crashes on NIRD today and wonder whether the login nodes had some resource issues today.

-O2 is usually safe, so I don't think -O0 will have any effect other than making the code slow. In any case, the crash above was in the HDF5 library which is compiled with -O2.

It is hard to identify the problem if the crashes do not occur at the same point. If the crashes occur only for long simulations with extensive file scanning, then it can be worthwhile checking the code for missing netcdf close statements (a very typical bug). There is a user limit on how many open files a user can have at the same time (check with ulimit -a), and if you are running several instances of the tool in parallel, then a missing close statement can cause a crash at a seemingly arbitrary position.
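For reference, a quick way to check both the limit and how many files a running instance actually has open (a minimal sketch; only the process name is taken from the backtrace above):

# per-process limit on open file descriptors
ulimit -n
# count open file descriptors of running noresm2cmor3 instances (Linux /proc)
for pid in $(pgrep -u "$USER" noresm2cmor3); do
    echo "PID $pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l) open files"
done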

First, I would recommend testing with just a single instance of noresm2cmor3 per node and using top to monitor the memory consumption.

If the crashes always occur during reading of the same input file, then I usually use ncdump (ideally from the same library installation as used in noresm2cmor) to dump the entire content of the input file. In most cases this will reproduce the problem.
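For example, something along these lines (a sketch; the ncdump path is an assumption based on the library location in the backtrace above):

# dump the full content (header and data) of a suspect input file
/opt/netcdf-4.6.1-intel/bin/ncdump suspect_input_file.nc > /dev/null
# if this aborts with an HDF error, the file or the read path is the problem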

monsieuralok commented 4 years ago

Hi Yanchun, could you come to my office so we can try to debug it a bit? Alok

YanchunHe commented 4 years ago

Many thanks for the reply, Ingo!

I experienced Matlab and nco crashes on NIRD today and wonder whether the login nodes had some resource issues today.

This happened not only yesterday, but for quite some days now. So it should not be a problem with the disk.

-O2 is usually safe, so I don't think -O0 will have any effect other than making the code slow. In any case, the crash above was in the HDF5 library which is compiled with -O2.

It is hard to identify the problem if the crashes do not occur at the same point. If the crashes occur only for long simulations with extensive file scanning, then it can be worthwhile checking the code for missing netcdf close statements (a very typical bug). There is a user limit on how many open files a user can have at the same time (check with ulimit -a), and if you are running several instances of the tool in parallel, then a missing close statement can cause a crash at a seemingly arbitrary position.

This is the MM piControl run; the simulation is not that long compared to some other simulations, and each cmor task processes only 10 years of data. Some jobs finish successfully for some 10-yr spans, but others can't.

So I don't know whether the 'netcdf close' statements matter in such a situation? I see that ulimit -n is 1048576, so that is not a strict limit on the number of open files.

Each time I only submit 8 cmor tasks, either as 8 parallel threads or as 8 different serial jobs, but this problem can happen in both situations.

First, I would recommend testing with just a single instance of noresm2cmor3 per node and using top to monitor the memory consumption.

I also tried a single instance of noresm2cmor3; it also failed, but that looks like it was due to another temporary disk problem (Stale file handle).

It is hard to monitor the memory consumption, since it takes quite a long time until the crash. But maybe I can use some automatic logging of the memory consumption.
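Something like the following could do the logging (a rough sketch; the interval and log file name are arbitrary):

# append the resident memory (RSS, in kB) of all noresm2cmor3 processes every minute
while pgrep -u "$USER" noresm2cmor3 > /dev/null; do
    { date '+%F %T'; ps -u "$USER" -o pid,rss,comm | grep noresm2cmor3; } >> mem.log
    sleep 60
done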

If the crashes always occur during reading of the same input file, then I usually use ncdump (ideally from the same library installation as used in noresm2cmor) to dump the entire content of the input file. In most cases this will reproduce the problem.

I will check if this is reproducible, e.g. whether the crashes occur while reading the same file.

YanchunHe commented 4 years ago

Hi Yanchun, could you come to my office so we can try to debug it a bit? Alok

Good, I will talk to you around 13:00.

YanchunHe commented 4 years ago

I also tried a single instance of noresm2cmor3; it also failed, but that looks like it was due to another temporary disk problem (Stale file handle).

By the way, the stale file handle problem occurs just because the temporary fram:/cluster/NS9560K is not mounted properly to the NIRD mount point /projects/NS9560K/FRAM/.

I will try to change to another login node of NIRD.

YanchunHe commented 4 years ago

Hi Ingo and Alok,

I tried again with both 8 MPI tasks for the historical run and one serial task for piControl. Both of them now finish the job successfully.

I monitored the maximum memory occupation: the MPI threads take at most 3.0 GB and the serial task takes at most 6.5 GB. Therefore, there should be no memory leak in this case, and we don't need to debug this now. @monsieuralok

I suspect the 'HDF error' problem is likely caused by the instability of the temporary disk mounted from FRAM to nird:/projects/NS9560K/FRAM, since this mount is quite unstable as I have noticed.

During some days of last week and over the weekend, /projects/NS9560K/FRAM was only mounted on the login0 node of NIRD, so I could only run the post-processing on login0 for MM (and some other LM experiments). I wrote to Sigma2 support, and now it is available on the other nodes.

The post-processing of the MM experiments should hopefully progress faster this week. @matsbn

YanchunHe commented 4 years ago

This 'HDF error' still occurs very often for the experiments stored on the temporary /projects/NS9560K/FRAM, for both NorESM2-MM and NorESM2-LM.

I strongly suspect this is due to instability while reading data from this mount.

I will launch a noresm2cmor task with debug mode on; maybe @monsieuralok can help to debug this.
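In the meantime, a cheap safeguard (a sketch only; it does not fix the mount, it just avoids wasting a run) could be to probe the mount before starting a task:

# abort early if the FRAM mount is stale instead of crashing mid-run
if ! stat /projects/NS9560K/FRAM > /dev/null 2>&1; then
    echo "FRAM mount looks stale on $(hostname), not starting noresm2cmor" >&2
    exit 1
fi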

YanchunHe commented 4 years ago

I submitted the job before lunch; it crashed again at ca. 12:00.

At some point before that, I used the ls command, and it again showed errors on the file system:

yanchun@login-nird-0:~
$ ls
ls: cannot access ftp: Stale file handle
ls: cannot access workshop: Stale file handle
ls: cannot access mld_diff_new-old.nc: Stale file handle
ls: cannot access logs: Stale file handle
ls: cannot access archive: Stale file handle
ls: cannot access cmor2.log.v20191108b: Stale file handle
ls: cannot access Datasets: Stale file handle
ls: cannot access mld_diff_new-old.pdf: Stale file handle
ls: cannot access mld_diff_new-old.png: Stale file handle
...

This Stale file handle error happens very often, I am afraid to say.

The noresm2cmor program aborted, this time very likely due to this.

But there is no error reported (no HDF error either this time). Log files are:

We have to find another solution for the experiments stored on /projects/NS9560K/FRAM; otherwise we waste too many resources and too much time crashing again and again.

Wondering if it is possible to run noresm2cmor on FRAM and transfer the data to NIRD?

Or ideally transfer this model output to NIRD at some place, and then delete it?

Or wait until the new storage on NIRD is available for these data.

Any other ideas?

oyvindseland commented 4 years ago

I am sorry I have not followed this discussion well, but I saw it now because I was asked today about when any of the NorESM scenarios might be found on ESGF.

Wondering if it is possible to run noresm2cmor on FRAM and transfer the data to NIRD?

How much work would it be to make the script work there? Does anyone have an idea?

Or ideally transfer this model output to NIRD at some place, and then delete it?

Is the temporary disk stable enough to copy from NIRD to FRAM, e.g. the LM control, to get some free space on NIRD? We should probably use rsync via the internet and not try to copy the data directly. Or do we need to rsync them to the work disk on FRAM and then to the temporary disk, of course checking that the data is kept intact all the time?

Or ideally transfer this model output

Run noresm2cmor locally at Nersc, Norce or MET, i.e. copying the data to local disks?

Or wait until the new storage on NIRD is available for these data.

Probably the best solution, but the timeline is uncertain and it is not good for the use of NorESM2 data in the MIPs.

oyvindseland commented 4 years ago

I made a wrong citation for one of the suggestions:

Or ideally transfer this model output

Run noresm2cmor locally at Nersc, Norce or MET, i.e. copying the data to local disks?

YanchunHe commented 4 years ago

I would like to copy the data to NIRD temporarily.

There is a 260 TB disk quota for NS9034K, and 200 TB of it is now used for cmorized data.

I don't know if this is allowed for this project? @IngoBethke

If so, maybe Jan can help to copy the experiments there, and I can then do the cmorization.

I can come up with a detailed list of experiments (or partially, some of the years of the experiments).

matsbn commented 4 years ago

I think copying data temporarily to NS9034K could be an idea, and I actually discussed this option with @oyvindseland this afternoon. It will be a balancing act between space used for raw data and space needed for the cmorized output.

YanchunHe commented 4 years ago

I think copying data temporarily to NS9034K could be an idea, and I actually discussed this option with @oyvindseland this afternoon. It will be a balancing act between space used for raw data and space needed for the cmorized output.

This sounds good! But you may soon need to ask for more space for the NS9034K project.

Mats, would you invite/ask Jan to join this repository, so that he can subscribe and be notified?

I will post updates in the different issues for the experiments that need to be copied to NS9034K.

YanchunHe commented 4 years ago

The following periods of piControl model output (see path and case names in the first post above) need to be copied to: /tos-project1/NS9034K/noresm/cases

The first and second years indicate the start and end of each period, e.g., 1320 1329 means all years from 1320 to 1329 (inclusive).

1320 1329
1330 1339
1340 1349
1350 1359
1360 1369
1370 1379
1410 1419
1420 1429
1430 1439
1440 1449

Files should be organized with the same folder structure as the original model output.

Experiments that need rsync to NIRD NS9034K are labelled with Rsync.

YanchunHe commented 4 years ago

Wondering whether you prefer to sync all model output to NS9034K or only those years that were not cmorized successfully? @matsbn

jgriesfeller commented 4 years ago

Hi, JFYI, at this point I am not part of the NS9034K group on nird and can therefore not write to /tos-project1/NS9034K/noresm/cases.

jgriesfeller commented 4 years ago

I ran a very tiny speed test for transferral between FRAM and NIRD.

Over the internet I get roughly 90 MB/s; using the NFS mount I get roughly 170 MB/s.

In any case, transferring the 25 TB of the N1850frc2_f09_tn14_20191012 directory will take a significant amount of time.

matsbn commented 4 years ago

I have added @jgriesfeller to the NS9034K project. Before Sigma2 has more disks installed, the chance of getting more space for NS9034K is very limited. It is for the same reason that we are out of space on NS9560K and are dealing with the temporary /cluster/NS9560K solution.

jgriesfeller commented 4 years ago

Thanks Mats, I can write to /tos-project1/NS9034K/noresm/cases now.

Shall I transfer the data now or not? Do we really need all 25TB?

I also wonder if I should tell Sigma2 about our experience with NFS mounts here at MET. Basically, NFSv4 showed similar problems here while NFSv3 was much more stable. What do you think?

jgriesfeller commented 4 years ago

Just to summarise what Mats and I have just talked about on the phone in conjunction with this thread: I will transfer the data for the needed years to /tos-project1/NS9034K/noresm/cases, keeping the current file structure on FRAM, but only copying the years needed. Since N1850frc2_f09_tn14_20191001 is not on FRAM anymore, I will just do that for N1850frc2_f09_tn14_20191012.

matsbn commented 4 years ago

A slightly embarrassing fact is that the first 50 years of N1850frc2_f09_tn14_20191012 are actually on NIRD already. This means the time slices 1320-1329, 1330-1339, 1340-1349 can be found under /projects/NS9560K/noresm/cases/N1850frc2_f09_tn14_20191012. I gave the path to the full dataset on /cluster/NS9560K since I initially assumed the availability of the complete dataset was more convenient for the processing. When the unstable mount issues appeared, I failed to see that part of this experiment could be more efficiently processed using the already transferred data. Sorry about that!

A significant portion (100 of 120 years) of 1pctCO2 and abrupt-4xCO2 NorESM2-MM experiments are also on NIRD. I will comment under the relevant CMOR and ESGF publishing requests about this.

matsbn commented 4 years ago

It should be 100 of 150 years already transferred to NIRD for 1pctCO2 and abrupt-4xCO2 NorESM2-MM experiments.

jgriesfeller commented 4 years ago

(Forgot to post yesterday, but for completeness I post it now anyway.) Last update for today: I have started the download for the years 1300 to 1449. I wrote a script that just goes through the years and finds the files matching each year. These are then put into a text file that rsync uses to download the files. I am not entirely sure if I caught all files needed. Yanchun, maybe you can check that at /tos-project1/NS9034K/noresm/cases/N1850frc2_f09_tn14_20191012 tomorrow. I might also test if I get more speed with several rsync instances at a time tomorrow.
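In outline the approach looks roughly like this (a sketch only, with an assumed file-name pattern, not the actual script):

# build a list of files for the requested years and feed it to rsync
case=N1850frc2_f09_tn14_20191012
src=/projects/NS9560K/FRAM/noresm/cases/$case      # FRAM data via the NFS mount
dst=/tos-project1/NS9034K/noresm/cases/$case
for year in $(seq 1300 1449); do
    find "$src" -type f -name "*${year}*" -printf '%P\n'
done > filelist.txt
rsync -av --files-from=filelist.txt "$src/" "$dst/"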

jgriesfeller commented 4 years ago

The transfer was not complete this morning; there are files missing that were in the file list, and I know now that I did not catch all files in the rest folder.

Still some work to do.

YanchunHe commented 4 years ago

A slightly embarrassing fact is that the first 50 years of N1850frc2_f09_tn14_20191012 are actually on NIRD already. This means the time slices 1320-1329, 1330-1339, 1340-1349 can be found under /projects/NS9560K/noresm/cases/N1850frc2_f09_tn14_20191012. I gave the path to the full dataset on /cluster/NS9560K since I initially assumed the availability of the complete dataset was more convenient for the processing. When the unstable mount issues appeared, I failed to see that part of this experiment could be more efficiently processed using the already transferred data. Sorry about that!

A significant portion (100 of 120 years) of 1pctCO2 and abrupt-4xCO2 NorESM2-MM experiments are also on NIRD. I will comment under the relevant CMOR and ESGF publishing requests about this.

This sounds very good! I will update the script, so that it will use local data for these available years.

YanchunHe commented 4 years ago

The transfer was not complete this morning; there are files missing that were in the file list, and I know now that I did not catch all files in the rest folder.

Still some work to do.

This 'rest' folder is not required for cmorization, as far as I know.

But it would be great if you could check whether the files are completely transferred. I have limited access to NIRD and the connection is usually not so good!

jgriesfeller commented 4 years ago

Good to know that the rest folder is not needed. This will speed things up a little.

Based on the last 12 hours I just did a very rough time estimate: the script was able to transfer ~4 TB within 12 hours, so ~8 TB per day. An LM simulation (with the rest folder) contains ~15 TB of data, an MM simulation ~25 TB.

Dirk gave me a list of simulations to cmorise with priority:

  1. the two LM (f19) scenarios left to be done:
     ssp370: NSSP370frc2_f19_tn14_20191014
     ssp370-lowNTCF: NSSP370LOWNTCFfrc2_f19_tn14_20191118

  2. there are four MM (f09) scenarios left to be done:
     ssp126: NSSP126frc2_f09_tn14_20191105
     ssp245: NSSP245frc2_f09_tn14_20191105
     ssp370: NSSP370frc2_f09_tn14_20191105
     ssp585: NSSP585frc2_f09_tn14_20191105

  3. a second and third LM member of point 1:
     ssp370 (member 2): NSSP370frc2_02_f19_tn14_20191118
     ssp370 (member 3): NSSP370frc2_03_f19_tn14_20191118
     ssp370-lowNTCF (member 2): NSSP370LOWNTCFfrc2_02_f19_tn14_20191118
     ssp370-lowNTCF (member 3): NSSP370LOWNTCFfrc2_03_f19_tn14_20191118

All in all, 6 LM simulations and 4 MM simulations: 190 TB to transfer, needing (based on the last 12 hours' transfer rate) 23.75 days for the transfer alone. According to Mats, the cmorised data needs roughly 50% of the space of the original model data, so we would need 95 TB for the cmorised data.

Can we fit all that into the 55TB quota we have left? Any comments on this?

YanchunHe commented 4 years ago

I think we can rsync two or three experiments at a time. After syncing finishes, do the cmorization, and at the same time rsync the next two or three (removing the already cmorized ones).

jgriesfeller commented 4 years ago

N1850frc2_f09_tn14_20191012: years 1350 to 1449 are at /tos-project1/NS9034K/noresm/cases/N1850frc2_f09_tn14_20191012 now

Do we have all needed years for the cmorisation now?

YanchunHe commented 4 years ago

I think so.

You can start the job:

cd /projects/NS9560K/cmor/noresm2cmor/namelists/CMIP6_NorESM2-MM/piControl
./cmor_tmp.sh -m=NorESM2-MM -e=piControl -v=v20191108 &>>logs/cmor.log.v20191108 &

YanchunHe commented 4 years ago

I corrected the data path in sys*.nml. You should kill and resubmit if you have already submitted this job.

jgriesfeller commented 4 years ago

I had not started it yet. Doing that right now on login0.

jgriesfeller commented 4 years ago

Now I can see some noresm2cmor3_mp jobs running under my user name (jang).

YanchunHe commented 4 years ago

superb!

jgriesfeller commented 4 years ago

It's even still running :-) but likely only with the data at the original location /projects/NS9560K/noresm/cases/N1850frc2_f09_tn14_20191012. But the first 50 years have to be cmorised as well anyway. I guess for the data at /tos-project1/NS9034K/noresm/cases/N1850frc2_f09_tn14_20191012 I have to start another job, right?

YanchunHe commented 4 years ago

No, in principle you don't need to start separate jobs for different data locations. Please have a look at cmor.sh (cmor_tmp.sh), v20191108b/sys*.nml and workflow/cmorRun1memb.sh; you will easily find out how to set these. See also the workflow.md file.

jgriesfeller commented 4 years ago

(base) [jang@login-nird-3 historical]$ /projects/NS9560K/cmor/noresm2cmor/namelists/CMIP6_NorESM2-MM/piControl/checkcmorout.sh
real:   r1i1p1f1
Ofx, fx, etc    8
yyyy1   yyyy2   nf
1200    1209    517
1210    1219    517
1220    1229    517
1230    1239    517
1240    1249    517
1250    1259    517
1260    1269    517
1270    1279    517
1280    1289    517
1290    1299    517
1300    1309    581
1310    1319    581
1320    1329    581
1330    1339    178
1340    1349    139
1350    1359    163
1360    1369    130
1370    1379    193
1380    1389    581
1390    1399    581
1400    1409    581
1410    1419    544
1420    1429    95
1430    1439    113
1440    1449    115
1450    1450    581
Total:      10915

Obviously not all data has been cmorised properly, but I have no idea how to correct that.

jgriesfeller commented 4 years ago

After the experience with the other runs, I decided to run the years 1330 to 1370 again to see if the output files get completed. For documentation:

(base) [jang@login-nird-0 piControl]$ pwd
/projects/NS9560K/cmor/noresm2cmor/namelists/CMIP6_NorESM2-MM/piControl
(base) [jang@login-nird-0 piControl]$ ./cmor_tmp_jang.sh -m=NorESM2-MM -e=piControl -v=v20191108 &>>logs/cmor_jang_1330_1370.log.v20191108 &

jgriesfeller commented 4 years ago

Coming back to the run in this actual thread again...

(base) [jang@login-nird-0 piControl]$ pwd
/projects/NS9560K/cmor/noresm2cmor/namelists/CMIP6_NorESM2-MM/piControl
(base) [jang@login-nird-0 piControl]$ ./checkcmorout.sh 
real:   r1i1p1f1
Ofx, fx, etc    8
yyyy1   yyyy2   nf
1200    1209    517
1210    1219    517
1220    1229    517
1230    1239    517
1240    1249    517
1250    1259    517
1260    1269    517
1270    1279    517
1280    1289    517
1290    1299    517
1300    1309    581
1310    1319    581
1320    1329    581
1330    1339    178
1340    1349    139
1350    1359    163
1360    1369    130
1370    1379    193
1380    1389    581
1390    1399    581
1400    1409    581
1410    1419    544
1420    1429    95
1430    1439    113
1440    1449    115
1450    1450    581
Total:      10915

There are quite a few files missing. Looking at the log, e.g. for 1330-1339, one can find this:

 ----------------------------
 --- Process ocean output ---
 ----------------------------

 Read grid information from input files
 WARNING: no file found for case dir|tag|year1|month1|yearn|monthn: 
 /projects/NS9560K/noresm/cases/N1850frc2_f09_tn14_20191001|micom.hm|
        1330 |           1 |        1339 |          12
 WARNING: no file found for case dir|tag|year1|month1|yearn|monthn: 
 /projects/NS9560K/noresm/cases/N1850frc2_f09_tn14_20191001|micom.hm|
        1330 |           1 |        1339 |          12
 WARNING: no file found for case dir|tag|year1|month1|yearn|monthn: 
 /projects/NS9560K/noresm/cases/N1850frc2_f09_tn14_20191001|micom.hm|
        1330 |           1 |        1339 |          12
 WARNING: no file found for case dir|tag|year1|month1|yearn|monthn: 
 /projects/NS9560K/noresm/cases/N1850frc2_f09_tn14_20191001|micom.hm|
        1330 |           1 |        1339 |          12
 WARNING: no file found for case dir|tag|year1|month1|yearn|monthn: 
 /projects/NS9560K/noresm/cases/N1850frc2_f09_tn14_20191001|micom.hm|
        1330 |           1 |        1339 |          12

It seems I have to start transferring again. Yanchun: is there a way to start the cmorisation only for the missing output files?

matsbn commented 4 years ago

It seems the script tries to read files for year 1330 of experiment N1850frc2_f09_tn14_20191001 but 1299 is the last year for that experiment. Year 1330 should be in N1850frc2_f09_tn14_20191012.

YanchunHe commented 4 years ago

Hi Jan, as Mats said, you created a new file cmor_tmp_jang.sh and specified 1330-1370 for N1850frc2_f09_tn14_20191001. This is not right.

Please see the year spans for each CaseName and make sure the year periods fall within the corresponding CaseName, using cmor.sh as a template.

YanchunHe commented 4 years ago

I changed cmor_tmp.sh; now it should work for 1330-1370. Note that 1330-1340 are from NS9560K, while 1350-1370 are from NS9034K.

So only revise years1 and years2 in place, but do not move them out of the CaseName and runcmor statements.

Please try to run:

./cmor_tmp.sh -m=NorESM2-MM -e=piControl -v=v20191108 &>>logs/cmor.log.v20191108 &

(I have a very slow connection to NIRD and GitHub at the hotel now.)

matsbn commented 4 years ago

The last 250 years of piControl completed today so I took the liberty to edit the original issue to include that last case of piControl, namely:

N1850frc2_f09_tn14_20191113, 1450-1699

I also adjusted down the end year of the preceding case N1850frc2_f09_tn14_20191012 from 1450 to 1449. Both cases have 1450 included (with identical results), but with the 10-year time slices I guess it will be cleaner this way.

jgriesfeller commented 4 years ago

Yanchun, I have understood your system now, but according to the log there were some ocean files missing in the NS9560K area. I will check that again against the files at NIRD and if necessary start the missing periods using newly transferred data in the NS9034K area. Relax and enjoy your holiday, I should have everything under control now.

YanchunHe commented 4 years ago

Sounds great, and thanks a lot!

I will keep creating new namelists/scripts and updating existing ones as needed.

Concerning the missing files, I would suggest that before you transfer data, you get the total number of files on NIRD and the total number of files to be transferred from FRAM.

After the transfer is finished, check the total number of files on NIRD again and see if the increase matches the expected number of transferred files. If they match, then start the cmorization; otherwise files are missing and we would have to restart the cmor job again.
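Concretely, something along these lines (a sketch; filelist.txt stands for whatever file list the transfer uses):

dst=/tos-project1/NS9034K/noresm/cases/N1850frc2_f09_tn14_20191012   # destination on NIRD
n_expected=$(wc -l < filelist.txt)        # number of files selected for transfer
n_before=$(find "$dst" -type f | wc -l)   # files already on NIRD before the transfer
# ... run the rsync ...
n_after=$(find "$dst" -type f | wc -l)    # files on NIRD after the transfer
echo "expected: $n_expected  actually added: $((n_after - n_before))"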

YanchunHe commented 4 years ago

The last 250 years of piControl completed today so I took the liberty to edit the original issue to include that last case of piControl, namely:

N1850frc2_f09_tn14_20191113, 1450-1699

I also adjusted down the end year of the preceding case N1850frc2_f09_tn14_20191012 from 1450 to 1449. Both cases have 1450 included (with identical results), but with the 10-year time slices I guess it will be cleaner this way.

I've updated the exp*.nml and cmor.sh for N1850frc2_f09_tn14_20191113, 1450-1699

Year 1450 has been cmorized and stored as single years. We may remove those files before we publish.

jgriesfeller commented 4 years ago

I checked the number of files on NIRD. All periods have the right number. Started the cmorisation on nird2 for 1330 to 1370.

jgriesfeller commented 4 years ago

status right now:

(base) [jang@login-nird-0 piControl]$ ./checkcmorout.sh 
real:   r1i1p1f1
Ofx, fx, etc    8
yyyy1   yyyy2   nf
1200    1209    517
1210    1219    517
1220    1229    517
1230    1239    517
1240    1249    517
1250    1259    517
1260    1269    517
1270    1279    517
1280    1289    517
1290    1299    517
1300    1309    581
1310    1319    581
1320    1329    581
1330    1339    581
1340    1349    581
1350    1359    163
1360    1369    130
1370    1379    193
1380    1389    581
1390    1399    581
1400    1409    581
1410    1419    544
1420    1429    95
1430    1439    113
1440    1449    115
1450    1450    581

Started 1350 to 1379 on login0 again.