Closed oloapinivad closed 5 years ago
With NCO I am getting plenty of errors when running the validation, but I cannot understand why. When I look into the details of a single file, everything seems fine. I will ask Jon Seddon...
(validate) [pdavini0@r000u06l01 Year_1950_NCO]$ /marconi_work/IscrB_DIXIT/ecearth3/cmorization/primavera-val/bin/validate_data.py -s rldscs_Amon_EC-Earth3P_Primavera-atm_r1i1p1f1_gr_195001-195012.nc
WARNING: File failed validation:
The points in the time dimension in the file are not contiguous: rldscs_Amon_EC-Earth3P_Primavera-atm_r1i1p1f1_gr_195001-195012.nc
ERROR: 1 files failed validation
(validate) [pdavini0@r000u06l01 Year_1950_NCO]$ cdo info rldscs_Amon_EC-Earth3P_Primavera-atm_r1i1p1f1_gr_195001-195012.nc
-1 : Date Time Level Gridsize Miss : Minimum Mean Maximum : Parameter ID
1 : 1950-01-16 12:00:00 0 131072 0 : 32.168 260.07 452.62 : -1
2 : 1950-02-15 00:00:00 0 131072 0 : 57.797 259.70 451.82 : -1
3 : 1950-03-16 12:00:00 0 131072 0 : 56.102 259.96 435.76 : -1
4 : 1950-04-16 00:00:00 0 131072 0 : 31.353 263.37 443.76 : -1
5 : 1950-05-16 12:00:00 0 131072 0 : 30.210 270.96 446.55 : -1
6 : 1950-06-16 00:00:00 0 131072 0 : 40.631 278.94 454.33 : -1
7 : 1950-07-16 12:00:00 0 131072 0 : 30.421 281.32 498.45 : -1
8 : 1950-08-16 12:00:00 0 131072 0 : 30.593 281.16 486.36 : -1
9 : 1950-09-16 00:00:00 0 131072 0 : 23.926 272.30 462.63 : -1
10 : 1950-10-16 12:00:00 0 131072 0 : 45.330 265.33 440.96 : -1
11 : 1950-11-16 00:00:00 0 131072 0 : 9.7350 261.43 440.33 : -1
12 : 1950-12-16 12:00:00 0 131072 0 : 37.959 260.64 425.64 : -1
cdo info: Processed 1572864 values from 1 variable over 12 timesteps ( 0.13s )
@oloapinivad: sorry, our computer is down right now, cannot check the NCO version that I have used.
Missing compression is a big issue: a quick test shows that the size reduction is about 50% when compressing an atmosphere field with pressure levels. We definitely cannot live without compression! If we cannot use cdo, it might be possible to add another compression step after concatenation; we might be able to do something with `ncks`. But I am wondering whether we really want to put a lot of effort into the concatenation of cmorized files at this stage: @goord is working on an update that would process one year of IFS output in one go and thus make the concatenation superfluous.
@oloapinivad: what kind of time axis are you using in the file that triggers the compliance checker, absolute or relative time? `cdo sinfo` will tell you if there is a RefTime (there should be one). And how about the time bounds, do they look OK?
@klauswyser: actually my compression discussion was unclear, i.e. CDO provides extra compression (I think because it optimizes the compression by exploiting the longer time series in yearly files), while NCO simply concatenates the previously compressed data produced by ece2cmor3. Unfortunately I urgently need to find a reasonable concatenation method, since Marconi will be shut down for several weeks and I have to produce PRIMAVERA output quickly (and the archive architecture of Marconi has a cap on the number of files).
I am using Jon Seddon's script for validation (https://github.com/PRIMAVERA-H2020/primavera-val); I need to see how it works. CDO-produced files all pass the validation, while 166/193 NCO-produced files fail (sigh!). All the metadata produced with NCO seem identical to the original (see the example above), so I really cannot tell where the issue is coming from.
> CDO provides extra compression (I think because it optimizes the compression by exploiting the longer time series in yearly files) while NCO simply concatenates the previously compressed data produced by ece2cmor3.
Are you sure about that? My experience is that concatenation with `ncrcat` doesn't compress the resulting file: it uncompresses the files produced by ece2cmor3 and then concatenates the uncompressed data. `cdo cat` will do the compression when used with the options `-f nc4c -z zip`; otherwise it saves an uncompressed file. You can check with `cdo sinfo` whether your netcdf files are compressed or not.
Well, I double-checked and I can confirm that NCO maintains the original compression:
[pdavini0@r000u06l01 cmorized]$ cdo sinfo Year_1950_CDO/tas_Amon_EC-Earth3P_Primavera-atm_r1i1p1f1_gr_195001-195012.nc
File format : NetCDF4 ZIP
[pdavini0@r000u06l01 cmorized]$ cdo sinfo Year_1950_NCO/tas_Amon_EC-Earth3P_Primavera-atm_r1i1p1f1_gr_195001-195012.nc
File format : NetCDF4 classic ZIP
[pdavini0@r000u06l01 cmorized]$ cdo sinfo Year_1950/CMIP6/PRIMAVERA/EC-Earth-Consortium/EC-Earth3P/Primavera-atm/r1i1p1f1/Amon/tas/gr/v20180918/tas_Amon_EC-Earth3P_Primavera-atm_r1i1p1f1_gr_195001-195001.nc
File format : NetCDF4 classic ZIP
The difference is that CDO moves to NetCDF4 instead of NetCDF4 classic, which provides an extra gain in space, as you can see here:
[pdavini0@r000u06l01 cmorized]$ du -sh Year_1950*
111G Year_1950
89G Year_1950_CDO
109G Year_1950_NCO
I will check how CDO's compression works with NetCDF4 classic; perhaps it does not affect the metadata?
I am using NCO 4.2.0, and with that version the result after `ncrcat` is uncompressed.
That probably explains the difference.
I explored `cdo -f nc4c -z zip`, i.e. NetCDF4 classic, and I get the same compromised metadata together with the same 20 GB gain in disk space.
So I will probably go with CDO for the moment, even though I am aware that the stream2 cluster1 control-1950 simulation should fulfil CMIP6 criteria, which means I will need to re-cmorize the data later on.
Paolo, another option is not to concatenate at the moment for PRIMAVERA. You can concatenate on JASMIN as well, or submit only monthly files to the PRIMAVERA DMT and concatenate correctly when we submit to HighResMIP.
@goord you are right, but I need to find a way to store the data on the Marconi HPC, and I need to reduce the number of files in order to do so. The alternative is to tar.gz them now and concatenate later. However, as far as I understand your strategy, you are aiming at concatenating data within ece2cmor3, so a "posthumous" concatenation will not be possible even in the future. Am I wrong?
Yes, I am trying to do yearly post-processing + cmorization tasks, and I believe this feature will be used for CMIP6. But the result should be the same as an a-posteriori concatenation. The branch will only be merged in one or two weeks (I think rather 2), so if you're in a hurry go with the tar.gz option.
I will go with the tar.gz if you endorse it. However, I am investigating the issue that I have with the validation tool with Jon Seddon: https://github.com/PRIMAVERA-H2020/primavera-val/issues/16 I will let you know if I understand where the NCO-related issue comes from.
Here is some input for the research about NCO: `ncrcat` just copies the data directly from the input files. It copies the relevant metadata (i.e., the scale_factor and add_offset attributes) from the first file.
The interesting aspect here is the RefTime: in my case the RefTime is taken from the first file, but the time steps relative to the RefTime come from the various .nc files. My first monthly-mean file has RefTime Jan 1 and time step 15.5 (days), the 2nd file has RefTime Feb 1 and time step 14.0, the 3rd file has RefTime Mar 1 and time step 15.5, and so on. When concatenating, the resulting file gets RefTime Jan 1 (from the 1st file) and time = 15.5, 14, 15.5, 15.0, 15.5, ... This is where the time axis gets garbled! If we could set the RefTime in the nc files this problem would be solved. Is there an option for that, e.g. in the metadata used by `ece2cmor.py`?
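To make the mechanism concrete, here is a minimal Python sketch (my own illustration, not part of any of the tools discussed; pure stdlib, using the monthly values quoted above) of why concatenating the raw time values under the first file's RefTime garbles the axis, and what rebasing every file onto a common reference date would give instead:

```python
from datetime import date

# Each monthly file: (its own RefTime, time value in days since that RefTime).
# Values taken from the example above (monthly means, Jan-Mar).
files = [(date(1950, 1, 1), 15.5),
         (date(1950, 2, 1), 14.0),
         (date(1950, 3, 1), 15.5)]

# Naive concatenation: keep the raw per-file time values but only the first
# file's RefTime -> the axis is no longer monotonic.
naive = [t for _, t in files]

# Rebased: express every value as days since the first file's RefTime.
ref = files[0][0]
rebased = [(d - ref).days + t for d, t in files]

print(naive)    # [15.5, 14.0, 15.5]  <- garbled
print(rebased)  # [15.5, 45.0, 74.5]  <- monotonic, as expected
```

This is effectively what setting a single reference date for all monthly files achieves: all time values end up on one common axis before concatenation.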
Just noticed that `ece2cmor.py` has an option to set the reference date:
--refd YYYY-mm-dd Reference date (for atmosphere data), by default the start date (default: None)
Actually, with the more recent version of NCO I am using, what happens is different — see also the discussion I am having with Jon Seddon. The time variable is OK now (indeed I do not see the garbled time axis that you have), but the time_bnds are built by simply concatenating those of each file. This fails the validation, even though the files are apparently OK.
Regarding the reference time, I guess this should be set!
I am looking at 1951 data and I have for January:
time:units = "days since 1951-01-01 00:00:00" ;
and for February
time:units = "days since 1951-02-01 00:00:00" ;
Perhaps this is one of the issues!
I just got NCO 4.6.3 installed on our new computer, and my concatenation problem with time stamps relative to the 1st of each month is gone. But the problem with time_bnds remains:
time_bnds =
0, 31,
0, 28,
0, 31 ;
This is for a file with Jan, Feb and Mar, so the time bounds should be something like this:
time_bnds =
0, 31,
31, 59,
59, 90 ;
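For reference, the contiguous bounds can be derived with a tiny Python sketch (my own illustration; assumes a non-leap year and bounds expressed in days since Jan 1):

```python
# Month lengths for Jan-Mar of a non-leap year.
month_lengths = [31, 28, 31]

# Contiguous time_bnds: each interval starts where the previous one ends.
bounds, start = [], 0
for n in month_lengths:
    bounds.append((start, start + n))
    start += n

print(bounds)  # [(0, 31), (31, 59), (59, 90)]
```

The broken file instead restarts every interval at 0, because each monthly file's bounds are relative to its own reference date.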
I am testing right now whether this gets better if I set `--refd $y-01-01` when processing each month of a year.
@klauswyser I have just tested the same, but setting `--refd $firstyear-01-01` (this way we can also think about further concatenation in the future).
time_bnds are now correct:
data:
time_bnds =
0, 31,
31, 59,
59, 90,
90, 120,
120, 151,
151, 181,
181, 212,
212, 243,
243, 273,
273, 304,
304, 334,
334, 365 ;
}
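The validator's "not contiguous" complaint presumably amounts to a check like the following sketch (my own paraphrase, not primavera-val's actual code), applied here to the corrected bounds printed above:

```python
# time_bnds of the corrected yearly file (from the ncdump output above).
bnds = [(0, 31), (31, 59), (59, 90), (90, 120), (120, 151), (151, 181),
        (181, 212), (212, 243), (243, 273), (273, 304), (304, 334), (334, 365)]

# Contiguity: every interval must start exactly where the previous one ends.
contiguous = all(b[0] == a[1] for a, b in zip(bnds, bnds[1:]))
print(contiguous)  # True
```

With the old per-file bounds (every row starting at 0), this check fails, which matches the validation error reported at the top of the thread.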
Overall, the concatenation with NCO is now smooth, and Jon Seddon's validation seems to run without any error (it is long, still running).
The only issue I see is that dimension bnds has disappeared in my new NCO-made file:
[pdavini0@r000u06l01 cmorized]$ diff jan.txt nco.txt
1c1
< netcdf zg_6hrPlevPt_EC-Earth3P_Primavera-atm_r1i1p1f1_gr_195001010000-195001311800 {
---
> netcdf zg_6hrPlevPt_EC-Earth3P_Primavera-atm_r1i1p1f1_gr_195001010000-195012311800 {
3c3
< time = UNLIMITED ; // (124 currently)
---
> time = UNLIMITED ; // (1460 currently)
7d6
< bnds = 2 ;
Do we care about this?
Validation of the yearly files is successful: NCO (`ncrcat`) 4.6.4 seems to be a working solution with `--refd $firstyear-01-01`:
DEBUG: 192 files found.
DEBUG: All files successfully validated.
Actually we may even put 1850-01-01 as the reference date (as we had for all CMIP5 simulations); that way it becomes pretty simple to concatenate historical and scenario runs. For PRIMAVERA experiments the RefDate could of course also be 1950-01-01.
BTW: with NCO 4.6.3 the bnds dimension doesn't disappear when concatenating, but the order of the dimensions is changed when printing the header with ncdump.
BTW: I just noticed that the NEMO files produced by `ece2cmor.py` have
time:units = "days since 1900-01-01 00:00:00" ;
Should we take the same RefDate for the atmosphere?
I agree with @klauswyser , we should use the same reference date.
@goord Would it be possible to extend the `--refd YYYY-mm-dd` flag to oceanic data as well?
Sure
Hi @klauswyser, I just merged the IFS yearly processing branch and added the feature you requested, i.e. the reference date is now also used for the NEMO cmorization.
As discussed with @klauswyser and @goord we are testing concatenation of monthly IFS files into yearly files.
Long story short, CDO (both the cat and mergetime commands) cannot be used since it modifies the metadata of the files. NCO (the ncrcat command) showed some issues on @klauswyser's data, but it seems to work fine now that most of the issues with missing records are solved (see #245).
Here is an example:
Putting ncdump outputs in text files
CDO vs. Original data
NCO vs. original data
It seems evident that, at least for this case, using NCO 4.6.4 ncrcat preserves the metadata. NCO is also several times faster than CDO; however, it does not apply zip compression to the whole file but only performs the concatenation, meaning that we lose a potential improvement in file size. Indeed, one year of IFS-only T255L91 data is 110 GB with NCO and 90 GB with CDO. @klauswyser, which version of NCO were you using?