EC-Earth / ece2cmor3

Post-processing and cmorization of ec-earth output
Apache License 2.0

Grib filtering really slow #640

Closed jpkeskinen closed 4 years ago

jpkeskinen commented 4 years ago

Hi,

I'm trying to cmorise the AerChemMIP historical run, but the processing of the IFS output takes a lot of time. The grib filtering part seems to be the main problem, with the last task taking about 41 hours. The whole ece2cmor run for IFS takes about 13 days. Also, the time required by the grib filtering appears to increase during the process.

Our computing centre is guessing that the problem could be due to the parallelisation, as ece2cmor currently runs on only one core on our machine. Which parts of ece2cmor are supposed to work in parallel?

I am currently using version 1.5. I used version 1.4 earlier and it worked much faster with (almost) the same data.

Any ideas on how to get ece2cmor to work a bit faster?

jpkeskinen commented 4 years ago

Here's a log of an ece2cmor run. HIS2-ifs-001-20200612023731.log

treerink commented 4 years ago

Hi Jukka-Pekka,

First some questions:

So how long does the cmorisation of 1 year of IFS data take?

Are you cmorising model level data? I guess not? If not, did you set --skip_alevel_vars?

The latest release of ece2cmor3 is v1.4; only the latest master is further ahead. The next version, v1.5, has not been released yet.

jpkeskinen commented 4 years ago
goord commented 4 years ago

Hi @jpkeskinen what kind of storage is your raw output on? It's not some network-mounted disc, right?

jpkeskinen commented 4 years ago

Hi Gijs,

The raw EC-Earth output is on our scratch. I couldn't find any further specifics other than it is a "Lustre parallel storage system".

goord commented 4 years ago

Can you do a speed test and copy a year of data from your model output location to your temporary data folder (the one you specify in the ece2cmor3 invocation)?
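
A rough sketch of such a timing test (all paths are placeholders for your own raw-output and temporary locations):

RAW=/scratch/<project>/ece-output/HIS2/ifs/001   # one year/leg of raw IFS output (placeholder)
TMP=/local_scratch/<user>/ece2cmor-tmp           # temp dir given to ece2cmor3 (placeholder)
mkdir -p "$TMP"
time cp "$RAW"/ICM??HIS2+?????? "$TMP"/          # time the transfer of the GRIB files only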

goord commented 4 years ago

Hmm, strange, Lustre should work fine. The filtering is the slow part of the processing, but it shouldn't take more than 15 min for standard-resolution IFS on Lustre... but please do the copy of the grib files, just to be sure it's not an issue with your storage.

jpkeskinen commented 4 years ago

I've tried two different locations for the temporary data: the scratch and the local scratch. The latter is supposed to be faster, and I currently have ece2cmor running on it (almost 4 days gone and still filtering the grib files).

Copying a year of raw ifs data takes 17 min 43.200 s to the regular scratch and 8 min 28.244 s to the local scratch.

goord commented 4 years ago

From your log I see that the post-processing and cmorization also take an abnormally long time. Which machine are you running on?

jpkeskinen commented 4 years ago

The computer is CSC's Puhti. It's an Atos cluster with Intel Xeon (Cascade Lake) processors at 2.1 GHz.

goord commented 4 years ago

Hi @jpkeskinen maybe it is something with the conda installation? Can you use the conda-installed cdo to do a simple task with it (select some grib code and do e.g. a spectral transform)? So in your submit script, first activate the ece2cmor conda environment and then run cdo -f nc -monmean -sp2gpl -selcode,130 <output-file> ta.nc.

It may also be that you are using too many threads, but that shouldn't affect the filtering too much, since it uses only 2 threads.
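
For reference, a minimal version of that check as it might look in a submit script (the conda environment name and the GRIB file name are only examples):

source activate ece2cmor3                              # example environment name
which cdo                                              # confirm this is the conda-installed cdo
time cdo -f nc -monmean -sp2gpl -selcode,130 ICMSHHIS2+185007 ta.nc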

jpkeskinen commented 4 years ago

The cdo command gives me an error:

Warning (cgribexScanTimestep2): Record 104 (id=133.128 lev1=100000 lev2=0) timestep 2: Parameter not defined at timestep 1!
cdo(3) selcode: Open failed on >ICMGGHIS2+185007<
Unsupported file structure

jpkeskinen commented 4 years ago

I managed to work around the problem with grib_filter (thanks to Tommi). I now have two grib files, one presumably for model levels and the other for pressure levels. The cdo command takes about 1 second for the pressure level file and about 2 minutes for the model level file.
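
For readers who want to reproduce that kind of split, here is a sketch of what a grib_filter run along these lines could look like (the key, rules file and output names are illustrative; the actual workaround may have been set up differently):

cat > split.rules <<'EOF'
# write hybrid (model) levels to one file, everything else to another
if (typeOfLevel is "hybrid") {
  write "model_levels.grb";
} else {
  write "other_levels.grb";
}
EOF
grib_filter split.rules ICMGGHIS2+185007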

etiennesky commented 4 years ago

in the end what was the solution? It might be helpful for others in the same situation

jpkeskinen commented 4 years ago

in the end what was the solution? It might be helpful for others in the same situation

The original problem is still there. I was just referring to the cdo test (cdo -f nc -monmean -sp2gpl -selcode,130 <output-file> ta.nc) that Gijs requested.

Declan-FMI commented 4 years ago

I have the same problem as Jukka-Pekka, both on puhti and on cca at ECMWF. On cca, even when modifying only the paths in ece2cmor/scripts/submit_script_examples/submit-at-cca-ece2cmor-leg-job.sh, which Thomas says works for him, and specifying --skip-alevel-vars (for test purposes), I still get the same exponentially growing execution times in grib_filter.

treerink commented 4 years ago

@plesager you are cmorising similar ECE-AerChem results on cca with reasonable speed, right?

goord commented 4 years ago

Ok, it seems I broke the performance of the grib filter in the previous release... I will have a look at it tomorrow.

plesager commented 4 years ago

@plesager you are cmorising similar ECE-AerChem results on cca with reasonable speed, right?

Cmorizing one year of IFS (including model levels) with 1.4, I have:

Because I cmorize leg N-2 when running leg N, I've not paid too much attention to the performance (as long as the cmorization is faster than EC-Earth I'm happy), and I do not have the numbers from the previous set of runs I did with 1.3 (I vaguely remember being closer to 3 hours for one leg of IFS).

goord commented 4 years ago

Perhaps @Declan-FMI and @jpkeskinen you are using too few or too many threads? Can you post the submission script?

jpkeskinen commented 4 years ago

The 13-day run was done with a single process, because I was unable to get any clear advantage from using more. For this reason I suspect I am doing something wrong regarding the parallelisation of ece2cmor.

I am not that familiar with threads but I am trying different things. The submission script I used for the currently running jobs (5 and ½ days gone and still filtering the grib files) is here: ece2cmor-log-array-job-historical-ifs.sh.txt

goord commented 4 years ago

Hi @jpkeskinen what happens if you omit the '#SBATCH -n 1' line? This may confuse the workload manager; it can deduce from your -c option how many resources to allocate. You should give ece2cmor3 at least 2 processes, because that allows the grib filter to process the gridpoint file and the spectral file in parallel.

If the problem persists you will need to profile the application on your system; it will be an interesting exercise!
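
A sketch of what the suggested resource request could look like (job name, walltime, memory and core count are placeholders; the point is to drop the explicit -n 1 and let -c describe the cores ece2cmor3 may use):

#!/bin/bash
#SBATCH --job-name=ece2cmor-ifs      # placeholder
#SBATCH --time=24:00:00              # placeholder
#SBATCH --cpus-per-task=8            # at least 2, so the gridpoint and spectral files can be filtered in parallel
#SBATCH --mem-per-cpu=4G             # placeholder
# ... then activate the conda environment and call ece2cmor as in the existing submit script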

Declan-FMI commented 4 years ago

Here are the scripts I used on cca / Puhti. I have tried cutting the number of cores, increasing the memory per core, using eccodes instead of grib-api, and reverting to ece2cmor 1.3.0 (yes, I know it is not compatible with EC-Earth 3.3.2.1); none of those helped.

Module environment on Puhti:

Currently Loaded Modules:
 1) intel/18.0.5
 2) intel-mkl/2018.0.5
 3) hdf/4.2.13
 4) cdo/1.9.7.1
 5) nco/4.9.3
 6) ncview/2.1.7
 7) hdf5/1.10.4-mpi
 8) netcdf/4.7.0
 9) netcdf-fortran/4.4.4
10) grib-api/1.24.0
11) hpcx-mpi/2.4.0

and on cca:

Currently Loaded Modulefiles:
 1) modules/3.2.10.5(default)
 2) eswrap/1.3.3-1.020200.1280.0
 3) switch/1.0-1.0502.60522.1.61.ari
 4) intel/17.0.3.053(default)
 5) craype-broadwell
 6) craype-network-aries
 7) craype/2.5.9(default)
 8) pbs/13.0.403.161593
 9) cray-mpich/7.5.3(default)
10) cdt/17.03
11) cray-libsci/16.11.1(default)
12) udreg/2.3.2-1.0502.10518.2.17.ari
13) ugni/6.0-1.0502.10863.8.29.ari
14) pmi/5.0.11(default)
15) dmapp/7.0.1-1.0502.11080.8.76.ari
16) rca/1.0.0-2.0502.60530.1.62.ari
17) atp/2.1.0(default)
18) PrgEnv-intel/5.2.82
19) jasper/1.900.1(default)
20) grib_api/1.24.0(nov17)
21) nco/4.9.2(jun20:new:default)
22) subversion/1.13.0(new:default)
23) python/2.7.15-01(jun20:new:jun19:default)
24) git/2.26.0(jun20:new:default)
25) cray-netcdf-hdf5parallel/4.4.1.1(default)
26) cray-hdf5-parallel/1.10.0.1(default)
27) cdo/1.9.8(jun20:new:default)

cca-ece2cmor.sh.txt cmorize_hist_ens3.sh.txt

jpkeskinen commented 4 years ago

Hi @jpkeskinen what happens if you omit the '#SBATCH -n 1' line? This may confuse the workload manager; it can deduce from your -c option how many resources to allocate. You should give ece2cmor3 at least 2 processes, because that allows the grib filter to process the gridpoint file and the spectral file in parallel.

I can try this next and I'll report back what happens. However, I'm starting my holidays today and will be back in August. Thank you for the help so far!

goord commented 4 years ago

@plesager can you spot the difference with your setup? Are you sure you are actually running the 1.4 version?

@jpkeskinen I am also leaving for holidays (right now, actually). When I get back on the 10th I will have a look at the issue at ECMWF. @Declan-FMI I propose you make your output files (or at least one leg) at ECMWF readable to all so I can run the cmorization myself and profile the application. You can also try to use the parallel queue and see if it is faster. I can start on this when I get back, i.e. Monday the 13th.

Declan-FMI commented 4 years ago

Output files under /scratch/ms/fi/fi2k/aerchemmip/HPI2/output should be readable (in fact my whole $SCRATCH)

plesager commented 4 years ago

I confirm that I'm using 1.4 on cca. From the log:

2020-07-03 09:15:13 INFO:ece2cmor3.ece2cmorlib: Metadata attribute history: processed by ece2cmor v1.4.0, git rev. local unknown branch

plesager commented 4 years ago

That cca-ece2cmor.sh.txt is not easy to compare with my script. Could you point me to a script it created (one of its ${job_name})? I spotted one difference though: I have this line in my script: export HDF5_USE_FILE_LOCKING=FALSE. Not sure if that makes any difference.

Declan-FMI commented 4 years ago

I have overwritten the generated script a few times today, and it is currently running in an effort to profile the grib_filter part of ece2cmor3. So the script below differs a little from what was generated by the script I posted earlier (I just added some things to set up Extrae: module load papi extrae and the 6 lines below it). cmor_HPI2-ifs-001.sh.txt

plesager commented 4 years ago

Ok, so another difference, but again I doubt it changes anything. I do not have:

export OMP_NUM_THREADS=18

Declan-FMI commented 4 years ago

No, that last one I have tried with and without; it makes no difference whatsoever. Unsurprisingly, since OpenMP does not enter into the equation at any point.

Declan-FMI commented 4 years ago

Still no progress. On cca, Philippe has kindly shared his post-processing script, but despite using it with minimal adaptation, and installing the same conda version as Philippe uses (4.7.12), I still get these huge increases in grib_filter times. In the last attempt, the second ICMSH file took 1 hour.

goord commented 4 years ago

OMP_NUM_THREADS should not make any difference; ece2cmor3 uses subprocesses, not OpenMP threads. I will have a look this afternoon, Declan.

goord commented 4 years ago

I see the same bottleneck, Declan. I believe it is the model levels in the raw output; Philippe, do you have any of those? I am working on a solution to 'fast-forward' through model level blocks if they aren't requested.

plesager commented 4 years ago

Yes, you can check the output in: /scratch/ms/nl/nm6/ECEARTH-RUNS/onep/output/ifs/NNN with NNN > 113. The corresponding control output dir is: /scratch/ms/nl/nm6/3.3.2.1-aerchemmip-RT/classic/ctrl/cmip6-output-control-files-pextra/CMIP/EC-EARTH-AerChem/cmip6-experiment-CMIP-1pctCO2

goord commented 4 years ago

The main issue here seemed to be a memory cleanup bug in the calls to the grib api. I have also included a 'fast-forward' feature to skip model level blocks in the input grib files when they are not requested. I am doing the final testing now in the filter-fast-forward branch.

goord commented 4 years ago

My tests seem to be ok, @jpkeskinen and @Declan-FMI can you test the filter-fast-forward branch?

Declan-FMI commented 4 years ago

Now running on cca and Puhti (the Finnish supercomputing centre's Bull machine). Let's see... If I understand rightly, this should speed up cmorization where model level data is not requested, but if it is requested we should not expect to see any difference. Or did I get something wrong?

goord commented 4 years ago

The memory issue I solved should make everything faster. The fast-forward feature should help when model-level data is not requested (--skip-alevel-vars) but is present in your raw grib files. By the way, you can monitor the filtering a bit by just listing the file sizes in the temp directory.
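
One way to do that monitoring, assuming the temporary directory path below is replaced by your own:

watch -n 60 'ls -lh /local_scratch/<user>/ece2cmor-tmp'   # file sizes should keep growing while filtering
du -sh /local_scratch/<user>/ece2cmor-tmp                 # single snapshot of the total size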

Declan-FMI commented 4 years ago

It is clear even from the tail of the log files that this is running much faster now. Both on puhti and cca.

Declan-FMI commented 4 years ago

Cmorization appears to have completed successfully on both Puhti and cca in a very acceptable time. I haven't made any detailed check of the results (which is why I wrote 'appears to'), but at least the present issue looks to be solved.

treerink commented 4 years ago

I got identical results with the filter-fast-forward branch for my test-all case, see #542. I was about to merge this branch into the master, but I saw that the latest Travis test of this branch failed, which held me back from merging.

goord commented 4 years ago

Tests have been fixed; the branch has been merged (squashed).