Weiming-Hu / AnalogsEnsemble

The C++ and R packages for parallel ensemble forecasts using Analog Ensemble
https://weiming-hu.github.io/AnalogsEnsemble/
MIT License

AnEn + MPI + TAU program aborted due to error #28

Closed Weiming-Hu closed 5 years ago

Weiming-Hu commented 5 years ago

I compiled the code with the following command.

CC=tau_cc.sh CXX=tau_cxx.sh cmake -DENABLE_MPI=ON -DCMAKE_PREFIX_PATH=/home/graduate/wuh20/packages/release/ -DBOOST_TYPE=SYSTEM -DCMAKE_BUILD_TYPE=Debug ..
make -j 16

I encountered the following error.

OMP_NUM_THREADS=3 mpirun -np 1 /home/graduate/wuh20/github/AnalogsEnsemble/output/bin/standardDeviationCalculator -v 6 -i /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201712.nc /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201801.nc -o ~/exfat-hu/Data/2019_Hu_AnEn-bias-correction/sds/sds-0001.nc --start 0 0 0 0 0 0 0 0 --count 17 100 31 53 17 100 31 53

Parallel Ensemble Forecasts --- Standard Deviation Calculator v 3.2.1
Copyright (c) 2018 Weiming Hu @ GEOlab
Input parameters:
in_files: /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201712.nc,/home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201801.nc,
out_file: /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/sds/sds-0001.nc
verbose: 6
config_file: 
start: 0,0,0,0,0,0,0,0,
count: 17,100,31,53,17,100,31,53,
Checking mode ...
Checking file (/home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/sds/sds-0001.nc) ...
Combining forecasts along the time dimension...
Checking mode ...
Checking file (/home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201712.nc) ...
Checking file type (Forecasts) ...
Checking dimension (num_parameters) ...
Checking dimension (num_stations) ...
Checking dimension (num_times) ...
Checking dimension (num_flts) ...
Checking dimension (num_chars) ...
Checking variable (Data) ...
Checking variable (FLTs) ...
Checking variable (Times) ...
Checking variable (ParameterNames) ...
Checking variable (Xs) ...
Checking variable (Ys) ...
Processing partial meta information ...
Reading Parameters from file (/home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201712.nc) ...
Reading dimension (num_parameters) length ...
Checking variable (ParameterCirculars) ...
Checking variable (ParameterWeights) ...
Reading Stations from file (/home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201712.nc) ...
Reading dimension (num_stations) length ...
Spawning 3 processes to read StationNames ...
Broadcasting variables ...
Child rank #0 received from the parent's broadcast ...
Child rank #1 received from the parent's broadcast ...
Child rank #2 received from the parent's broadcast ...
Child rank #0 reading StationNames with start/count ( 0,33 0,50 ) ...
Child rank #2 reading StationNames with start/count ( 66,34 0,50 ) ...
Child rank #1 reading StationNames with start/count ( 33,33 0,50 ) ...
Parent waiting to gather data from processes ...
Rank #0 sending data (1650) back to the parent ...
Rank #2 sending data (1700) back to the parent ...
Rank #1 sending data (1650) back to the parent ...
[sapphire:02637] *** Process received signal ***
[sapphire:02637] Signal: Segmentation fault (11)
[sapphire:02637] Signal code: Address not mapped (1)
[sapphire:02637] Failing at address: (nil)
[sapphire:02637] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x1288f)[0x7f93c43cf88f]
[sapphire:02637] [ 1] mpiAnEnIO(MPI_Gatherv+0x120)[0x56347a6ffc00]
[sapphire:02637] [ 2] mpiAnEnIO(main+0x10ef)[0x56347a63ab1d]
[sapphire:02637] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe6)[0x7f93c3fedb96]
[sapphire:02637] [ 4] mpiAnEnIO(_start+0x29)[0x56347a638f09]
[sapphire:02637] *** End of error message ***
Reading Times from file (/home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201712.nc)   ...
Reading dimension (num_times) length ...
Reading FLTs from file (/home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201712.nc) ...
Reading dimension (num_flts) length ...
Combining times ...
...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node sapphire exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
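
The start/count pairs and payload sizes in the log above (33, 33, and 34 station names of 50 characters, giving 1650, 1650, and 1700 elements) are consistent with an even split of 100 rows across 3 child ranks, with the last rank taking the remainder. The following is a minimal sketch of that bookkeeping, assuming this splitting rule; `Layout` and `make_layout` are hypothetical names and are not taken from the AnalogsEnsemble source. It computes the per-rank receive counts and displacements in the form a call such as `MPI_Gatherv` expects on the receiving side.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper for illustration only; not part of AnalogsEnsemble.
struct Layout {
    std::vector<int> starts;      // first row read by each rank
    std::vector<int> rows;        // number of rows read by each rank
    std::vector<int> recvcounts;  // number of elements each rank sends back
    std::vector<int> displs;      // offset of each rank's chunk in the gathered buffer
};

// Split num_rows rows of row_len elements across num_ranks ranks.
// Assumption: rows are divided evenly and the last rank absorbs the remainder,
// which matches the start/count values printed in the log.
Layout make_layout(int num_rows, int row_len, int num_ranks) {
    assert(num_ranks > 0 && num_rows >= 0 && row_len >= 0);

    Layout layout;
    const int base = num_rows / num_ranks;
    int offset = 0;

    for (int rank = 0; rank < num_ranks; ++rank) {
        const int start = base * rank;
        const int n = (rank == num_ranks - 1) ? num_rows - start : base;

        layout.starts.push_back(start);
        layout.rows.push_back(n);
        layout.recvcounts.push_back(n * row_len);
        layout.displs.push_back(offset);

        offset += n * row_len;
    }
    return layout;
}
```

Calling `make_layout(100, 50, 3)` reproduces the numbers in the log: starts 0, 33, 66; row counts 33, 33, 34; and receive counts 1650, 1650, 1700. Since the counts in the log look correct, the segmentation fault inside `MPI_Gatherv` is more likely caused by how the gather is invoked than by these sizes.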

Weiming-Hu commented 5 years ago

The same problem occurs when I run the following command.

OMP_NUM_THREADS=3 mpirun -np 1 /home/graduate/wuh20/github/AnalogsEnsemble/output/bin/analogGenerator --test-forecast-nc /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201802.nc --search-forecast-nc /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201710.nc --observation-nc /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201710.nc --members 5 --analog-nc analogs.nc -v 6

Weiming-Hu commented 5 years ago

This post might shed some light on the issue. The problem might be related to the use of intra-communicators versus inter-communicators.

Weiming-Hu commented 5 years ago

This issue has been resolved in commit 10744ae.