E3SM-Project / scorpio

A high-level Parallel I/O Library for structured grid applications

use scorpio in mitgcm #504

Closed xiaohumengdie closed 1 year ago

xiaohumengdie commented 1 year ago

Hi! I used scorpio in the MITgcm model and wrote the output with ADIOS. With the settings below, each rank wrote a data.rank file in the BP directory (5126 ranks, 5126 data files), and the conversion tool reported the error below. How can I use the ADIOS2 format correctly? Since pio_numiotasks is 213, shouldn't there be 213 BP files?

Many thanks in advance!

scorpio settings:

pio_netcdf_format = "64bit_offset"
pio_numiotasks = 213
pio_rearranger = 1
pio_root = 1
pio_stride = 24
pio_typename = "adios"
pio_rearr_comm_fcd = "2denable"
pio_rearr_comm_type = "p2p"
pio_rearr_comm_enable_hs_io2comp = .false.
pio_rearr_comm_max_pend_req_io2comp = 65
pio_rearr_comm_enable_isend_io2comp = .true.
pio_rearr_comm_enable_hs_comp2io = .true.
pio_rearr_comm_max_pend_req_comp2io = 64
pio_rearr_comm_enable_isend_comp2io = .false.

mpirun -np 4 adios2pio-nm.exe --bp-file=/home/polar/adios_test/ocean.0000263280.nc.bp --rearr=box --pio-format=pnetcdf --verbose

ADIOS ERROR: [Fri Jan 13 13:23:39 2023] [ADIOS2 EXCEPTION] : couldn't open file /home/polar/adios_test/ocean.0000263280.nc.bp/data.987, in call to POSIX open: errno = 24: Too many open files : iostream error
ADIOS ERROR: [Fri Jan 13 13:23:39 2023] [ADIOS2 EXCEPTION] : couldn't open file /home/polar/adios_test/ocean.0000263280.nc.bp/data.985, in call to POSIX open: errno = 24: Too many open files : iostream error
ADIOS ERROR: [Fri Jan 13 13:23:39 2023] [ADIOS2 EXCEPTION] : couldn't open file /home/polar/adios_test/ocean.0000263280.nc.bp/data.989, in call to POSIX open: errno = 24: Too many open files : iostream error
ADIOS ERROR: [Fri Jan 13 13:23:39 2023] [ADIOS2 EXCEPTION] : couldn't open file /home/polar/adios_test/ocean.0000263280.nc.bp/data.989, in call to POSIX open: errno = 24: Too many open files : iostream error
exception: ConvertBPFile error.
exception: ConvertBPFile error.
exception: ConvertBPFile error.
exception: ConvertBPFile error.

jayeshkrishna commented 1 year ago

At first glance it looks like you are hitting the limit on the number of files that can be opened by a process/user. You could try running the conversion tool with a larger number of processes (mpirun -np 32 adios2pio-nm.exe ..., or 64 processes) or increasing your ulimit ("ulimit -n xxxx"; I would recommend contacting your sysadmin to set the ulimit on your system).
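
For reference, a minimal C sketch (not part of SCORPIO or the conversion tool) of how a process can inspect the open-file limit and raise its soft limit up to the hard limit; raising the hard limit itself still needs root or a limits.conf change:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* Read the current RLIMIT_NOFILE (max open file descriptors per process) */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
    {
        perror("getrlimit");
        return 1;
    }
    printf("nofile soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    /* An unprivileged process may raise its own soft limit up to the hard limit */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
    {
        perror("setrlimit");
        return 1;
    }
    printf("nofile soft limit raised to %llu\n", (unsigned long long)rl.rlim_cur);
    return 0;
}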

dqwu commented 1 year ago

Which scorpio release do you use? Is it the latest 1.4.1 or an older release?

xiaohumengdie commented 1 year ago

Thank you very much for your fast reply! I tried running the conversion tool with 32 processes; ulimit -n is 1000, and the total number of data files is 5126, so each process opens fewer than 1000 files. The same error still occurs.

Which scorpio release do you use? Is it the latest 1.4.1 or an older release?

The software stack is hdf5-1_13_3, ADIOS2-2.8.3, netcdf-c-4.9.1-rc1, netcdf-fortran-4.6.0, and scorpio-v1.4.1.

dqwu commented 1 year ago

Could you please verify num_iotasks and stride passed to PIOc_Init_Intracomm (src/clib/pioc.c, line 1180)?

int PIOc_Init_Intracomm(MPI_Comm comp_comm, int num_iotasks, int stride, int base,
                        int rearr, int *iosysidp)

For your settings (pio_numiotasks = 213, pio_stride = 24) I would expect 213 BP data files, but you actually got 5126 files.
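
As a sketch of that check (my own example, not the MITgcm driver code), a minimal standalone program could pass the intended values and print them for comparison with what the model actually hands to PIOc_Init_Intracomm; it assumes SCORPIO's C header pio.h, the box rearranger (pio_rearranger = 1), and a launch with the same 5126 ranks so the stride/iotask layout fits:

#include <stdio.h>
#include <mpi.h>
#include <pio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int num_iotasks = 213;  /* pio_numiotasks in the namelist above */
    int stride = 24;        /* pio_stride in the namelist above */
    int base = 1;           /* pio_root in the namelist above */
    int iosysid;

    /* Print the values on one rank so they can be compared with the model run */
    if (rank == 0)
        printf("PIOc_Init_Intracomm: num_iotasks=%d stride=%d base=%d\n",
               num_iotasks, stride, base);

    PIOc_Init_Intracomm(MPI_COMM_WORLD, num_iotasks, stride, base,
                        PIO_REARR_BOX, &iosysid);

    PIOc_finalize(iosysid);
    MPI_Finalize();
    return 0;
}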

jayeshkrishna commented 1 year ago

Also use "ulimit -a" and check out the per-user limits on the number of processes (Try setting the process limits to unlimited, if possible, or a large value)

dqwu commented 1 year ago

We will create a temp scorpio feature branch with some debug prints for you to test later. It will help us find out why there are 5126 BP data files generated (213 files expected).

xiaohumengdie commented 1 year ago

We will create a temp scorpio feature branch with some debug prints for you to test later. It will help us find out why there are 5126 BP data files generated (213 files expected).

Thank you very much! I also tried a 20x16 case (only 4 ranks) with pio_numiotasks = 2 and pio_stride = 2. The BP directory has only one data file (data.0, md.0, md.idx, mmd.0, profiling.json) and the conversion tool works; shouldn't this case have 2 BP data files? If I use the pnetcdf setting, all cases work well; for an 11520x14580 case, it takes 10 minutes to write a 680 GB file.

jayeshkrishna commented 1 year ago

Please check your per-user limits first. If possible, set them to unlimited as I recommended above and see if the tool runs successfully.

dqwu commented 1 year ago

In summary, there are two issues.
1) POSIX open: errno = 24: Too many open files. For this issue, please check your per-user limit on the number of open files.
2) 5126 BP data files are generated (expected: 213 or fewer). For this issue, we will provide a feature branch with debug prints for testing (should be available in a few days).

Please check your per-user limits first. If possible, set them to unlimited as I recommended above and see if the tool runs successfully.

Here is a related article: https://woshub.com/too-many-open-files-error-linux

xiaohumengdie commented 1 year ago

This is good news: after setting "* hard nofile 97816" and "* soft nofile 97816", the tool runs successfully!

I ran the same case on different computer systems, using 1114 ranks and setting pio_numiotasks = 278 and pio_stride = 4. One computer system produced 47 BP files and the other produced 1114 BP files. After some digging through the output, we realized that something may be going wrong on that machine; I think it will take more debugging to figure out.

Thanks for the quick reply!

dqwu commented 1 year ago

@xiaohumengdie A temp scorpio feature branch (provided by @tkurc) is now available for testing:

cd /path/to/scorpio
git fetch origin
git checkout tkurc/adios2_test

In addition to some debug prints, it also tries to reduce the number of BP data files if the ratio of num_comptasks to io_group_size is greater than 512.
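
A hypothetical sketch of the kind of capping logic described above (not the actual code in the feature branch; the 512 threshold is taken from the sentence above):

#include <stdio.h>

/* Hypothetical sketch: if num_comptasks / io_group_size would exceed 512 BP writers,
 * enlarge the effective group size so that at most 512 data.N files are produced. */
static int effective_io_group_size(int num_comptasks, int io_group_size)
{
    const int max_writers = 512;                  /* threshold quoted above */
    int writers = num_comptasks / io_group_size;
    if (writers > max_writers)
        io_group_size = (num_comptasks + max_writers - 1) / max_writers;  /* ceil */
    return io_group_size;
}

int main(void)
{
    /* Summit example above: 1344 tasks, group size 21 -> 64 writers, unchanged */
    printf("%d\n", effective_io_group_size(1344, 21));
    /* Group size 1 with 1114 tasks -> 1114 writers, so the group size is enlarged */
    printf("%d\n", effective_io_group_size(1114, 1));
    return 0;
}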

Please use this feature branch to rebuild and rerun your cases on different computer systems you have access to.

For your original case run with 5126 BP data files generated (and/or another case run with 1114 BP data files generated), see if the number of BP data files will be reduced.

For each case run, please also paste the debug print lines (starting with ADIOS2_INFO) for us to take a look. Example debug prints on Summit of ORNL (1344 MPI tasks, 16 compute nodes, 84 tasks per node, pio_root = 0, pio_stride = 21, pio_numiotasks = 64):

ADIOS2_INFO: INIT_ADIOS_1 mpi_rank: 0 num_iotasks: 64 num_comptasks: 1344 io_group_size: 21 nodeNProc: 84 nodeRank: 0
ADIOS2_INFO: INIT_ADIOS_1 mpi_rank: 1343 num_iotasks: 64 num_comptasks: 1344 io_group_size: 21 nodeNProc: 84 nodeRank: 83
ADIOS2_INFO: INIT_ADIOS_2 mpi_rank: 0 num_iotasks: 64 num_comptasks: 1344 io_group_size: 21 adios_rank: 0 num_adiostasks: 64
ADIOS2_INFO: INIT_ADIOS_2 mpi_rank: 1323 num_iotasks: 64 num_comptasks: 1344 io_group_size: 21 adios_rank: 63 num_adiostasks: 64
...
ADIOS2_INFO: CREATE_FILE mpi_rank: 0 num_iotasks: 64 num_comptasks: 1344 adios_rank: 0 num_adiostasks: 64
ADIOS2_INFO: CREATE_FILE mpi_rank: 1323 num_iotasks: 64 num_comptasks: 1344 adios_rank: 63 num_adiostasks: 64
...
xiaohumengdie commented 1 year ago

@dqwu Thank you so much for this very helpful instruction!

Debug prints as follows:

One computer system:

ADIOS2_INFO: INIT_ADIOS_1 mpi_rank: 1113 num_iotasks: 278 num_comptasks: 1114 io_group_size: 5 nodeNProc: 10 nodeRank: 9
ADIOS2_INFO: INIT_ADIOS_1 mpi_rank: 0 num_iotasks: 278 num_comptasks: 1114 io_group_size: 5 nodeNProc: 24 nodeRank: 0
ADIOS2_INFO: INIT_ADIOS_2 mpi_rank: 0 num_iotasks: 278 num_comptasks: 1114 io_group_size: 5 adios_rank: 0 num_adiostasks: 232
ADIOS2_INFO: INIT_ADIOS_2 mpi_rank: 1109 num_iotasks: 278 num_comptasks: 1114 io_group_size: 5 adios_rank: 231 num_adiostasks: 232

ADIOS2_INFO: CREATE_FILE mpi_rank: 1109 num_iotasks: 278 num_comptasks: 1114 adios_rank: 231 num_adiostasks: 232
ADIOS2_INFO: CREATE_FILE mpi_rank: 0 num_iotasks: 278 num_comptasks: 1114 adios_rank: 0 num_adiostasks: 232
ADIOS2_INFO: CLOSE_FILE mpi_rank: 0 num_iotasks: 278 num_comptasks: 1114 adios_rank: 0 num_adiostasks: 232
ADIOS2_INFO: CLOSE_FILE mpi_rank: 1109 num_iotasks: 278 num_comptasks: 1114 adios_rank: 231 num_adiostasks: 232
The other computer system:

ADIOS2_INFO: INIT_ADIOS_1 mpi_rank: 0 num_iotasks: 278 num_comptasks: 1114 io_group_size: 1 nodeNProc: 1 nodeRank: 0
ADIOS2_INFO: INIT_ADIOS_1 mpi_rank: 1113 num_iotasks: 278 num_comptasks: 1114 io_group_size: 1 nodeNProc: 1 nodeRank: 0
ADIOS2_INFO: INIT_ADIOS_2 mpi_rank: 0 num_iotasks: 278 num_comptasks: 1114 io_group_size: 1 adios_rank: 0 num_adiostasks: 1114
ADIOS2_INFO: INIT_ADIOS_2 mpi_rank: 1113 num_iotasks: 278 num_comptasks: 1114 io_group_size: 1 adios_rank: 1113 num_adiostasks: 1114

ADIOS2_INFO: CREATE_FILE mpi_rank: 0 num_iotasks: 278 num_comptasks: 1114 adios_rank: 0 num_adiostasks: 1114
ADIOS2_INFO: CREATE_FILE mpi_rank: 1113 num_iotasks: 278 num_comptasks: 1114 adios_rank: 1113 num_adiostasks: 1114
ADIOS2_INFO: CLOSE_FILE mpi_rank: 1113 num_iotasks: 278 num_comptasks: 1114 adios_rank: 1113 num_adiostasks: 1114
ADIOS2_INFO: CLOSE_FILE mpi_rank: 0 num_iotasks: 278 num_comptasks: 1114 adios_rank: 0 num_adiostasks: 1114

The temp scorpio feature branch is very useful. The problem occurs in the following code.

mpierr = MPI_Comm_split_type(ios->union_comm, MPI_COMM_TYPE_SHARED, 0, info, &nodeComm);
if (mpierr != MPI_SUCCESS)
{
    return check_mpi(ios, NULL, mpierr, __FILE__, __LINE__);
}

MPI_Comm_rank(nodeComm, &nodeRank);
MPI_Comm_size(nodeComm, &nodeNProc);

On the other HPC system, nodeNProc prints the value 1 instead of 4, even though each node has 4 processors, and it still outputs 1114 BP files.

In addition to some debug prints, it also tries to reduce the number of BP data files if the ratio of num_comptasks to io_group_size is greater than 512.

I think it is a problem with the MPI on our system. Thank you again.
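
For anyone hitting the same symptom, a minimal standalone program (my own sketch, independent of SCORPIO) can show whether an MPI installation splits ranks into per-node shared-memory communicators correctly; on a healthy MPI, nodeNProc should equal the number of ranks on the node rather than 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int worldRank;
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

    /* Same kind of split as in src/clib/pioc.c above: group ranks that share a node */
    MPI_Comm nodeComm = MPI_COMM_NULL;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodeComm);

    int nodeRank, nodeNProc;
    MPI_Comm_rank(nodeComm, &nodeRank);
    MPI_Comm_size(nodeComm, &nodeNProc);

    printf("world rank %d: nodeRank = %d, nodeNProc = %d\n",
           worldRank, nodeRank, nodeNProc);

    MPI_Comm_free(&nodeComm);
    MPI_Finalize();
    return 0;
}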

xiaohumengdie commented 1 year ago

@dqwu

Due to the MPI problem on our system, I changed the code (src/clib/pioc.c, line 1056), which fixed the problem.

Original code

MPI_Comm nodeComm = MPI_COMM_NULL;
int nodeNProc, nodeRank;
mpierr = MPI_Comm_split_type(ios->union_comm, MPI_COMM_TYPE_SHARED, 0, info, &nodeComm);
The modified code

int world_rank, world_size;
MPI_Comm_rank(ios->union_comm, &world_rank);
MPI_Comm_size(ios->union_comm, &world_size);

int sw_color = world_rank / ios->num_iotasks;

MPI_Comm nodeComm = MPI_COMM_NULL;
int nodeNProc, nodeRank;
mpierr = MPI_Comm_split(ios->union_comm, sw_color, world_rank, &nodeComm);
dqwu commented 1 year ago

@xiaohumengdie Thanks for your update.

In your modified code, to emulate MPI_Comm_split_type (which does not work as expected due to the MPI problem on your system) with MPI_Comm_split, you used the following formula to calculate the color: int sw_color = world_rank / ios->num_iotasks;

Should ios->num_iotasks (a non-fixed value depending on pio_stride) instead be the number of processes per compute node (a constant on a specific system)?

int sw_color = world_rank / number_of_processes_per_compute_node; // On Summit of ORNL, this constant is 84

If the MPI on a system has no issues, an MPI_Comm_split_type call with the MPI_COMM_TYPE_SHARED argument can figure out the number of processes per compute node (e.g., 84 on a single Summit compute node). It is supposed to create sub-communicators such that each sub-communicator consists of the processes running on the same node (processes that can form a shared-memory group).

xiaohumengdie commented 1 year ago

@dqwu Thanks for the quick reply!

In extreme cases (very few processes per node), number_of_processes_per_compute_node may not work very well, which leads to a lot of BP files. By using ios->num_iotasks, I can flexibly control the number of BP files I want.

Should ios->num_iotasks (a non-fixed value depending on pio_stride) instead be the number of processes per compute node (a constant on a specific system)?

int sw_color = world_rank / number_of_processes_per_compute_node; // On Summit of ORNL, this constant is 84

dqwu commented 1 year ago

@dqwu Thanks for the quick reply!

In extreme cases (very few processes per node), number_of_processes_per_compute_node may not work very well, which leads to a lot of BP files. By using ios->num_iotasks, I can flexibly control the number of BP files I want.

Should ios->num_iotasks (a non-fixed value depending on pio_stride) instead be the number of processes per compute node (a constant on a specific system)?

int sw_color = world_rank / number_of_processes_per_compute_node; // On Summit of ORNL, this constant is 84

OK, this makes sense to me. It is like creating virtual nodes with num_iotasks processes per node (in case there are very few processes per physical node).
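
As a quick illustration of that grouping (my own sketch, using the 1114-rank, pio_numiotasks = 278 case from this thread):

#include <stdio.h>

/* Illustration of the "virtual node" grouping: sw_color = world_rank / 278 yields
 * colors 0..4 for 1114 ranks, i.e. five virtual nodes, the last of which holds
 * only ranks 1112-1113. */
int main(void)
{
    const int world_size = 1114;   /* ranks in the case above */
    const int num_iotasks = 278;   /* pio_numiotasks in the case above */

    int groups = (world_size + num_iotasks - 1) / num_iotasks;  /* ceiling division */
    printf("virtual nodes (distinct colors): %d\n", groups);

    for (int color = 0; color < groups; color++)
    {
        int first = color * num_iotasks;
        int last = (color + 1) * num_iotasks - 1;
        if (last > world_size - 1)
            last = world_size - 1;
        printf("color %d: ranks %d..%d\n", color, first, last);
    }
    return 0;
}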