GlobalArrays / ga

Partitioned Global Address Space (PGAS) library for distributed arrays
http://hpc.pnl.gov/globalarrays/

MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT #141

Closed jarrah42 closed 5 years ago

jarrah42 commented 5 years ago

I'm trying to run a problem in NWChem, but it fails with what looks like an error in the GA code. The job is running on Titan (a Cray XK7). I first opened this issue with NWChem, but they directed me here. An abbreviated log is included below. Any help would be appreciated.

                  NWChem Extensible Many-Electron Theory Module
                   ---------------------------------------------

              ======================================================
                   This portion of the program was automatically
                  generated by a Tensor Contraction Engine (TCE).
                  The development of this portion of the program
                 and TCE was supported by US Department of Energy,
                Office of Science, Office of Basic Energy Science.
                      TCE is a product of Battelle and PNNL.
              Please cite: S.Hirata, J.Phys.Chem.A 107, 9887 (2003).
              ======================================================

                                E and Grad of 1UBQ

            General Information
            -------------------
      Number of processors :   960
         Wavefunction type : Restricted Hartree-Fock
          No. of electrons :   292
           Alpha electrons :   146
            Beta electrons :   146
           No. of orbitals :   848
            Alpha orbitals :   424
             Beta orbitals :   424
        Alpha frozen cores :     0
         Beta frozen cores :     0
     Alpha frozen virtuals :     0
      Beta frozen virtuals :     0
         Spin multiplicity : singlet 
    Number of AO functions :   424
       Number of AO shells :   272
        Use of symmetry is : off
      Symmetry adaption is : off
         Schwarz screening : 0.10D-09

          Correlation Information
          -----------------------
          Calculation type : Coupled-cluster singles & doubles                           
   Perturbative correction : none                                                        
            Max iterations :      100
        Residual threshold : 0.10D-02
     T(0) DIIS level shift : 0.00D+00
     L(0) DIIS level shift : 0.00D+00
     T(1) DIIS level shift : 0.00D+00
     L(1) DIIS level shift : 0.00D+00
     T(R) DIIS level shift : 0.00D+00
     T(I) DIIS level shift : 0.00D+00
   CC-T/L Amplitude update :  5-th order DIIS
                I/O scheme : Global Array Library
        L-threshold :  0.10D-02
        EOM-threshold :  0.10D-02
 no EOMCCSD initial starts read in
 hftype RHF 
 TCE RESTART OPTIONS
 READ_INT:    F
 WRITE_INT:   T
 READ_TA:     F
 WRITE_TA:    F
 READ_XA:     F
 WRITE_XA:    F
 READ_IN3:    F
 WRITE_IN3:   F
 SLICE:       F
 D4D5:        F

            Memory Information
            ------------------
          Available GA space size is    ********** doubles
          Available MA space size is      87250708 doubles

 Maximum block size supplied by input
 Maximum block size        40 doubles

 tile_dim =     40

 Block   Spin    Irrep     Size     Offset   Alpha
 -------------------------------------------------
   1    alpha     a     36 doubles       0       1
   2    alpha     a     37 doubles      36       2
   3    alpha     a     36 doubles      73       3
   4    alpha     a     37 doubles     109       4
   5    beta      a     36 doubles     146       1
   6    beta      a     37 doubles     182       2
   7    beta      a     36 doubles     219       3
   8    beta      a     37 doubles     255       4
   9    alpha     a     39 doubles     292       9
  10    alpha     a     40 doubles     331      10
  11    alpha     a     40 doubles     371      11
  12    alpha     a     39 doubles     411      12
  13    alpha     a     40 doubles     450      13
  14    alpha     a     40 doubles     490      14
  15    alpha     a     40 doubles     530      15
  16    beta      a     39 doubles     570       9
  17    beta      a     40 doubles     609      10
  18    beta      a     40 doubles     649      11
  19    beta      a     39 doubles     689      12
  20    beta      a     40 doubles     728      13
  21    beta      a     40 doubles     768      14
  22    beta      a     40 doubles     808      15

 Global array virtual files algorithm will be used

 Parallel file system coherency ......... OK
 size_1e                   179776
  0 ga offset                   0 size_xx_perproc               44944mx    4
 WRITE TENSOR
  filename: /lustre/atlas/scratch/gw6/csc297/nwc_gbe_631g.001024.4515638/nwc_gbe_dat.f1int.0
  unit nr:       77
  1 ga offset               44944 size_xx_perproc               44944mx    4
  file size:          44944
  rec_mem (KB):     2048
  rec_size:         262144
  number of tasks:            1
  3 ga offset              134832 size_xx_perproc               44944mx    4
  2 ga offset               89888 size_xx_perproc               44944mx    4

 Fock matrix recomputed
 1-e file size   =           179776
 1-e file name   = .../nwc_gbe_631g.001024.4515638/nwc_gbe_dat.f1int.000000
 Cpu & wall time / sec            1.4            1.4
 4-electron integrals stored in orbital form

 v2    file size   =       4882398943
 4-index algorithm nr.  15 is used
 imaxsize =       45
 imaxsize ichop =        0
 starting step 1 at                49.00 secs 
 starting step 2 at               125.62 secs 
 starting step 3 at               138.14 secs 
 starting step 4 at               148.25 secs 
 done step 4 at               162.91 secs 
  1 ga offset          1220599735 size_xx_perproc          1220599735mx    4
  2 ga offset          2441199470 size_xx_perproc          1220599735mx    4
  0 ga offset                   0 size_xx_perproc          1220599735mx    4
 WRITE TENSOR
  filename: .../nwc_gbe_631g.001024.4515638/nwc_gbe_dat.v2int.0
  unit nr:      178
  file size:     1220599735
  rec_mem (KB):     2048
  rec_size:         262144
  number of tasks:         4657
  3 ga offset          3661799205 size_xx_perproc          1220599738mx    4
p[1] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2095808
p[1] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[1] count[0]: 2095808 stride[0]: 8
p[1] count[1]: 1
p[1] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[1] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1e50, array_of_subsizes=0x7fffffff1e70, array_of_starts=0x7fffffff1e90, order=57, MPI_BYTE, newtype=0x7fffffff1f14) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2095808 but must be within [0,8]
p[1] Error in nb_gets_datatype:MPI_Type_commit
p[1] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1f14) failed
PMPI_Type_commit(90).: Invalid datatype
{1} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
p[2] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2096248
p[2] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[2] count[0]: 2096248 stride[0]: 8
p[2] count[1]: 1
p[2] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[2] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1e50, array_of_subsizes=0x7fffffff1e70, array_of_starts=0x7fffffff1e90, order=57, MPI_BYTE, newtype=0x7fffffff1f14) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2096248 but must be within [0,8]
p[2] Error in nb_gets_datatype:MPI_Type_commit
p[2] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1f14) failed
PMPI_Type_commit(90).: Invalid datatype
{2} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
Rank 1 [Wed Feb  6 16:20:24 2019] [c4-1c0s4n3] application called MPI_Abort(comm=0x84000002, 472525059) - process 1
Rank 2 [Wed Feb  6 16:20:24 2019] [c4-1c0s4n3] application called MPI_Abort(comm=0x84000002, 539633923) - process 2
p[3] ERROR [strided_to_subarray_dtype]
stride: 1
stride_array[0]: 8
array_of_sizes[0]: 8
array_of_subsizes[0]: 2096688
p[3] Error forming MPI_Datatype for one-sided strided operation. Check that stride dimensions are compatible with local block dimensions
p[3] count[0]: 2096688 stride[0]: 8
p[3] count[1]: 1
p[3] Error in strided_to_subarray_dtype:MPI_Type_create_subarray
p[3] MPI_Error: Invalid argument, error stack:
MPI_Type_create_subarray(344): MPI_Type_create_subarray(ndims=2, array_of_sizes=0x7fffffff1e50, array_of_subsizes=0x7fffffff1e70, array_of_starts=0x7fffffff1e90, order=57, MPI_BYTE, newtype=0x7fffffff1f14) failed
MPI_Type_create_subarray(113): Argument array_of_subsizes has value 2096688 but must be within [0,8]
p[3] Error in nb_gets_datatype:MPI_Type_commit
p[3] MPI_Error: Invalid datatype, error stack:
PMPI_Type_commit(131): MPI_Type_commit(datatype_p=0x7fffffff1f14) failed
PMPI_Type_commit(90).: Invalid datatype
{3} MPI Error: ../../ga-5.7/comex/src-mpi-pr/comex.c: line 4297: DEFAULT
Rank 3 [Wed Feb  6 16:20:24 2019] [c4-1c0s4n3] application called MPI_Abort(comm=0x84000002, 808069379) - process 3
_pmiu_daemon(SIGCHLD): [NID 02263] [c4-1c0s4n3] [Wed Feb  6 16:20:25 2019] PE RANK 2 exit signal Aborted
[NID 02263] 2019-02-06 16:20:25 Apid 19661405: initiated application termination
Application 19661405 exit codes: 134
Application 19661405 exit signals: Killed
Application 19661405 resources: utime ~288s, stime ~1191s, Rss ~1265644, inblocks ~4228739, outblocks ~14931009

jeffhammond commented 5 years ago

@edoapra corrected my false assumption that this was a GA bug. It was a bug in NWChem that has already been fixed.
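
For anyone reading the log above: the first failure is the range check that `MPI_Type_create_subarray` reports, where `array_of_subsizes[0]` (2095808) is larger than `array_of_sizes[0]` (8). The later `MPI_Type_commit` error follows from that. Below is a minimal sketch of that kind of check in plain Python. The function `check_subarray_args` is hypothetical and is not part of GA, comex, or any MPI binding. The subsize range mirrors the message in the log, while the check on the start offsets is an assumption based on the subarray having to fit inside the full array.

```python
def check_subarray_args(sizes, subsizes, starts):
    """Hypothetical helper: return a list of problems with subarray arguments.

    An empty list means every dimension passed. This is an illustration of
    the range check seen in the log, not code from GA, comex, or MPI.
    """
    if not (len(sizes) == len(subsizes) == len(starts)):
        return ["sizes, subsizes and starts must have the same length"]

    problems = []
    for dim, (size, sub, start) in enumerate(zip(sizes, subsizes, starts)):
        if not 0 <= sub <= size:
            # Same shape as the message in the log:
            # "has value 2095808 but must be within [0,8]"
            problems.append(f"subsizes[{dim}]={sub} must be within [0,{size}]")
        elif not 0 <= start <= size - sub:
            # Assumption: the subarray must lie entirely inside the full array.
            problems.append(f"starts[{dim}]={start} must be within [0,{size - sub}]")
    return problems


if __name__ == "__main__":
    # Dimension-0 values taken from the p[1] lines of the log; start 0 is assumed.
    for problem in check_subarray_args([8], [2095808], [0]):
        print(problem)
```

Running a check like this on the count and stride values before they reach the MPI layer would point at the caller's arguments rather than at `comex.c`, which matches the conclusion that the request coming from NWChem was the problem.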