GlobalArrays / ga

Partitioned Global Address Space (PGAS) library for distributed arrays
http://hpc.pnl.gov/globalarrays/

comex msg collectives should not do comex_barrier #326

Open jeffhammond opened 5 months ago

jeffhammond commented 5 months ago

In the original ARMCI source code, the ARMCI msg collectives were equivalent to their MPI counterparts when MPI was used as the messaging layer.

In ARMCI (armci/src/collectives/message.c):

void parmci_msg_barrier()
{
#ifdef BGML
    bgml_barrier(3); /* this is always faster than MPI_Barrier() */
#elif defined(MSG_COMMS_MPI)
    MPI_Barrier(ARMCI_COMM_WORLD);
#elif defined(PVM)
    pvm_barrier(mp_group_name, armci_nproc);
#elif defined(LAPI)
#  if !defined(NEED_MEM_SYNC)
    if (_armci_barrier_init)
        _armci_msg_barrier();
    else
#  endif
    {
        tcg_synch(ARMCI_TAG);
    }
#else
    {
        tcg_synch(ARMCI_TAG);
    }
#endif
}

Now, in Comex (comex/src-armci/message.c):

void parmci_msg_barrier()
{
    comex_barrier(ARMCI_Default_Proc_Group);
    MPI_Barrier(get_default_comm());
}
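
For context, comex_barrier is itself a fence plus a barrier. A rough sketch of its shape in the MPI-based Comex ports (paraphrased for illustration, not a verbatim quote of the comex sources):

/* Approximate shape of comex_barrier(); paraphrased, not verbatim. */
int comex_barrier(comex_group_t group)
{
    MPI_Comm comm;

    /* Flush outstanding one-sided operations to every rank; this
     * all-rank fence is presumably where the O(n) cost comes from. */
    comex_fence_all(group);

    /* Then synchronize the processes in the group. */
    comex_group_comm(group, &comm);
    MPI_Barrier(comm);

    return COMEX_SUCCESS;
}

If that sketch is accurate, parmci_msg_barrier above performs an all-rank fence followed by two MPI_Barrier calls on every invocation.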

comex_barrier is an expensive O(n) operation. It adds nontrivial overhead to parmci_msg_barrier, particularly since the most common use of this operation in GA is immediately after ARMCI_AllFence, which already performs all of the synchronization that comex_barrier does.
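
A minimal sketch of the change this issue suggests, assuming the MPI-based configuration shown above (an illustration, not a committed patch): drop the comex_barrier call so that parmci_msg_barrier reduces to a plain MPI_Barrier, matching the original ARMCI behavior, and leave remote completion to the caller.

void parmci_msg_barrier()
{
    /* Process synchronization only.  Remote completion is the
     * caller's responsibility, e.g. ARMCI_AllFence() immediately
     * before this call, which is the common GA usage pattern. */
    MPI_Barrier(get_default_comm());
}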