Reimplemented particles to use AoS with block communication

I have restructured the particles class. Changes include:

AoS (array of structures) is now the canonical representation of particles. Particle arrays are always allocated in 64-byte-aligned memory so that each particle (8 doubles, i.e. 64 bytes) occupies exactly one 64-byte cache line.
There is no longer any imposed limit on the number of particles that can be communicated. Particles are stored in Larray buffers (i.e. std::vector) and are added with calls like pcl_buffer.push_back(pcl), which automatically reallocates the buffer as necessary.
I use processor-independent application of boundary conditions to ensure that particle communication completes within 2*(XLEN+YLEN+ZLEN) iterations of communication.
I enforce boundary conditions by calling virtual methods such as apply_Xrght_BC(), which takes the list of particles to which the boundary conditions for the right edge of the domain must be applied. The eventual intent is that the user will derive from the particle solver and override these methods where appropriate in order to implement user-defined boundary conditions.
I implemented BCs via MPI self-communication. This is a coding shortcut that could be eliminated if it turns out to be a problem. Note that for the GEM problem, to avoid lots of communication in the periodic z direction and to accelerate convergence of the field solver, you should make Lz large (e.g. the same as Lx and Ly).
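The layout and the boundary-condition hook described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Pcl` field names, the `ParticleSolver` and `ReflectingSolver` classes, the periodic default, and the use of std::vector in place of Larray are all hypothetical, and only the method name apply_Xrght_BC() comes from this change. It assumes C++17, where std::vector honors the over-alignment of its element type.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical 8-double particle record; field names are illustrative only.
struct alignas(64) Pcl {
    double x, y, z;  // position
    double u, v, w;  // velocity
    double q;        // charge
    double id;       // particle ID stored as a double
};
static_assert(sizeof(Pcl) == 64, "one particle should fill one cache line");
static_assert(alignof(Pcl) == 64, "particles should start on a cache line");

// True if ptr sits on a 64-byte boundary.
inline bool is_cache_aligned(const void* ptr) {
    return reinterpret_cast<std::uintptr_t>(ptr) % 64 == 0;
}

// Hypothetical stand-in for the particle solver.
class ParticleSolver {
public:
    explicit ParticleSolver(double Lx) : Lx_(Lx) {}
    virtual ~ParticleSolver() = default;

    // Receives the particles that crossed the right edge in x.
    // Assumed default for this sketch: periodic wrap-around.
    virtual void apply_Xrght_BC(std::vector<Pcl>& pcls) {
        for (Pcl& p : pcls) {
            assert(p.x >= Lx_);  // caller passes only exiting particles
            p.x -= Lx_;
        }
    }

protected:
    double Lx_;  // domain length in x
};

// User-defined override: specular reflection at the right wall.
class ReflectingSolver : public ParticleSolver {
public:
    using ParticleSolver::ParticleSolver;

    void apply_Xrght_BC(std::vector<Pcl>& pcls) override {
        for (Pcl& p : pcls) {
            p.x = 2.0 * Lx_ - p.x;  // mirror position about the wall
            p.u = -p.u;             // reverse normal velocity
        }
    }
};
```

A caller would collect the particles that left through the right edge into a buffer and pass that buffer to apply_Xrght_BC() through a base-class reference, so the user's override is picked up without changes to the communication code.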

In the process, I also did the following:

I created a pclIDgenerator class for particle IDs.
I used double precision rather than long long to represent particle IDs.
I implemented support for nxc/XLEN to be noninteger. (This has not yet been tested.)
I implemented a fast 8x8 transpose for the MIC and used it to convert particles between AoS and SoA. This could be extended to Xeon by implementing the same method with AVX 256-bit intrinsics instead of MIC 512-bit intrinsics.
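For reference, the AoS/SoA conversion mentioned above amounts to transposing an 8x8 block of doubles: 8 particles of 8 fields each become 8 field arrays of 8 values each. The scalar sketch below shows only that data movement; it is not the intrinsics implementation from this change, and the names `Block8x8` and `transpose_8x8` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// 8 rows of 8 doubles. Read as AoS, row i is particle i;
// after transposing, row j holds field j of all 8 particles (SoA).
using Block8x8 = std::array<std::array<double, 8>, 8>;

// Scalar in-place transpose: swap each element above the diagonal
// with its mirror image below the diagonal.
inline void transpose_8x8(Block8x8& m) {
    for (std::size_t i = 0; i < 8; ++i) {
        for (std::size_t j = i + 1; j < 8; ++j) {
            std::swap(m[i][j], m[j][i]);
        }
    }
}
```

Applying the transpose twice restores the original block, so the same routine serves for both AoS to SoA and SoA to AoS.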

Internal changes to the code include:

I eliminated the distinction between the processor topologies of fields and particles. This distinction was never properly made. If we want to introduce it in the future, we should first separate the particle and field solvers.
I consolidated random sampling code so that there is a single point in the code (ipicmath.h) that samples from a Maxwellian distribution or unit interval.
Particles3D::particle_repopulator() is now much more efficient. Instead of traversing the list of particles six times, deleting and repopulating particles on each pass, it now traverses the list once to delete particles, then creates the repopulated particles and appends them to the end of the list.
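The single-pass pattern described for the repopulator can be sketched as below. This is a simplified illustration, not the actual Particles3D code: the `Pcl` struct, the one-dimensional layer test, the swap-with-last deletion, and the function names are all assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical particle record (8 doubles).
struct Pcl { double x, y, z, u, v, w, q, id; };

// Hypothetical test: the repopulation layer is everything outside [lo, hi] in x.
inline bool in_repopulation_layer(const Pcl& p, double lo, double hi) {
    return p.x < lo || p.x > hi;
}

// One traversal: delete each particle in the layer by overwriting it with
// the last particle and shrinking the list (particle order is not preserved).
inline std::size_t delete_in_layer(std::vector<Pcl>& pcls, double lo, double hi) {
    std::size_t removed = 0;
    std::size_t i = 0;
    while (i < pcls.size()) {
        if (in_repopulation_layer(pcls[i], lo, hi)) {
            pcls[i] = pcls.back();
            pcls.pop_back();
            ++removed;  // do not advance: slot i now holds an unchecked particle
        } else {
            ++i;
        }
    }
    return removed;
}

// Delete once, then append all newly created particles at the end.
inline void repopulate(std::vector<Pcl>& pcls, double lo, double hi,
                       const std::vector<Pcl>& fresh) {
    delete_in_layer(pcls, lo, hi);
    pcls.insert(pcls.end(), fresh.begin(), fresh.end());
}
```

Because new particles are only appended after the deletion pass, they are never revisited or deleted again, which is what makes a single traversal sufficient in this sketch.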

CmPA / iPic3D