JohanSchott / impurityModel

Calculate many-body states of an impurity Anderson model and spectra (e.g. XPS, XAS, RIXS, NIXS)
MIT License

Suggestion: Use the int representation of a product state in the calculations instead of the bytes/bitarray representation #54

Closed. JohanSchott closed this issue 3 months ago.

JohanSchott commented 3 months ago

At the moment we use bitarray from the bitarray package to represent a product state.

We store Hamiltonian matrix elements in sparse format in a dict. The product states are the dict keys, so we also need an immutable representation of a product state. We get this with bitarray's .tobytes() method.
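
For concreteness, a minimal sketch of this pattern (illustrative only, not the actual impurityModel code):

from bitarray import bitarray

# A product state as a bitarray over the spin-orbitals.
state = bitarray("10100")  # spin-orbitals 0 and 2 occupied
key = state.tobytes()      # immutable and hashable, so usable as a dict key

# Sparse Hamiltonian elements keyed by (bra, ket) product states.
H = {(key, key): -1.5}

# To modify a state, convert back to a mutable bitarray.
mutable = bitarray()
mutable.frombytes(key)     # note: padded up to a whole number of bytes
mutable = mutable[:5]      # trim back to the number of spin-orbitals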

Instead of using bitarray, one could use a single Python built-in integer to represent a product state. It is immutable, so there would be no need to convert between two representations during the calculations. Changing a state (removing or adding an electron) should be roughly as fast with the integer representation as with bitarray, since bit-shifts are used to modify the integer representation (since https://github.com/JohanSchott/impurityModel/pull/47).

If we use the integer representation in the calculations, we could remove bitarray as a representation of a product state altogether.
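
A minimal sketch of the integer representation (hypothetical helper names; fermionic sign factors are omitted for brevity):

def create(state: int, i: int) -> int | None:
    """Apply the creation operator for spin-orbital i (bit i) to a product state."""
    mask = 1 << i
    if state & mask:
        return None  # already occupied (Pauli principle)
    return state | mask


def annihilate(state: int, i: int) -> int | None:
    """Apply the annihilation operator for spin-orbital i to a product state."""
    mask = 1 << i
    if not state & mask:
        return None  # nothing to remove
    return state & ~mask


s = 0b101  # spin-orbitals 0 and 2 occupied
assert create(s, 1) == 0b111
assert annihilate(s, 2) == 0b001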

johanjoensson commented 3 months ago

In my fork, where I calculate the self-energy of the impurity and do self-consistent DMFT, I have had to add a lot of MPI communication of product states. This change would be incompatible with basically all my changes.

Also, this might cause problems with large numbers of spin-orbitals when communicating via MPI and numpy. Python's int has no fixed bit length, but all MPI and numpy types do. For instance, numpy by default maps Python's int to a long (which is system dependent).
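
For example, a small demonstration of the fixed-width limitation:

import numpy as np

x = 2**80  # fine: Python ints have arbitrary precision
try:
    np.array([x], dtype=np.int64)  # a fixed 64-bit dtype cannot hold it
except OverflowError as e:
    print(e)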

I would recommend using bytes/bytearray for representing states, instead of int.

JohanSchott commented 3 months ago

Great input, @johanjoensson!

Happy to hear about your DMFT work. Cool!

It sounds like your code explicitly uses the fact that bytes/bitarray is the product-state representation, is that right? I naively thought that the representation would not matter in most cases. Out of curiosity, can you show a short example where the int representation would not work? If changing to the int representation would be a lot of work, it sounds better to continue using bytes/bitarray, since the potential gains (skipping the conversions between bytes and bitarray, and dropping the dependency on the bitarray package) are not very big.

I think you raised important points about numpy and MPI. In the short script below, it seems possible to put a big Python int into a numpy array and then send it to another MPI rank. But the numpy dtype becomes "object", so the potential benefit of storing several states in a numpy array (instead of, e.g., a Python list) is gone, I guess. Do you have a better example where numpy or MPI can't handle big ints (not even with the "object" dtype)? (I'm aware one gets nonsense if one has a numpy array of dtype np.int64 and tries to fill it with values bigger than an int64 can hold.)

# Run with at least two MPI ranks, e.g.: mpiexec -n 2 python script.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.rank

if rank == 0:
    # Bigger than 2**64-1
    i = 2**80
    print(f"int: {rank = }, {i = }")
    # ndarray with "object" dtype, if input values are big enough to not fit in int64
    i = np.array([i])
    print(f"First ndarray: {rank = }, {i = }")
    # Add one to all "object" elements
    i += 1
    print(f"Second ndarray: {rank = }, {i = }")
    # Slow unbuffered communication
    comm.send(i, dest=1, tag=0)
elif rank == 1:
    i = comm.recv(source=0, tag=0)
else:
    i = 3

print(f"Final: {rank = }, {i = }")

johanjoensson commented 3 months ago

Yes, I found it easier to just force the DMFT calculation to use the bytes/bitarray representation.

The Python "Object" will always work, however it is only available when using the lower case mpi4py functions (send, recv, reduce, etc.). These functions start by pickling the data to be sent, and then sends the pickled object. For performance reasons, this becomes very slow when we need to communicate large (or many) objects.

Instead, mpi4py recommends using the upper-case mpi4py functions (Send, Recv, Reduce, etc.). These functions work with any Python buffer type (numpy ndarray, bytearray, etc.) and are much closer to the Fortran/C MPI calls, in terms of performance at least. But they require that your data can be represented by MPI datatypes, and a Python object is unfortunately not an MPI datatype.

This only becomes a problem when communicating large numbers of product states, which is rather unusual, I agree. However, if we run on a supercomputer with 100+ MPI ranks (or even 1000+), it becomes necessary to not always store all product states in the basis on all ranks. Instead, what I do is distribute the product states over all the MPI ranks, which unfortunately means I need to be able to send product states between ranks. Since the size of the basis grows exponentially with the number of spin-orbitals, this quickly leads to massive amounts of data being sent over MPI, and the upper-case functions are quite helpful then.

JohanSchott commented 3 months ago

Thanks for reminding me about the upper-case mpi4py functions.

Ok, now I see the advantage of the bytes/bitarray representation over the int representation: one can have a numpy array of fixed-width bytes elements, but not a numpy array of big ints (other than with the "object" dtype).

Let's close this issue (without any further action).

Is the script below similar to how you send states as a numpy array of bytes?

Small script:

# Run with at least two MPI ranks, e.g.: mpiexec -n 2 python script.py
from mpi4py import MPI
import numpy as np
from bitarray import bitarray
from time import time
import math

comm = MPI.COMM_WORLD
rank = comm.rank

n_spin_orbitals = 200
# One product state
ps = bitarray("1"*n_spin_orbitals).tobytes()
n_bytes = math.ceil(n_spin_orbitals / 8)
dtype = np.dtype(f"|S{n_bytes}")
# Number of product states
n_ps = 10**6

if rank == 0:
    states = tuple(ps for _ in range(n_ps))

# Slow unbuffered communication
comm.Barrier()
t_unbuff = time()
if rank == 0:
    comm.send(states, dest=1, tag=0)
elif rank == 1:
    states = comm.recv(source=0, tag=0)
comm.Barrier()
t_unbuff = time() - t_unbuff

# Fast buffered communication
if rank == 0:
    # First convert to ndarray
    t_convert = time()
    states_numpy = np.array(states, dtype=dtype)
    t_convert = time() - t_convert
    assert states_numpy.dtype == dtype

comm.Barrier()
t_buff = time()
if rank == 0:
    # Fast buffered communication
    comm.Send([states_numpy, MPI.CHAR], dest=1, tag=7)
elif rank == 1:
    # Initialize numpy array
    states_numpy = np.empty(n_ps, dtype=dtype)
    comm.Recv([states_numpy, MPI.CHAR], source=0, tag=7)
comm.Barrier()
t_buff = time() - t_buff

# Convert back to tuple (only on the ranks that hold states)
t_convert_back = time()
if rank <= 1:
    states = tuple(states_numpy)
t_convert_back = time() - t_convert_back

if rank == 0:
    print(f"{1000*t_unbuff = :.1f} ms")
    print(f"{1000*t_convert = :.1f} ms")
    print(f"{1000*t_buff = :.1f} ms")
    print(f"{1000*t_convert_back = :.1f} ms")
    print(f"{t_unbuff / t_buff :.1f}x speed-up using buffered communication")

Script output:

1000*t_unbuff = 66.2 ms
1000*t_convert = 112.2 ms
1000*t_buff = 20.5 ms
1000*t_convert_back = 141.6 ms
3.2x speed-up using buffered communication

It looks like the unbuffered communication is about 3x slower than the buffered one, and converting between a numpy array and a tuple takes more time than the MPI communication itself (on my 11-year-old laptop).

johanjoensson commented 3 months ago

I would not recommend using numpy arrays for storing Python bytes: in my experience numpy is clever and trims away trailing null bytes, which is absolutely not what we want. That caused me massive headaches for a long time. However, you can use bytearrays for receiving MPI.BYTE (that way I think you would also avoid weird endianness-related issues). Apart from using bytearrays instead of numpy arrays, your example is very close to how I send/receive states.
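
For example, an illustrative sketch of the trimming behaviour:

import numpy as np

# numpy's fixed-width bytes dtype silently strips trailing null bytes:
a = np.array([b"\x01\x00\x00"], dtype="|S3")
print(a[0], len(a[0]))  # b'\x01' 1 -- the two trailing null bytes are gone

# A bytearray keeps the exact length:
b = bytearray(b"\x01\x00\x00")
print(len(b))  # 3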

On my 1.5-year-old laptop, I get a similar speed-up.

JohanSchott commented 3 months ago

I have zero experience using numpy to store bytes, so thanks for the information.

Do you join all states into one bytearray with something like bytearray(b"".join(states))?

johanjoensson commented 3 months ago

I can usually get away with setting up a generator that yields the appropriate slices from the states array, something like (bytes(states[i:i+n_bytes]) for i in range(0, len(states), n_bytes)), or more_itertools.chunked if available.

I try to only use bytearray when receiving a bunch of MPI.BYTE, then chunk it up into the appropriate bytes objects when I need the actual product states.
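
Roughly along these lines (an illustrative round-trip, not the exact code from my fork):

import math

n_spin_orbitals = 200
n_bytes = math.ceil(n_spin_orbitals / 8)

# Three example product states, each already in bytes form.
states = [bytes([0xFF] * n_bytes) for _ in range(3)]

# Pack into one contiguous buffer, e.g. for a single Send of MPI.BYTE.
buf = bytearray(b"".join(states))

# On the receiving side, chunk the buffer back into individual states.
recovered = [bytes(buf[i:i + n_bytes]) for i in range(0, len(buf), n_bytes)]
assert recovered == states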

A very large change in my fork is that I have created a class for storing the many-body basis. This class is responsible for all the MPI distribution of product states; I use it to switch between vector/matrix representations of multiconfigurational states/operators, to expand the basis by repeated application of the Hamiltonian, etc.
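
A simplified sketch of the idea (hypothetical names, not the actual class from my fork):

from mpi4py import MPI


class ManyBodyBasis:
    """Owns the distributed set of product states and the MPI bookkeeping."""

    def __init__(self, comm, states):
        self.comm = comm
        # Round-robin distribution: each rank keeps only its share of the basis.
        self.local_states = states[comm.rank::comm.size]

    def size(self):
        """Total number of product states across all ranks."""
        return self.comm.allreduce(len(self.local_states), op=MPI.SUM)

    # Further responsibilities (sketched, not implemented here):
    # - map product states <-> vector/matrix indices
    # - expand the basis by repeated application of the Hamiltonian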

JohanSchott commented 3 months ago

Aha, ok.

Interesting with a class that stores the product states of the basis.