Alternate implementation of MaxConcurrentIO parameter

On the FLAMINGO 10k run I've been finding that the code is very slow unless all ranks are allowed to read at the same time. I think this might be because, if the system is busy and a few ranks suffer long delays, the others are forced to wait. The current implementation divides the MPI ranks into groups, and only one group at a time may read. None of the ranks in the next group can start until ALL ranks in the previous group have finished.
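To illustrate why one slow rank hurts so much under the grouped scheme, here is a minimal timing sketch. It is not the HBT+ code and contains no MPI: `grouped_read_time` is a hypothetical helper that takes a made-up per-rank read time and assumes groups are formed from consecutive ranks, which may not match how the real code assigns them.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of HBT+): estimated wall-clock time to read
// everything when ranks are split into consecutive groups of max_concurrent
// and each group must finish entirely before the next one starts.
double grouped_read_time(const std::vector<double>& read_time,
                         std::size_t max_concurrent) {
    // Guard against a zero step. Treating 0 as "no limit" is a choice made
    // for this sketch only.
    if (max_concurrent == 0) max_concurrent = read_time.size();

    double total = 0.0;
    for (std::size_t first = 0; first < read_time.size(); first += max_concurrent) {
        const std::size_t last = std::min(first + max_concurrent, read_time.size());
        // The whole group waits for its slowest member.
        total += *std::max_element(read_time.begin() + first,
                                   read_time.begin() + last);
    }
    return total;
}
```

With eight ranks that each take 1 time unit except one that takes 10, and a limit of 4, this gives 10 + 1 = 11: the three fast ranks in the first group sit idle for 9 units while their slots could have been used by the second group.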

This pull request modifies the code so that, as soon as any one rank finishes reading, another is immediately allowed to start. This is implemented by having the first rank that finishes reading become responsible for signalling the others to start.
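The effect on timing can be sketched with a small simulation. This models only the scheduling policy (a freed slot is handed to the next waiting rank straight away), not the MPI signalling described above. `rolling_read_time` is a hypothetical helper, the read times are made up, and it assumes ranks are admitted in rank order.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper (not part of HBT+): estimated wall-clock time when at
// most max_concurrent ranks read at once and a waiting rank starts as soon
// as any reading rank finishes.
double rolling_read_time(const std::vector<double>& read_time,
                         std::size_t max_concurrent) {
    // Treating 0 as "no limit" is a choice made for this sketch only.
    if (max_concurrent == 0) max_concurrent = read_time.size();

    // Min-heap of finish times of the ranks currently reading.
    std::priority_queue<double, std::vector<double>, std::greater<double>> finishing;
    double makespan = 0.0;
    for (double t : read_time) {
        double start = 0.0;
        if (finishing.size() >= max_concurrent) {
            // All slots are busy: wait for the earliest reader to finish.
            start = finishing.top();
            finishing.pop();
        }
        const double finish = start + t;
        finishing.push(finish);
        makespan = std::max(makespan, finish);
    }
    return makespan;
}
```

For eight ranks that each take 1 time unit except one that takes 10, with a limit of 4, the result is 10: the slow rank occupies one slot while the other three slots cycle through the remaining ranks, so the total is bounded by the slow rank alone rather than by the slow rank plus every later group.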

SWIFTSIM / HBTplus
