QuEST-Kit / QuEST

A multithreaded, distributed, GPU-accelerated simulator of quantum computers
https://quest.qtechtheory.org/
MIT License

Can Quest run on Apple M1 ? #301

Open keithyau opened 2 years ago

keithyau commented 2 years ago

Wondering whether QuEST can be built with LLVM/Clang on Apple M1.

TysonRayJones commented 2 years ago

Hi there,

I don't have an M1 handy to test, but certainly there's nothing special in the QuEST architecture to preclude it. I would confidently assume that serial QuEST is supported by whatever the M1 compiling chain is.

For multithreading, QuEST supports OpenMP versions 2.0 (in develop; the master branch temporarily requires 3.1) through 5.0 (the latest). It has not yet been tested with 5.1, but is expected to be compatible. Mature releases of Clang support OpenMP (e.g. OpenMP 4.5 in Clang 13). If the M1 compiler toolchain fully supports Clang, then I expect QuEST to compile fine.

But one never knows until they test!

keithyau commented 2 years ago

thank you !

mmoelle1 commented 2 years ago

Hi there,

I tried compiling QuEST on an M1 and it works. However, it needs some modification of the CMakeLists.txt file.

Original (same for C++ compiler):

# TODO standardize
# set C compiler flags based on compiler type
if ("${CMAKE_C_COMPILER_ID}" STREQUAL "Clang")
  # using Clang
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} \
    -mavx -Wall"
  )
elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "GNU")
  # using GCC
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} \
    -mavx -Wall"
  )
elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "Intel")
  # using Intel
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} \
    -fprotect-parens -Wall -xAVX -axCORE-AVX2 -diag-disable cpu-dispatch"
  )
elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "MSVC")
  # using Visual Studio
  string(REGEX REPLACE "/W3" "" CMAKE_C_FLAGS ${CMAKE_C_FLAGS})
  string(REGEX REPLACE "-W3" "" CMAKE_C_FLAGS ${CMAKE_C_FLAGS})
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} \
    -w"
  )
endif()

Apple's default compiler reports itself as AppleClang, so, by accident, the -mavx flag (which does not work on M1) is never set. However, when you install a true GCC (e.g. using Homebrew), the above detects a GNU compiler and sets -mavx, which leads to a compiler error. The same problem happens on any non-x86_64 architecture (ARM/ARM64, PPC).

As a quick fix, I'd suggest wrapping the entire if()...endif() block in if (CMAKE_SYSTEM_PROCESSOR MATCHES "(x86)|(X86)|(amd64)|(AMD64)") ... endif(), which will disable it for any non-x86_64 architecture.

TysonRayJones commented 2 years ago

Hi Matthias, That's really useful to know, thanks very much! I've been meaning to test whether QuEST can meaningfully utilise auto-vectorisation for a while, so I'll add that to my backlog and update the build afterward (or remove the flag entirely). @rrmeister who has a better understanding of the CMake build may also be interested. Thanks again!

ekapit commented 2 years ago

I just got a new M1 Max laptop and am trying out QuEST on it. Naively, it should be extremely fast: this CPU has 10 cores and 200+ GB/s of usable memory bandwidth, higher than most Xeons, and since memory bandwidth is the primary bottleneck, it should be very quick. I was able to get Apple clang to link to OpenMP correctly, so it is multithreaded. However, when I try it out, it ends up being much slower than on Intel chips. I tried setting "march=apple-m1" as a compiler flag to make sure it's compiling native code, but that didn't seem to change anything. I strongly suspect this is a compiler issue, though I'm not sure what to try next.

Has anyone gotten QuEST to perform well on Apple Silicon?

TysonRayJones commented 2 years ago

Hi ekapit,

Hmm that's quite puzzling. I've created a very simple MWE below which modifies a complex array much like QuEST's backend CPU code.

Let's first test if your laptop is performing as expected for a serial simulation. Can you copy the code below into a file (e.g. github_issue.c), and compile it serially using -O3 optimisation, and whatever additional arguments you need to target M1?

On my 13-inch MacBook, I compiled via

clang github_issue.c -O3 -o test

using clang-10. It ran (./test) in 12s.

In what time does your M1 laptop run?

MWE


/* compile as...
 *  serial:
 *      clang github_issue.c -O3 -o test
 *  multithreaded:
 *      clang github_issue.c -O3 -fopenmp -o test
 *
 * run as...
 *     export OMP_NUM_THREADS=1
 *     ./test
 *
 * Memory cost = 16 * 2^numQb (bytes)
 *      20 qubits = 16 MiB
 *      28 qubits = 4 GiB
 *
 * Serial simulation of 28 qubits on my 13-inch Macbook Pro,
 * compiled with clang-1000.10.44.2:
 *      12.133904 (s)
 */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex.h>
#include <sys/time.h>

#ifdef _OPENMP
#include <omp.h>
#endif

#define START_TIMING() \
    struct timeval tval_before, tval_after, tval_result; \
    gettimeofday(&tval_before, NULL);

#define STOP_TIMING() \
    gettimeofday(&tval_after, NULL); \
    timersub(&tval_after, &tval_before, &tval_result); \
    printf("%ld.%06ld (s)\n", \
        (long int) tval_result.tv_sec, \
        (long int) tval_result.tv_usec);

typedef long long unsigned int INDEX;

typedef double complex AMP;

void applyGate(AMP* amps, int t, int numQb) {

    const double fac = 1/sqrt(2);
    const INDEX iNum = (1ULL << numQb) >> 1;

#ifdef _OPENMP
#pragma omp parallel \
    default  (none) \
    shared   (amps,t,numQb, fac,iNum) \
    private  (i,j,j0k,j1k,a1,a2)
#endif
    {
#ifdef _OPENMP
#pragma omp for schedule (static)
#endif
        for (INDEX i=0; i<iNum; i++) {

            // |0>|i> -> |j>|0>|k>, |j>|1>|k>
            INDEX j = (i >> t) << t;
            INDEX j0k = (j << 1ULL) ^ (i - j);
            INDEX j1k = j0k ^ (1ULL << t);

            AMP a1 = amps[j0k];
            AMP a2 = amps[j1k];
            amps[j0k] = fac*a1 + fac*a2;
            amps[j1k] = fac*a1 - fac*a2;
        }
    }
}

int main() {

    int numQb = 28;

    INDEX numAmp = (1ULL<<numQb);
    AMP* amps = malloc(numAmp * sizeof *amps);
    for (INDEX i=0; i<numAmp; i++)
        amps[i] = 1./(i+1) + 2.*I/(i+1);  // i+1 avoids dividing by zero at i=0

    START_TIMING()

    for (int t=0; t<numQb; t++)
        applyGate(amps, t, numQb);

    STOP_TIMING()

    free(amps);
    return 0;
}

mmoelle1 commented 2 years ago

Hi Tyson,

I tried your code on my Apple M1 (MacBook Pro), not an M1 Max or Pro like the OP's.

Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: arm64-apple-darwin21.3.0

Serial

❯ clang github_issue.c -O3 -o test
8.559273 (s)

OpenMP

❯ clang github_issue.c -O3 -openmp -o test
7.743996 (s) OMP_NUM_THREADS=1
4.227490 (s) OMP_NUM_THREADS=2
4.195969 (s) OMP_NUM_THREADS=4
4.211792 (s) OMP_NUM_THREADS=8

GCC 11.2.0.3 (from Homebrew)

Serial

7.596629 (s)

OpenMP

7.835089 (s) OMP_NUM_THREADS=1
5.348674 (s) OMP_NUM_THREADS=2
5.083343 (s) OMP_NUM_THREADS=4
5.096947 (s) OMP_NUM_THREADS=8

For GCC the line private (i,j,j0k,j1k,a1,a2) needs to be removed.

TysonRayJones commented 2 years ago

Thanks very much Matthias! (and oops regarding GCC; I forgot we have to pre-declare our OpenMP variables there like filthy animals).

Those are encouraging times, which to me confirm ekapit's performance issues are indeed related to build parameters, as we discussed above. Or maybe we're comparing to some very impressive Intel chips! :)

fieldofnodes commented 1 year ago

Hi, I have an M1 Max MacBook Pro, and I just linked #346 to this issue because I cannot get QuEST to build (make) for testing.

TysonRayJones commented 3 weeks ago

Confirming QuEST v4 (due for release mid-September) runs fine on an M1 Mac (which is now my main development machine!), with a naive build. We'll make sure our revised CMake build avoids the above issues.