1QB-Information-Technologies / ccvm

Solve continuous non-convex optimization problems with Coherent Continuous-Variable Machine (CCVM) architectures and solvers
GNU Affero General Public License v3.0

The first evaluation of the einsum function takes longer #129

Open FarhadK-1QBit opened 10 months ago

FarhadK-1QBit commented 10 months ago

The first call to the einsum function in the calculate_boxqp_grad or calculate_drift_boxqp methods takes longer than subsequent calls. This is visible when multiple boxQP instances are solved in a single call to the solver. The same applies to the first call of the post-processing gradient descent method.
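For reference, the slowdown can be reproduced in isolation by timing repeated calls to the same einsum contraction used in the solver (the tensor sizes below are illustrative, not the ones from the package):

```python
import time
import torch

batch_size, problem_size = 100, 20  # illustrative sizes

x = torch.zeros(batch_size, problem_size)
q = torch.ones(problem_size, problem_size)

timings = []
for _ in range(5):
    start = time.perf_counter()
    torch.einsum("bi,ij -> bj", x, q)
    timings.append(time.perf_counter() - start)

# The first call typically dominates, due to one-time dispatch and
# initialization costs inside PyTorch.
print([f"{t * 1e6:.0f} us" for t in timings])
```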

Estella-Wang-1qbit commented 6 months ago

I've investigated this using torch.compile, with changes made on the bugfix/fix_einsum_evaluation_time branch. The modifications attempt to compile the einsum function before it is called in _calculate_drift_boxqp inside the DL solver.
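A minimal sketch of that approach is below; the standalone function name here is hypothetical, and the contraction is the one used for the boxQP drift. Compilation is wrapped in a try/except because the inductor backend can be unavailable (e.g., the missing-OpenMP situation on macOS described later in this thread):

```python
import torch

def drift_einsum(x, q):
    # The same contraction pattern used in _calculate_drift_boxqp
    return torch.einsum("bi,ij -> bj", x, q)

x = torch.zeros(4, 8)
q = torch.ones(8, 8)

try:
    # Requires torch >= 2.0; actual compilation happens lazily on first call
    compiled_drift = torch.compile(drift_einsum)
    out = compiled_drift(x, q)
except Exception:
    # Fall back to eager mode if the compile backend is unavailable
    out = drift_einsum(x, q)

print(out.shape)
```

Note that torch.compile itself does not remove the cost of the first call; it moves it into a (typically longer) one-time compilation step, which is consistent with the result reported below.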

To test the code:

  1. Run python ccvm_boxqp_dl.py inside the examples directory.
  2. You'll see a series of print statements showing the time taken by the einsum function. Pay attention to the first timing and compare it to the rest to see whether it is taking longer than usual.

Result:

In my experiment, the first run takes significantly longer than subsequent runs. This indicates that the changes I've made so far haven't resolved the issue, and further investigation is required.

Things to note:

To use torch.compile, you need Torch version 2.0 or above (torch >= 2.0). I updated requirements.txt accordingly, but I still encountered the following error when running the code on my MacBook running macOS:

OpenMP support not found. Please try one of the following solutions:
(1) Set the `CXX` environment variable to a compiler other than Apple clang++/g++ that has built-in OpenMP support;
(2) install OpenMP via conda: `conda install llvm-openmp`;
(3) install libomp via brew: `brew install libomp`;
(4) manually setup OpenMP and set the `OMP_PREFIX` environment variable to point to a path with `include/omp.h` under it.

To resolve this, I opted for option 3, which took over an hour to install the relevant dependencies. Afterwards, I was able to run torch.compile.

Estella-Wang-1qbit commented 6 months ago

Alternative workaround: if accurate timing matters for measuring and benchmarking, and the slow first run is a problem, consider adding a dummy call to the einsum function before the actual (real) call.

Implementation details:

For example, in the DL solver, insert the following code before starting the timer in the __call__ function:

        # Warm-up: run a dummy einsum before starting the timer
        my_q_scaled = torch.ones(problem_size, problem_size).to(device)
        my_c_scaled = torch.ones(problem_size).to(device)
        _ = torch.einsum("bi,ij -> bj", torch.zeros(batch_size, problem_size).to(device), my_q_scaled) + my_c_scaled

However, note that this workaround still requires further investigation; in some cases, such as when running python ccvm_boxqp_dl.py inside the examples directory, it may not work as expected.
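One way to check whether the dummy call actually helps on a given machine is to time the same einsum once before and once after a warm-up call (sizes below are illustrative):

```python
import time
import torch

def timed_einsum(batch_size, problem_size):
    # Time a single einsum of the kind used in the DL solver
    x = torch.zeros(batch_size, problem_size)
    q = torch.ones(problem_size, problem_size)
    start = time.perf_counter()
    torch.einsum("bi,ij -> bj", x, q)
    return time.perf_counter() - start

warmup = timed_einsum(100, 20)  # dummy call, absorbs one-time costs
real = timed_einsum(100, 20)    # the call you actually want to measure

# On most setups the warmed-up call is faster, but verify on your own
# machine before relying on this for benchmarking.
print(f"warm-up: {warmup * 1e6:.0f} us, real: {real * 1e6:.0f} us")
```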